LSF/MM 2015 recap.

It’s been a long week.
Spent Monday/Tuesday at LSFMM. This year it was in Boston, which was convenient in that I didn’t have to travel anywhere, but less convenient in that I had to get up early and do a rush-hour commute to get to the conference location in time. At least the weather got considerably better this week compared to the frankly stupid amount of snow we’ve had over the last month.
LWN did their usual great write-up which covers everything that was talked about in a lot more detail than my feeble mind can remember.

A lot of things from last year’s event still seem to be getting a lot of discussion. SMR drives & persistent memory being the obvious stand-outs. Lots of discussion surrounding various things related to huge pages (so much so that one session overran and replaced a slot I was supposed to share with Sasha, not that I complained. It was interesting stuff, and I learned a few new reasons to dislike the way we handle hugepages & forking), and I lost track of how many times the GFP_NOFAIL discussion came up.

In a passing comment in one session, one of the people Intel sent (Dave Hansen iirc) mentioned that Intel are now shipping an 18-core/36-thread CPU. A bargain at just $4642. Especially when compared to this madness.

A few days before the event, I had been asked if I wanted to do a “how Akamai uses Linux” type talk at LSFMM, akin to what Chris Mason did re: facebook at last year’s event. I declined, given I’m still trying to figure that out myself. Perhaps another time.

Wednesday/Thursday, I attended Vault at the same location.
My take-aways:

  • There were generally a lot more positive vibes around btrfs this year. Even with Josef playing bad cop to Chris’ good cop talk, things generally seemed to be moving away from “everything is awful” toward “this actually works…”, though with the qualifier “.. for facebook’s workload”. Josef did touch on one area where btrfs does still suck, which apparently is database workloads (iirc, due to the copy-on-write nature of btrfs). The spurious ENOSPC failures of the past should hopefully stay in the past. Things generally on the up and up. (Though that does include the linecount, which has now passed 100KLOC, more than double that of XFS or ext*. Scary).
  • Equally positive vibes surrounding XFS. We celebrated the 20-year anniversary at one evening event, making us all feel just that little bit more like an old fart club. Interesting talk toward the end by Dave Chinner about the future of XFS, and how the current surge of development in XFS is probably its last, for various scaling reasons, as disks continue to get bigger and bigger. Predicting the future is always hard, but if what Dave said holds true, things will start to get ‘interesting’ in about 5 years’ time, given every other filesystem we support in Linux has the same issues (or worse).
  • People still care a lot about NFS. Especially pNFS. Surprising amount of activity still happening.
  • Even when I worked there, I never really got Red Hat’s “big picture” wrt the several distributed filesystems they supported. Now that I’m not there, I feel even more out of the loop. “ceph is the way forward” “except when it’s glusterfs” or something. Oh, and GFS2 is still a thing apparently, for some reason.
  • As entertaining as Jeremy Allison might be, don’t go to a talk on Samba internals unless you work on it (in which case it’s too late for you). The horrors will likely keep you up at night.
  • Ted’s ext4 talk drew a decent crowd. As fancy as btrfs/xfs etc might be, a *lot* of people still give a crap about extN. Somehow I missed the addition of the ‘lazytime’ option to ext4. Seems neat. Played with it (and also the super-secret ‘dioread_nolock’ mount option); there’s a small sketch of what lazytime looks like from userspace right after this list. Saw another talk on orphan list scalability in ext4, which was interesting, but didn’t draw as big a crowd.
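For anyone who hasn’t played with lazytime: roughly speaking, timestamp updates stay in the in-memory inode and only hit the disk when the inode gets written out for some other reason, on sync, or when they age out. Here’s a minimal sketch of turning it on via mount(2). The device and mountpoint names are just placeholders, and MS_LAZYTIME needs a 4.0+ kernel to do anything:

```c
/* Minimal sketch (placeholder names, not from the post): mount an ext4
 * filesystem with lazytime via mount(2). Timestamp updates then live
 * in memory and are written back lazily. */
#include <stdio.h>
#include <sys/mount.h>

#ifndef MS_LAZYTIME
#define MS_LAZYTIME (1 << 25)	/* only honoured by 4.0+ kernels */
#endif

int main(void)
{
	if (mount("/dev/sdb1", "/mnt/scratch", "ext4", MS_LAZYTIME, NULL)) {
		perror("mount");
		return 1;
	}
	printf("mounted with lazytime\n");
	return 0;
}
```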

I got asked “What are you doing at Akamai ?” a lot. (answer right now: trying to bring some coherence to our multiple test infrastructures).
Second most popular question: “What are you going to do after that ?”. (answer: unknown, but likely something more related to digging into networking problems rather than fighting shell scripts, perl and Makefiles).

All that, plus a lot of hallway conversations, long lunches, and evening activities that went on possibly a little later than they should have, led to me almost losing my voice today.
Really good use of time though. I had fun, and it’s always good to catch up with various people.

Trinity 1.5 release.

As announced this morning, today I decided that things had slowed down enough (to an almost standstill of late) that it was worth making a tarball release of Trinity, to wrap up everything that’s gone in over the last year.

The email linked above covers most of the major changes, but a lot of the change over the last year has actually been groundwork for those features. Things like..

  • The post-mortem dumper needed the generation of the text and the writing to log files to be decoupled, which wasn’t particularly trivial.
  • Some features involved considerable rewrites. The fd generators are now pretty much isolated from each other, making adding a new one a simple task (there’s a rough sketch of the idea after this list).
  • Handling of the mapping structs got a lot of cleanup (though there is definitely still a lot of room for improvement there, especially when we do things like splitting a mapping).
  • I should also mention the countless hours spent chasing down quite a few hard-to-reproduce bugs that are fixed in 1.5.
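To illustrate what I mean by isolated generators, here’s the shape of the idea. This is deliberately *not* Trinity’s actual code; the struct and names below are made up for illustration. Each generator is a self-contained table entry with its own callback, so adding a new source of fds is just another entry:

```c
/* Toy illustration only -- not Trinity's real interface. Each fd
 * "generator" knows nothing about the others; adding a new one means
 * writing a callback and adding a line to the table. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct fd_generator {
	const char *name;
	int (*open_fd)(void);		/* hand back one fd, or -1 on failure */
};

static int open_devnull(void)
{
	return open("/dev/null", O_RDWR);
}

static int open_tmpfile(void)
{
	char path[] = "/tmp/fd-demo-XXXXXX";
	int fd = mkstemp(path);
	if (fd >= 0)
		unlink(path);		/* keep the fd, drop the name */
	return fd;
}

static const struct fd_generator generators[] = {
	{ "devnull", open_devnull },
	{ "tmpfile", open_tmpfile },
	/* a new generator is just another entry here */
};

int main(void)
{
	/* pick a generator "at random", as a fuzzer would */
	size_t i = (size_t)getpid() % (sizeof(generators) / sizeof(generators[0]));
	int fd = generators[i].open_fd();

	printf("generator '%s' gave fd %d\n", generators[i].name, fd);
	if (fd >= 0)
		close(fd);
	return 0;
}
```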

As I mentioned in the announcement, I don’t see myself having a huge amount of time to work on Trinity for at least this year. I’ve had a number of people email me asking the status of some feature. Hopefully this demarcation point will answer the question.

So, it’s not abandoned, it just won’t be seeing the volume of change it has over the last few years. I expect my personal involvement will be limited to merging patches, and updating the syscall lists when new syscalls get added.

Trinity used to be on roughly a six month release schedule. We’ll see if by the end of the year there’s enough input from other people to justify doing a 1.6 release.

I’m also hopeful that time working on other projects means I’ll come back to this at some point with fresh eyes. There are a number of features I wanted to implement that needed a lot more thought. Perhaps working on some other things for a while will give me the perspective necessary to realize those features.

backup solutions.

For the longest time, my backup solution has been a series of rsync scripts that have evolved over time into a crufty mess. Having become spoiled on my Mac with Time Machine, I decided to look into something better that didn’t involve a huge time investment on my part.

The general consensus seemed to be that for ready-to-use home-NAS type devices, the way to go was either Synology or Drobo. You just stick in some disks, and set up NFS/Samba etc with a bunch of mouse clicking. Perfect.

I had already decided I was going to roll with a 5-disk RAID6 setup, so I bit the bullet and laid down $1000 for a Synology 8-Bay DS1815+. It came *triple* boxed, unlike the handful of 3TB HGST drives.
I chose the HGSTs after reading Backblaze’s report on failure rates across several manufacturers, and figured that after the RAID6 overhead, 8TB would be more than enough for a long time, even at the rate I accumulate flac and wav files. Also, worst case, I still had 3 spare bays I could expand into later if needed.
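For anyone wondering where the “8TB” comes from, here’s the back-of-the-envelope arithmetic (mine, nothing official): RAID6 spends two disks’ worth of space on parity, so five 3TB disks leave 9TB raw, which is a bit over 8TiB once you stop counting in marketing terabytes.

```c
/* Back-of-the-envelope check: RAID6 keeps two disks' worth of parity,
 * so usable space is (disks - 2) * disk_size. */
#include <stdio.h>

int main(void)
{
	const int disks = 5;
	const double disk_bytes = 3e12;			/* 3TB, vendor terabytes */
	double usable = (disks - 2) * disk_bytes;	/* 9e12 bytes raw */
	double usable_tib = usable / (1ULL << 40);	/* convert to TiB */

	printf("usable: %.0f TB raw, %.1f TiB\n", usable / 1e12, usable_tib);
	/* prints: usable: 9 TB raw, 8.2 TiB */
	return 0;
}
```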

Installation was a breeze. The plastic drive caddies felt a little flimsy, but the drives were secure once in them, even if they did feel like they were going to snap as I flexed them to pop them into place. After putting in all the drives, I connected the four ethernet ports and powered it up.
After connecting to its web UI, it wanted to do a firmware update, like just about every internet connected device wants to do these days. It rebooted, and finally I could get on with setting things up.

On first logging into the device over ssh, I think the first command I typed was uname. Seeing a 3.2 kernel surprised me a little. I got nervous thinking about how many VFS, ext4, and MD bugfixes hadn’t made their way back to long-term stable, and got the creeps a little. I decided not to think too much about it, and put faith in the Synology people doing backports (though I never got as far as looking into their kernel package).

The web UI is pretty slick, though it felt a little sluggish at times. I set up my RAID6 volume with a bunch of clicks, and then listened as all those disks started clattering away. After creation, it wanted to do an initial parity scan. I set it going, and went to bed. The next morning before going to work, I checked on it, and noticed it wasn’t even 20% done. I left it going while I went into the office. I spent the night away from home, and so didn’t get back to it until another day later.

When I returned home, the volume was now ready, but I noticed the device was noticeably hotter to the touch than I remembered. I figured it had been hammering the disks non-stop for 24hrs, so go figure, and that it would probably cool off a little as it idled. As the device was now ready for exporting, I set up an NFS export, and then spent some time fighting uid mappings, as you do. The device does have the ability to deal with LDAP and some other stuff that I’ve never had time to set up, so I did things the hard way. Once I had the export mounted, I started my first rsync from my existing backups.

While it was running, I remembered I had intended to set up bonding. A little bit of clicky-clicky later, it was done, and transfers started getting even faster. Very nice. I set up two bonds, with a pair of NICs in each. Given my desktop only has a dual NIC, that was good enough. I figured having a second 2xGigE bond was nice in case I had multiple machines wanting to use it while I was doing a backup.

The backup was going to take a while, so I left it running.
A few hours later, I got back to it, and again, it was getting really hot. There are two pretty big fans in the back of the unit, and they were cranking out heat. Then things started getting really weird. I noticed that the rsync had hung. I ctrl-c’d it, and tried logging into the device as root. It took _minutes_ to get a command prompt. I typed top and waited. About two minutes later top started. Then it spontaneously rebooted.

When it came back up, I logged in, and poked around the log files, and didn’t see anything out of the ordinary.
I restarted the rsync, and left it alone for a while. About 20 minutes later, I came back to check on it again, and found that the box had just hung completely. The rsync was stalled, and I couldn’t ssh in. I rebooted the device, cursed a bit, and then decided to think about it for a while, so never restarted the rsync. I clicked around in the interface, to see if there was anything I could turn on/off that would perhaps give me some clues as to wtf was going on.
Then it rebooted spontaneously again.

It was about this time I was ready to throw the damn thing out the window. I bought this thing because I wanted a turn-key solution that ‘just worked’, and had quickly come to realize that with this device, when something went bad, I was pretty screwed. Sometimes “It runs Linux” just isn’t enough. For some people, the Synology might be a great solution, but it wasn’t for me. Reading some of the Amazon reviews, it seems there were a few people complaining about their units overheating, which might explain the random reboots I saw. For a device I wanted to leave switched on 24/7 and never think about, something that overheats (especially when I’m not at home) really doesn’t give me feel-good vibes. Some of the other reviews on Amazon rave about the DS1815+. It may be that there was a bad batch, and I got unlucky, but I felt burnt on the whole experience, and even if I had got a replacement, I don’t know if I would have felt like I could have trusted this thing with my data.

I ended up returning it to Amazon for a refund, and used the money to buy a motherboard, cpu, ram etc to build a dedicated backup computer. It might not have the fancy web ui, and it might mean I’ll still be using my crappy rsync scripts, but when things go wrong, I generally have a much better chance of fixing the problems.

Other surprises: at one point I opened the unit up to install an extra 4GB of RAM (it comes with just 2GB by default), and noticed that it runs off a single 250W power supply, which seemed surprising to me. I thought disks used considerably more power during spin-up, but apparently they’re pretty low power these days.

So, two weeks of wasted time, frustration, and failed experiments. Hopefully by next week I’ll have my replacement solution all set up and can move on to more interesting things instead of fighting appliances.

So much to learn..

At lunch a few days ago, I was discussing with a coworker how right now I’m feeling a little overwhelmed. Not entirely surprising, I suppose, given the volume of new things I need to learn. But as I’m getting acclimated in my new job, it’s becoming clearer which things I either don’t know well enough or am completely unfamiliar with.

I’m only now realizing that not only has the scope of what I’m working on changed, but also how I work on things. During the ten years I worked on Fedora, because we were woefully understaffed and buried alive in bugs, there was never really time to spend a lot of time on an individual bug, unless a lot of users were hitting it. Because of this, the things that I ended up fixing were usually fairly small fixes. The occasional NULL pointer dereference. Perhaps a use-after-free. Maybe linked-list corruption. Pretty basic stuff that anyone with familiarity with any part of the kernel could figure out reasonably quickly. Do enough of this, and it becomes less about engineering, and more about pattern recognition.

Even a lot of the bugs that trinity finds aren’t terribly invasive fixes.
(Screwed up error paths being probably the most common thing it picks up, because nothing else really ever tests them).

The more complicated bugs, like a WARN_ON being hit? Those could require a lot more understanding, and that understanding could take a considerable time to build up.

As Fedora kernels were so close to mainline, a lot of the time we could chase up the maintainer, pass the bug along, and move on to something else. The result of ten years of this way of working is that I have a passing understanding of how lots of parts of the kernel interact from a 10,000ft perspective, and perhaps some understanding of some nuances based on past interactions, but no deep architectural understanding of, for example, how an sk_buff traverses the various parts of the network stack. A warning deep in the TCP guts, caused by a packet that’s been through bonding, netfilter, etc? I’m not your guy. At least not today.

Thankfully I’ve got some time to ramp up my learning on unfamiliar parts of the kernel, and get a better understanding of how things work on a deeper level. I suspect I’ll end up turning at least some of the stuff I learn into future posts here.

In the meantime, I’ve got a lot of reading to do.
I felt a little reassured at least when my coworker responded “yeah, me too”.

Thoughts on long-term stable kernels.

I remembered something that I found eye-opening while interviewing/phone-screening over the last few months.

The number of companies that base their systems (especially those that don’t actually distribute their kernels outside the company) on the long-term stable releases of Linux caught me by surprise. We dismissed the idea of basing Fedora on long-term stable releases after giving it a try circa Fedora 14; it was generally a disaster because the bugs being fixed didn’t match up much with the bugs our users were seeing. We found that we got more of the bugs we cared about fixed by sticking to the rolling model that Fedora is known for.

After discussing this with several potential employers, I now have a different perspective on this, and can see the appeal if you have a limited use-case, and only care about a small subset of hardware.

The general purpose “one size fits all” model that distribution kernels have to fit is a much bigger problem to solve, and with the feedback loop of “stable release -> bug -> report upstream -> fixed in mainline -> backport to next stable release” being so long, it’s not really a surprise that just having users closer to the bleeding edge gets a higher volume of bugs fixed (though at the cost of all the shiny new bugs that get introduced along the way), unless you have a RHEL-like small army of developers backporting fixes.

Finally, nearly everyone I talked to who uses long-term stable was also carrying some ‘extras’, such as updated drivers for hardware they cared about (in some cases, out of tree variants that aren’t even synced with linux-next yet).

It’s a complicated mess out there. “We run linux stable” doesn’t necessarily tell you the whole picture.

the new job reveal.

I let the cat out of the bag earlier this afternoon on twitter. Next Monday is day one of my new job at Akamai. The scope of my new role is pretty wide, ranging from the usual kernel debugging type work, to helping stabilize production releases, proactively finding new bugs, misc QA work, development of some new tooling and a whole bunch of other stuff I can’t talk about just yet. (And probably a whole slew of things I don’t even know about yet).

It seemed almost serendipitous that I’ve ended up here. Earlier this year, I read Fatal System Error, a book detailing how Prolexic got founded. It’s a fascinating story beginning with DDoS’s of offshore gambling sites and ending with Russian organised crime syndicates. I don’t know how much of it got embellished for the book, but it’s a good read all the same. Also, Amazon usually has used copies for 1 cent. Anyway, a month after I read that book, Akamai acquired Prolexic.

Shortly afterwards, I found myself reading another Akamai related book, No Better Time, the biography of the late founder of Akamai, Danny Lewin. After reading it, I decided it was at least going to be worth interviewing there.

The job search led me to a few possibilities, and the final decision to go with Akamai wasn’t an easy one to make. The combination of interesting work, an “easy to commute to” office (I’ll be at the Kendall Square office in Cambridge, MA) and a small team that seemed easy to get along with (famous last words) is what decided it. (That, and all the dubstep).

It’s going to be an interesting challenge ahead to switch from the mindset of “bug that affects a single computer, or a handful of users” to “something that could bring down a 200,000 node cluster”, but I think I’m up for it. One thing I am definitely looking forward to is only caring about contemporary hardware, and a limited number of platforms.

I apologize in advance for any unexpected internet outages. I swear it was like that when I found it.

Some closure on a particularly nasty bug.

For the final three months of my tenure at Red Hat, I was chasing what was possibly the most frustrating bug I’d encountered since I started work there. I had been fixing up various bugs in Trinity over the tail end of summer, which meant that on a good kernel it would run and run for quite some time. I still hadn’t figured out exactly what the cause of a self-corruption was, but narrowed it down enough that I could come up with a workaround. With this workaround in place, every so often the kernel’s NMI watchdog would decide that a process had wedged, and eventually the box would grind to a halt. Just to make it doubly annoying, the hang would happen at seemingly random intervals. Sometimes I could trigger it within an hour, sometimes it would take 24 hours.

I spent a lot of time trying to narrow down exactly the circumstances that would trigger this, without much luck. A lot of time was wasted trying to bisect where this was introduced, based upon bad assumptions that earlier kernels were ‘good’. During all this, a lot of theories were being thrown around, and people started looking at multiple areas of the kernel. The watchdog code, the scheduler, FPU context saving, the page fault handling, and time management. Along the way several suspect areas were highlighted, some things got fixed/cleaned up, but ultimately, they didn’t solve the problem. (A number of other suspect areas of code were highlighted that don’t have commits yet).

Then, right down to the final week before I gave all my hardware back to Red Hat, Linus managed to reproduce similar symptoms, by scribbling directly to the HPET. He came up with a hack that at least made the kernel survive for him. When I tried the same patch, the machine ran for three days before I interrupted it. The longest it had ever run.

The question remains, what was scribbling over the HPET in my case ? The /dev/hpet node doesn’t allow writing, even as root. You can mmap /dev/mem if you know the address of the HPET, and directly write to it, but..
1. That would be a root-only possibility, and..
2. Trinity blacklists /dev/mem, and never touches it.
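For the curious, the /dev/mem scribble I’m talking about would look something like this. A hedged sketch, assuming the usual HPET base of 0xfed00000 with the main counter register at offset 0xf0 (which is where 0xfed000f0 comes from); it needs root, and whether the kernel even lets you map that region depends on its /dev/mem restrictions:

```c
/* Sketch only: map the HPET registers through /dev/mem and scribble
 * on the main counter. This is the kind of write being discussed,
 * not something Trinity (or anything sane) actually does. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPET_BASE	0xfed00000UL	/* usual HPET base on x86 */
#define HPET_MAIN_CNT	0xf0		/* main counter register offset */

int main(void)
{
	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}

	volatile uint8_t *hpet = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
				      MAP_SHARED, fd, HPET_BASE);
	if (hpet == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	/* Stomp on the main counter with a junk value. */
	*(volatile uint64_t *)(hpet + HPET_MAIN_CNT) = 0xdeadbeef;

	munmap((void *)hpet, 0x1000);
	close(fd);
	return 0;
}
```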

The only two plausible scenarios I could think of were:

  • Trinity generated 0xfed000f0 as a random address, and passed that to a syscall which wrote to it. This seems pretty unlikely, and hopefully the kernel has sufficient access_ok() checks on addresses passed in from userspace. Just to be sure, I had hardwired trinity to pass in that address, and couldn’t reproduce the bug.
  • A hardware bug.
    I’m actually starting to believe this may be the case. When trinity drives the CPU load up past a certain threshold, for whatever reason, the HPET stops ticking and corrupts itself. It still seems a bit “out there”, but is more believable than the other theory at least. An interesting data point showed up when googling for the DMI string of the affected machine. Someone else had seen ‘random lockups’ that looked very similar a year earlier. The associated bugzilla had a few more traces.

So that’s where the story (mostly) ends. When I left Red Hat, I gave that (possibly flawed) machine back. Linus’ hacky workaround didn’t get committed, but he & John Stultz continue to go back & forth on hardening the clock management code in the face of screwed up hardware, so maybe soon we’ll see something real get committed there.

It was an interesting (though downright annoying) bug that took a lot longer to get any kind of closure on than expected. Some things I learned from this experience:

  • Keep better notes.
    Every week that passed, I wished I had written down what I’d done the week before. With everything else going on in my life over the last few months, I neglected to document things as well as I could have, and only had old emails to fall back on. Not every bug drags on for months like this, but when you over-optimistically think a bug is going to be solved in a few days, you tend not to bother taking extensive notes on what has been tried so far.
  • Google for the DMI string of the affected hardware pretty early on.
    That might have given us some clues a lot sooner as to what was going on. Or maybe not, but still – more data.
  • The more people looked into this bug, the more “this doesn’t look right” code was found. There’s never just one bug.

continuity of various projects.

I’ve had a bunch of people emailing me asking how my new job will affect various things I’ve worked on over the last few years. For the most part, not massively, but there are some changes ahead.

  • Trinity.
    This will proceed pretty much as it has over the last year, perhaps with a little more focus on various areas of the kernel than it has had in the past. (Not being intentionally evasive; I’m not entirely sure myself yet).
    From discussions I’ve had so far, there may even be a spin-off into a separate tool, we’ll see.
    • I’ve been wanting to get back to the network related code for some time, and that’s probably going to happen soon-ish.
    • There is still work that could be done with the VM/FS code I’ve worked on over the last year and a half, but it’s already finding bugs, so is “good enough” for a while.
    • There are also still a few lingering bugs that I need to one day sit down and figure out, but they happen so infrequently that I’ve not found the time so far.
    • Finally, there is also a bunch of other feature work that needs fleshing out that I don’t see myself getting to any time soon, that I’ll dump in the TODO over the next week.
  • Upstream testing.
    Things like my daily running of various stress tools against Linus’ master branch with debug builds won’t happen to the level they were previously. The good news is that instead I’ll be doing a lot more testing on various stable branches, which I never did a whole lot of in the past. I expect over time as I get a better feel for my workload I might be able to ramp back up master testing somewhat, but it will be a secondary thing that I do in addition to everything else. Right now, I’m not really doing any of this, so other people running things like trinity against 3.19/3.20 would probably be a good idea if you like collecting crashes, at least until I find my feet again.
  • Coverity scans.
    • I’ll keep doing the scan.coverity runs hopefully once per -rc. I’ve lapsed a little right now, but will pick this back up again soon to get things back up to date. It takes a lot longer to run a whole scan from home now that I don’t have access to my 24-thread Nehalem (was: < 1 hour for compile, pack & upload to coverity; now: ~4 hours just for the compile), so I'll automate these to run overnight when Linus makes a new snapshot, at least until I put together something a little faster than my ~7 year old core duo.
    • I’m not sure I’m going to have a huge amount of time for triage work yet, so we’ll see how that works out. I might have to focus just on certain areas.
    • There will be some element of related work in my new job, but I’m not sure to what degree yet, more info on that as I figure it out.
  • Fedora.
    The biggest change of all.
    • I’m just not going to have time to maintain packages, read mail etc for Fedora, so those all got orphaned yesterday.
    • Josh & Justin pretty much handled all of the Fedora kernel work for the last year or so, so me walking away is not going to make a huge difference there.
    • I might still occasionally take a peek at Fedora bugzilla to see if there’s anything similar to a particular bug, but don’t expect to be doing triage work.
    • I’ll still keep a Fedora box or two at home for a while, but work-wise, I’m expecting a lot more Debian in my life. It’s been over a decade since I last used it seriously. That should prove to be fun.
  • Conferences etc.
    I’ll hopefully be seeing some familiar faces again later this year, and possibly meeting some new ones. The only real change here will be the lack of fudcon/flock/whatever it’s called this week.

So that’s about it. For at least the first few months of 2015, I expect to be absorbed in getting acclimated to my new job, so I won’t be as visible as usual, but I’m not going to disappear from the Linux community forever, which is something I made clear I didn’t want to happen at every place I interviewed.

Moving on from Red Hat.

After eleven and a half years, today is my final day at Red Hat.
I’ll write more about what comes next in the new year.

In the meantime, here’s a slightly edited version of a mail I sent internally yesterday.


In 2003, I got an email from Michael Johnson, about a secretive new thing Red Hat was working on called "Fedora". No-one was quite sure what it was going to be (some may argue we're still figuring it out), but he was pretty sure I'd want to be a part of it. "How'd you feel about taking care of _any_ kernel problems that come in for this thing?" he asked. I was terrified, but excited at the opportunities to learn a lot of stuff outside my usual areas of expertise.

With barely any real detail as to what I was signing up for, I jumped at the opportunity. Within my first few months, I had some concerns over whether or not I had made a good decision. Then Michael left for rPath, and I seriously started to have my doubts.

While everyone was figuring out what Fedora was going to be, I was thrown in at the deep end. "Here's Red Hat Linux 7, 8 and 9, you maintain the kernel for those now. Go". I remember looking at bugzilla, scrolling through page after page of bugs, thinking "This is going to be a nightmare". At the same time, RHEL 3 was really starting to take shape. I looked at what the guys working on RHEL were doing and thought "Well, this sucks, but those guys.. they _really_ have work to do". As much as I was buried alive in work, I relished every moment of it, learning as much as I could in what little spare time I had.

Then Fedora finally happened. For those not around back then, Fedora Core 1 was pretty much what Red Hat Linux 10 would have been from a kernel pov. A nasty hairball of patches that weren't going upstream (execshield! 4g4g! Tux! CIPE!), which even their authors had stopped maintaining, and a bunch of features backported from 2.5 to 2.4. I get the shakes when I think back to the horrors of maintaining that mess, but like the horrors of RHL before it, it was an amazing learning experience (mostly "what not to do").

But for all its warts, Fedora gained traction, and after Fedora 2 moved to a 2.6 kernel, things really started to take shape. As Fedora's community started to grow, things got even busier in bugzilla than RHL had ever been.

Then somehow I got talked into also being RHEL4 kernel maintainer for a while.
It turned out that juggling Fedora 3, Fedora 4, Rawhide, RHEL4 GA, and RHEL4 U1 meant you didn't get a lot of time to sleep. So after finding another sucker to deal with the RHEL work, I moved back to just doing Fedora work, and in another big turning point, we started to slowly grow out the Fedora kernel team.

Over the years that followed, the only thing that remained constant was the inflow of bugs. At any given time we had a thousand or so bugs open, with at best 3 people, at worst 1 person working on them. I'm incredibly proud of what we've managed to achieve with the Fedora kernel. More than just the base for RHEL, it changed the whole landscape of upstream kernel development.

  • Our insistence on shipping the latest code, with as few 'special sauce' patches as possible, won over a lot of upstream developers that wouldn't have given us the time of day for similar bugs back in the RHL days. Sometimes painful for our users, but Linux as a whole got better because of our stance here.
  • Decisions like Fedora enabling debug options by default in betas shook out an unbelievable number of bugs almost as soon as they got introduced. Again, painful for users, but from a quality standpoint, we found a ton of bugs in code others were racing to ship first and call "enterprise ready".
  • Fedora enabling features sometimes before they were fully baked got us a lot of love from their respective upstream maintainers.

Despite this progress though, I always felt we were on a treadmill making no real forward progress. That constant 1000 or so bugs kept nagging at me. As fast as we closed them out, a new batch would arrive.

In more recent years, we tried to split the workload within the team so we could do more proactive bug-finding before users even find them. My own 'trinity' project has found so many serious bugs (filesystem corruptors, root holes, vm corner cases, the list goes on) that it got to be almost a full time job just tracking everything.

I used to feel that leaving Red Hat wasn't something I could do. On a few occasions I actually turned down offers from potential employers, because "What about the Fedora kernel?". For the first time since the project began, I feel like I've left things in more than capable hands, and I'm sure things will continue to move in the right direction.

3 RHL's. 5 and a half RHEL's. 21 Fedoras. You don't even want to know how much hardware I've destroyed in the line of duty in this time. It's been uh, an experience.

So, after all this time, one thing I have learned is that all this was definitely one of my better decisions. I hope that my next decision turns out to be an equally good one.

Thoughts on crashdumps.

Linux has what appears to be a useful feature that can be enabled to diagnose tricky kernel bugs. The feature is called kdump: a crashdump mechanism that uses kexec to switch to a different kernel, before writing out memory to disk, nfs, wherever. It’s a pretty neat idea.
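(As an aside: the bare-minimum sanity check before trusting any of this is whether a crash kernel is even loaded. A minimal sketch, which just reads the standard /sys/kernel/kexec_crash_loaded knob:)

```c
/* Minimal sketch: report whether a crash (kdump) kernel is currently
 * loaded, by reading the standard sysfs knob. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/kexec_crash_loaded", "r");
	int loaded = 0;

	if (!f) {
		perror("kexec_crash_loaded");
		return 1;
	}
	if (fscanf(f, "%d", &loaded) != 1)
		loaded = 0;
	fclose(f);

	printf("crash kernel %s\n", loaded ? "loaded" : "NOT loaded");
	return loaded ? 0 : 1;
}
```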

Unfortunately, I have _never_ seen it working when I needed it.
I know it’s possible, because some of my co-workers swear by crashdumps for diagnosing tricky RHEL bugs. Every single RHEL release, someone invests the time to fix up a bunch of bugs and get it into a working state again. But because Fedora is constantly moving, it’s near constantly broken in some non-trivial way.

We even have a wiki page telling Fedora users how to enable it. In honesty, every time in the past I’ve told a user to try it, I’ve thought to myself “yeah, that isn’t going to work”, and my record for being correct in that regard is pretty damn good. If after 15+[1] years of kernel debugging _I_ can’t get this thing to work, what hope does the average end-user have ?

In a recent meeting at the office, one of my coworkers enthused about how “it’s so much better now, it just works”. So I thought I’d give it a try again the last few weeks. In that time, I have ended up with a total of zero crash dumps, and I-lost-count-how-many kdump bugs.

Why is it so fragile ? I don’t have a good answer. It tends to have the worst possible failure modes. It’s hard to diagnose bugs that either lock up the machine entirely, or instantly reboot it. When you’re trying to debug something, and then it turns out you need to debug the debugging mechanism, most people probably think “I don’t have time for this shit”, and try alternative avenues of debugging, adding “FIND OUT WHY KDUMP IS FUCKED AGAIN” somewhere near the bottom of their TODO list.

At one point I thought “Maybe I’m just unlucky with hardware choices”[2], but the problems seem to be universal across every machine I’ve tried it on.

No doubt it “works” for some people, in certain circumstances, but this kind of feature has to be reliable at least most of the time to make it even worth trying.

I wish this post had a happy ending where I unveiled some solution to this problem[3], but after needing to travel to a machine that wedged itself after it had crashed for the Nth time this weekend, I’m kind of over kdump.
Sometimes it’s easier just to say “Don’t even bother” and do something entirely different.

[1] Oh god what have I done with my life.
[2] There are no good choices when it comes to computer hardware.
[3] Coming in a future post: Why pstore is the solution to this, and why it’s also completely awful.