Thoughts on long-term stable kernels.

I remembered something that I found eye-opening while interviewing/phone-screening over the last few months.

The number of companies that base their systems (especially those that don’t actually distribute their kernels outside the company) on the long-term stable releases of Linux caught me by surprise. We dismissed the idea of basing on long-term stable releases in Fedora after giving it a try circa Fedora 14, and it generally being a disaster because the bugs being fixed didn’t match up much to the bugs our users were seeing. We found that we got more bugs we cared about being fixed by sticking to the rolling model that Fedora is known for.

After discussing this with several potential employers, I now have a different perspective on this, and can see the appeal if you have a limited use-case, and only care about a small subset of hardware.

The general purpose “one size fits all” model that distribution kernels have to fit is a much bigger problem to solve, and with the feedback loop of “stable release -> bug -> report upstream -> fixed in mainline -> backport to next stable release” being so long, it’s not really a surprise that just having users be closer to bleeding edge gets a higher volume of bugs fixed (though at the cost of all the shiny new bugs that get introduced along the way) unless you have a RHEL-like small army of developers backporting fixes.

Finally, nearly everyone I talked to who uses long-term stable was also carrying some ‘extras’, such as updated drivers for hardware they cared about (in some cases, out of tree variants that aren’t even synced with linux-next yet).

It’s a complicated mess out there. “We run linux stable” doesn’t necessarily tell you the whole picture.

the new job reveal.

I let the cat out of the bag earlier this afternoon on twitter. Next Monday is day one of my new job at Akamai. The scope of my new role is pretty wide, ranging from the usual kernel debugging type work, to helping stabilizing production releases, proactively finding new bugs, misc QA work, development of some new tooling and a whole bunch of other stuff I can’t talk about just yet. (And probably a whole slew of things I don’t even know about yet).

It seemed almost serendipitous that I’ve ended up here. Earlier this year, I read Fatal System Error, a book detailing how Prolexic got founded. It’s a fascinating story beginning with DDoS’s of offshore gambling sites and ending with Russian organised crime syndicates. I don’t know how much of it got embellished for the book, but it’s a good read all the same. Also, Amazon usually has used copies for 1 cent. Anyway, a month after I read that book, Akamai acquired Prolexic.

Shortly afterwards, I found myself reading another Akamai related book, No Better Time, the autobiography of the late founder of Akamai, Danny Lewin. After reading it, I decided it was at least going to be worth interviewing there.

The job search led me a few possibilities, and the final decision to go with Akamai wasn’t an easy one to make. The combination of interesting work, an “easy to commute to” office (I’ll be at the Kendall square office in Cambridge,MA) and a small team that seemed easy to get along with (famous last words) is what decided it. (That, and all the dubstep).

It’s going to be an interesting challenge ahead to switch from the mindset of “bug that affects a single computer, or a handful of users” to “something that could bring down a 200,000 node cluster”, but I think I’m up for it. One thing I am definitely looking forward to is only caring about contemporary hardware, and a limited number of platforms.

I apologize in advance for any unexpected internet outages. I swear it was like that when I found it.

Some closure on a particularly nasty bug.

For the final three months of my tenure at Red Hat, I was chasing what was possibly the most frustrating bug I’d encountered since I had started work there. I had been fixing up various bugs in Trinity over the tail end of summer that meant on a good kernel, it would run and run for quite some time. I still didn’t figure out exactly what the cause of a self-corruption was, but narrowed it down enough that I could come up with a workaround. With this workaround in place, every so often, the kernels NMI watchdog would decide that a process had wedged, and eventually the box would grind to a halt. Just to make it doubly annoying, the hang would happen at seeming random intervals. Sometimes I could trigger it within an hour, sometimes it would take 24 hours.

I spent a lot of time trying to narrow down exactly the circumstances that would trigger this, without much luck. A lot of time was wasted trying to bisect where this was introduced, based upon bad assumptions that earlier kernels were ‘good’. During all this, a lot of theories were being thrown around, and people started looking at multiple areas of the kernel. The watchdog code, the scheduler, FPU context saving, the page fault handling, and time management. Along the way several suspect areas were highlighted, some things got fixed/cleaned up, but ultimately, they didn’t solve the problem. (A number of other suspect areas of code were highlighted that don’t have commits yet).

Then, right down to the final week before I gave all my hardware back to Red Hat, Linus managed to reproduce similar symptoms, by scribbling directly to the HPET. He came up with a hack that at least made the kernel survive for him. When I tried the same patch, the machine ran for three days before I interrupted it. The longest it had ever run.

The question remains, what was scribbling over the HPET in my case ? The /dev/hpet node doesn’t allow writing, even as root. You can mmap /dev/mem if you know the address of the HPET, and directly write to it, but..
1. That would be a root-only possibility, and..
2. Trinity blacklists /dev/mem, and never touches it.

The only two plausible scenarios I could think of were

  • Trinity generated 0xfed000f0 as a random address, and passed that to a syscall which wrote to it. This seems pretty unlikely, and hopefully the kernel has sufficient access_ok() checks on addresses passed in from userspace. Just to be sure, I had hardwired trinity to pass in that address, and couldn’t reproduce the bug.
  • A hardware bug.
    I’m actually starting to believe this may be the case. When trinity drives the CPU load up past a certain threshold, for whatever reason, the HPET stops ticking and corrupts itself. It still seems a bit “out there”, but is more believable than the other theory at least. An interesting data point showed up when googling for the DMI string of the affected machine. Someone else had seen ‘random lockups’ that looked very similar a year earlier. The associated bugzilla had a few more traces.

So that’s where the story (mostly) ends. When I left Red Hat, I gave that (possibly flawed) machine back. Linus’ hacky workaround didn’t get committed, but him & John Stultz continue to back & forth on hardening the clock management code in the face of screwed up hardware, so maybe soon we’ll see something real get committed there.

It was an interesting (though downright annoying) bug that took a lot longer to get any kind of closure on than expected. Some things I learned from this experience:

  • Keep better notes.
    Every week that passed, I had wished I wrote down what I had done the week before. With everything else going on in my life over the last few months, I neglected to document things as well as I could have, and only had old emails to fall back on. Not every bug drags on for months like this, but when you over-optimistically think a bug is going to be solved in a few days, you tend to not bother taking as extensive notes on what has been tried so far.
  • Google for the DMI string of the affected hardware pretty early on.
    That might have given us some clues a lot sooner as to what was going on. Or maybe not, but still – more data.
  • The more people looked into this bug, the more “this doesn’t look right” code was found. There’s never just one bug.

continuity of various projects

I’ve had a bunch of people emailing me asking how my new job will affect various things I’ve worked on over the last few years. For the most part, not massively, but there are some changes ahead.

  • Trinity.
    This will proceed pretty much as it has over the last year, perhaps with a little more focus on various areas of the kernel than it has in the past. (Not being intentionally elusive, I’m not entirely sure myself yet).
    From discussions I’ve had so far, there may even be a spin-off into a separate tool, we’ll see.
    • I’ve been wanting to get back to the network related code for some time, and that’s probably going to happen soon-ish.
    • There is still work that could be done with the VM/FS code I’ve worked on over the last year and a half, but it’s already finding bugs, so is “good enough” for a while.
    • There are also still a few lingering bugs that I need to one day sit down and figure out, but they happen so infrequently that I’ve not found the time so far.
    • Finally, there is also a bunch of other feature work that needs fleshing out that I don’t see myself getting to any time soon, that I’ll dump in the TODO over the next week.
  • Upstream testing.
    Things like my daily running of various stress tools against Linus’ master branch with debug builds won’t happen to the level they were previously. The good news is that instead I’ll be doing a lot more testing on various stable branches, which I never did a whole lot of in the past. I expect over time as I get a better feel for my workload I might be able to ramp back up master testing somewhat, but it will be a secondary thing that I do in addition to everything else. Right now, I’m not really doing any of this, so other people running things like trinity against 3.19/3.20 would probably be a good idea if you like collecting crashes, at least until I find my feet again.
  • Coverity scans.
    • I’ll keep doing the scan.coverity runs hopefully once per -rc. I’ve lapsed a little right now, but will pick this back up again soon to get things back up to current. It takes a lot longer to run a whole scan from home now that I don’t have access to my 24-thread Nehalem (was: < 1 hours for compile, pack & upload to coverity, now: ~4 hours just for the compile), so I'll automate these to run overnight when Linus makes a new snapshot, at least until I put together something a little faster than my ~7 year old core duo.
    • I’m not sure I’m going to have a huge amount of time for triage work yet, so we’ll see how that works out. I might have to focus just on certain areas.
    • There will be some element of related work in my new job, but I’m not sure to what degree yet, more info on that as I figure it out.
  • Fedora.
    The biggest change of all.
    • I’m just not going to have to maintain packages, read mail etc for Fedora, so those all got orphaned yesterday.
    • Josh & Justin pretty much handled all of the Fedora kernel work for the last year or so, so me walking away is not going to make a huge difference there.
    • I might still occasionally take a peek at Fedora bugzilla to see if there’s anything similar to a particular bug, but don’t expect to be doing triage work.
    • I’ll still keep a Fedora box or two at home for a while, but work-wise, I’m expecting a lot more Debian in my life. It’s been over a decade since I last used it seriously. That should prove to be fun.
  • Conferences etc.
    I’ll hopefully be seeing some familiar faces again later this year, and possibly meeting some new ones. The only real change here will be the lack of fudcon/flock/whatever it’s called this week.

So that’s about it. For at least the first few months of 2015, I expect to be absorbed in getting acclimated my new job, so I won’t be as visible as usual, but I’m not going to disappear from the Linux community forever, which was something I made clear wasn’t something I wanted to happen at everywhere I interviewed.

Moving on from Red Hat.

After eleven and a half years, today is my final day at Red Hat.
I’ll write more about what comes next in the new year.

In the meantime, here’s a slightly edited version of a mail I sent internally yesterday.


In 2003, I got an email from Michael Johnson, about a secretive new thing Red Hat was working on called "Fedora". No-one was quite sure what it was going to be (some may argue we're still figuring it out), but he was pretty sure I'd want to be a part of it. "How'd you feel about taking care of _any_ kernel problems that come in for this thing?" he asked. I was terrified, but excited at the opportunities to learn a lot of stuff outside my usual areas of expertise.

With barely any real detail as to what I was signing up for, I jumped at the opportunity. Within my first few months, I had some concerns over whether or not I had made a good decision. Then Michael left for rPath, and I seriously started to have my doubts.

While everyone was figuring out what Fedora was going to be, I was thrown in at the deep end. "Here's Red Hat Linux 7, 8 and 9, you maintain the kernel for those now. Go". I remember looking at bugzilla scrolling through page after page of bugs thinking "This is going to be a nightmare" At the same time, RHEL 3 was really starting to take shape. I looked at what the guys working on RHEL were doing and thought "Well, this sucks, but those guys.. they _really_ have work to do". As much as I was buried alive in work, I relished every moment of it, learning as much as I could in what little spare time I had.

Then Fedora finally happened. For those not around back then, Fedora Core 1 was pretty much what Red Hat Linux 10 would have been from a kernel pov. A nasty hairball of patches that weren't going upstream (execshield! 4g4g! Tux! CIPE!) that even their authors had stopped maintaining, and a bunch of features backported from 2.5 to 2.4. I get the shakes when I think back to the horrors of maintaining that mess, but like the horrors of RHL before it, it was an amazing learning experience (mostly "what not to do").

But for all its warts, Fedora gained traction, and after Fedora 2 moved to a 2.6 kernel, things really started to take shape. As Fedora's community started to grow, things got even busier in bugzilla than RHL had ever been.

Then somehow I got talked into also being RHEL4 kernel maintainer for a while.
It turned out that juggling Fedora 3, Fedora 4, Rawhide, RHEL4 GA, and RHEL4 U1 means you don't get a lot of time to sleep. So after finding another sucker to deal with the RHEL work, I moved back to just doing Fedora work, and in another big turning point, we started to slowly grow out the Fedora kernel team.

Over the years that followed, the only thing that remained constant was the inflow of bugs. At any given time we had a thousand or so bugs open, with at best 3 people, at worst 1 person working on them. I'm incredibly proud of what we've managed to achieve with the Fedora kernel. More than just the base for RHEL, it changed the whole landscape of upstream kernel development.

  • Our insistence on shipping the latest code, with as few 'special sauce' patches won over a lot of upstream developers that wouldn't have given us the time of day for similar bugs back in the RHL days. Sometimes painful for our users, but Linux as a whole got better because of our stance here.
  • Decisions like Fedora enabling debug options by default in betas shook out an unbelievable number of bugs almost as soon as they get introduced. Again, painful for users, but from a quality standpoint, we found a ton of bugs in code others were racing to ship first and call "enterprise ready".
  • Fedora enabling features sometimes before they were fully baked got us a lot of love from their respective upstream maintainers.

Despite this progress though, I always felt we were on a treadmill making no real forward progress. That constant 1000 or so bugs kept nagging at me. As fast as we closed them out, a new batch would arrive.

In more recent years, we tried to split the workload within the team so we could do more proactive bug-finding before users even find them. My own 'trinity' project has found so many serious bugs (filesystem corruptors, root holes, vm corner cases, the list goes on) that it got to be almost a full time job just tracking everything.

I used to feel that leaving Red Hat wasn't something I could do. On a few occasions I actually turned down offers from potential employers, because "What about the Fedora kernel?". For the first time since the project has begun I feel like I've left things in more than capable hands, and I'm sure things will continue to move in the right direction.

3 RHL's. 5 and a half RHEL's. 21 Fedoras. You don't even want to know how much hardware I've destroyed in the line of duty in this time. It's been uh, an experience.

So, after all this time, one thing I have learned, is that all this was definitely one of my better decisions. I hope that my next decision turns out to be an equally good one.

Thoughts on crashdumps.

Linux has what appears to be a useful feature that can be enabled to diagnose tricky kernel bugs. The feature is called kdump. A crashdump mechanism that uses kexec to switch to a different kernel, before writing out memory to disk, nfs, wherever. It’s a pretty neat idea.

Unfortunately, I have _never_ seen it working when I needed it.
I know it’s possible, because some of my co-workers swear by crashdumps for diagnosing tricky RHEL bugs. Someone every single RHEL release invests the time to fix up a bunch of bugs and get it into a working state again. But because Fedora is constantly moving, it’s near constantly broken in some non-trivial way.

We even have a wiki page telling Fedora users how to enable it. In honesty every time in the past I’ve told a user to try it, I’ve thought to myself “yeah, that isn’t going to work”, and my record for being correct in that regard is pretty damn good. If after 15+[1] years of kernel debugging, _I_ can’t get this thing to work, what hope does the average end-user have ?

In a recent meeting at the office, one of my coworkers enthused about how “it’s so much better now, it just works”. So I thought I’d give it a try again the last few weeks. In that time, I have ended up with a total of zero crash dumps, and I-lost-count-how-many kdump bugs.

Why is it so fragile ? I don’t have a good answer. It tends to have the worst possible failure modes. It’s hard to diagnose bugs that either lock up the machine entirely, or instantly reboot it. When you’re trying to debug something, and then it turns out you need to debug the debugging mechanism, most people probably think “I don’t have time for this shit”, and try alternative avenues of debugging, adding “FIND OUT WHY KDUMP IS FUCKED AGAIN” somewhere near the bottom of their TODO list.

At one point I thought “Maybe I’m just unlucky with hardware choices”[2], but the problems seem to be universal across every machine I’ve tried it on.

No doubt it “works” for some people, in certain circumstances, but this kind of feature has to be reliable at least most of the time to make it even worth trying.

I wish this post had a happy ending where I unveiled some solution to this problem[3], but after needing to travel to a machine that wedged itself after it had crashed for the Nth time this weekend, I’m kind of over kdump.
Sometimes it’s easier just to say “Don’t even bother” and do something entirely different.

[1] Oh god what have I done with my life.
[2] There are no good choices when it comes to computer hardware.
[3] Coming in a future post: Why pstore is the solution to this, and why it’s also completely awful.

Trinity and pages of random data.

Something trinity uses a lot, are pages of random data. They get passed around to syscalls, ioctls, whatever. 5 years ago, before I’d even added multiple children to trinity, this was done using ‘page_rand’. A single page allocated on startup, that was passed around, and scribbled over by anyone who needed something to scribble over.

After the VM work I did earlier this year, where we recycle successful calls to mmap, and inherit them across children, quite a few places started passing around map structs instead. This was good, because it started shaking out the many many kernel bugs that we had lingering in huge page support.

It kind of sucked that we had two sets of routines for doing things like “get a page”, “dirty a page” etc which were fundamentally the same operations, except one set worked on a pointer, and one on a struct. It also sucked that the page_rand code was actually buggy in a number of ways, which showed up as overruns.

Over time, I’ve been trying to move all the code that used page_rand to using mappings instead. Today I finished that work, and ripped out the last vestiges of page_rand support. The only real remnants of the supporting code was some of the dirtying code. We used to have separate ‘dirty page_rand’ and ‘dirty an mmap’ routines. After todays work, there’s now a single set of functions for mappings. There’s still a bunch more consolidation and cleanup to do, which I’ll get fixed up and merged over the next week.

The only feature that’s now missing is periodic dirtying of mappings. We did this every 100 syscalls for page_rand. Right now we only dirty mmap’s after a mmap() call succeeds, or on an mremap(). I plan on getting this done tomorrow.

The motivation for ripping out all this code, and unifying a lot of the support code is that a lot of code paths get simpler, and more importantly, the code in place now takes ‘len’ arguments, so we’re in a better position to make sure we’re not passing buffers that are too small when we do random syscalls.

In other news: while I was happy to report a few days ago that 3.18rc1 fixed up the btrfs bug that had been bothering me for a while, I’ve now managed to discover two new btrfs bugs [1]. [2]. Grumble.

Trinity updates

Over a month ago, I posted about some pthreads work I was experimenting with in Trinity, and how that wasn’t really working out. After taking a short vacation, I came back with no real epiphanies, and decided to back-burner that work for now, and instead refocus on fixing up some other annoying problems that I’d stumbled across while doing that experimenting. Some of these problems were actually long-standing bugs in trinity. So that’s pretty much all I’ve been working on for the last month, and I’m now pretty happy with how long it runs for (providing you don’t hit a kernel bug first).

The primary motivation was to fix a problem where trinity’s internal data structures would get corrupted. After a series of debugging patches, I found a number of places where a child process would overrun a buffer it had allocated.

First up: the code that takes syscalls arguments and renders them into a human-readable string. In some cases this would write huge strings past the end of the buffer. One example of this was the instance where trinity would generate a random pathname. It would sometimes generate complete garbage, which was fine until it came to printing it out. Fixed by deleting lots of code in the pathname generator. Stressing the negative dentry case was never that interesting anyway. After fixing up a few other cases in the argument generator I looked at the code that performs rendering to buffers. None of this code took length parameters, or took into account the remaining space in the buffers. Fairly quick rewrite took care of that.

After these bugs were fixed trinity would (on a good kernel) run for a really long time without incident. With longer runtimes, a few more obscure corner cases turned up.

There were 2-3 cases where the watchdog process would hang waiting for a condition that would never be met (due to losing track of how many running child processes there were). I’m still not happy that this can even occur but it is at least a little less likely to hang when it happens now. I’ll investigate the actual cause for this later.

Another fun watchdog bug: we keep track of the time stamp a child performed its last syscall at, and check to make sure 1 second later that it has increased by some small amount. To make sure we haven’t corrupted our own state, there’s also a sanity check that we haven’t jumped into the future. But we also have to compensate for the possibility that adjtimex was the random syscall we did. That takes a maximum offset of 2145. The code checked for that but forgot to also add the one second since the last time we checked.

There’s been a bunch of small 1-2 fixes like this lately, but I’m sitting on a larger set of changes that I’ll start to trickle into git next week, which moves towards cleaning up the “create a random page to pass to syscalls” code, which has been another fun source of corruption bugs.

In kernel news: The only interesting bugs this week that Trinity has shown up, have been two ext4 bugs. Diagnosing those has pointed out some more enhancements that are needed to the post-mortem code in trinity. Once I’ve cleared the current backlog of patches, I’ll work on adding better tracking of fd’s in the logging code. In other news, the btrfs bug trinity hit in August is now fixed in 3.17+ git.

Trinity threading improvements and misc

Since my blogging tsunami almost a month ago, I’ve been pretty quiet. The reason being that I’ve been heads down working on some new features for trinity which have turned out to be a lot more involved than I initially anticipated.

Trinity does all of its work in child processes continually forked off from a main process. For a long time I’ve had “investigate using pthreads” as a TODO item, but after various conversations at kernel summit, I decided to bump the priority of that up a little, and spend some time looking at it. I initially guessed that it would have take maybe a few weeks to have something usable, but after spending some time working on it, every time I make progress on one issue, it becomes apparent that there’s something else that is also going to need changing.

I’m taking a week off next week to clear my head and hopefully return to this work with fresh eyes, and make more progress, because so far it’s been mostly frustrating, and there may be an easier way to solve some of the problems I’ve been hitting. Sidenote: In the 15+ years I’ve been working on Linux, this is the first time I recall actually ever using pthreads in my own code. I can’t say I’ve been missing out.

Unrelated to that work, a month or so ago I came up with a band-aid fix for a problem where trinity would corrupt its own structures. That ‘fix’ turned out to break the post-mortem work I implemented a few months prior, so I’ve spent some time this week undoing that, and thinking about how I’m going to fix that properly. But before coming up with a fix, I needed to reproduce the problem reliably, and naturally now that I’ve added debug code to determine where the corruption is coming from, the bug has gone into hiding.

I need this vacation.

A breakdown of Linux kernel networking related issues from Coverity scan

For the last of these breakdowns, I’ll focus on fifth place: networking.

Linux supports many different network protocols, so I spent quite a while splitting the net/ tree into per-protocol components. The result looks like this.

Net-802 8
Net-Bluetooth 15
Net-CAIF 9
Net-Core 11
Net-DCCP 5
Net-DECNET 5
Net-IRDA 17
Net-NFC 11
Net-SCTP 18
Net-SunRPC 21
Net-Wireless 9
Net-XFRM 6
Net-bridge 14
Net-ipv4 24
Net-ipv6 16
Net-mac80211 12
Net-sched 5
everything else 124

The networking code has gotten noticably better over the last year. When I initially introduced these components they were all well into double figures. Now, even crap like DECNET has gotten better (both users will be very happy).

“Everything else” above is actually a screw-up on my part. For some reason around 50 or so netfilter issues haven’t been categorized into their appropriate component. The remaining ~70 are quite a mix, but nearly all small numbers of issues in many components.Things like 9p, atm, ax25, batman, can, ceph, l2tp, rds, rxrpc, tipc, vmwsock, and x25. The Lovecraftian protocols you only ever read about.

So networking is in pretty good shape considering just how much stuff it supports. While there’s 24 issues in a common protocol like ipv4, they tend to be mostly benign things rather than OMG 24 WAYS THE NSA IS OWNING YOUR LINUX RIGHT NOW.

That’s the last of these breakdowns I’ll do for now. I’ll do this again maybe in six months to a year, if things are dramatically different, but I expect any changes to be minor and incremental rather than anything too surprising.

After I get back from kernel summit and recover from travelling, I’ll start a series of posts showing code examples of the top checkers.