|Opened since 2014-05-01||9||112||20||(141)|
|Closed since 2014-05-01||26||68||17||(111)|
|Changed since 2014-05-01||49||293||31||(373)|
Found another hugepage VM bug.
This one is a VM_BUG_ON_PAGE(PageTail(page), page). The filemap.c bug has gone into hiding since last weekend. Annoying.
On the trinity list, Michael Ellerman reported a couple of bugs that looked like memory corruption since my changes yesterday to make fd registration more dynamic. So I started the day trying out -fsanitize=address, without much success (I had fixed a number of bugs recently using it). However, reading the gcc man page, I found out about -fsanitize=undefined, which I was previously unaware of. This is new in gcc 4.9, so it ships with Fedora rawhide, but my test box is running F20. Luckily, clang has supported it for even longer.
So I spent a while fixing up a bunch of silly bugs, like (1<<31) where it should have been (1UL<<31), and some not-so-obvious alignment bugs. A dozen or so fixes for bugs that had been there for years. Sadly though, nothing that explained the corruption Michael has been seeing. I spent some more time looking over yesterday’s changes without success. Annoying that I don’t see the corruption when I run it.
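To illustrate the (1<<31) class of bug, here is a minimal sketch. The helper name bit_mask is hypothetical and not taken from trinity; it only shows why the unsigned form is the safe one. In C, shifting the signed int constant 1 left by 31 places is undefined when int is 32 bits wide, and that is the kind of thing a build with -fsanitize=undefined reports at runtime.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: return a mask with only bit `n` set.
 *
 * Writing (1 << n) here would shift a signed int. For n == 31 with a
 * 32-bit int the result is not representable, which is undefined
 * behaviour in C. Using 1UL keeps the shift in an unsigned type. */
static unsigned long bit_mask(unsigned int n)
{
    /* Shifting by the full width of the type (or more) is also undefined. */
    assert(n < sizeof(unsigned long) * CHAR_BIT);
    return 1UL << n;
}
```

Something like bit_mask(31) then yields 0x80000000 regardless of whether long is 32 or 64 bits wide.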
Decided not to take on anything too ambitious for the rest of the day given I’m getting an early start on the long weekend, by taking off tomorrow. Hopefully when I return with fresh eyes on Tuesday I’ll have some idea of what I can do to reproduce the bug.
I was a little surprised last week when I did the Coverity scan on the 3.15-rc3 kernel, and 118 new issues appeared. Usually once we’re past -rc1, the number of new issues between builds is in low double figures. It turned out to be due to Coverity’s new checks that flag byte swaps as potential sources of tainted data.
There’s a lot to go through, but from what I’ve seen so far, there hasn’t been anything as scary as what happened with OpenSSL. Though my unfamiliarity with some of the drivers may mean they could turn out to be something that needs fixing even if they aren’t exploitable.
What is pretty neat though, is that the scanner doesn’t just detect calls to the usual byte-swapping routines, it even picks up on open-coded variants, such as..
n_blocks = (p_src << 8) | p_src;
The tricky part is determining, in various drivers, whether (i.e., in the example above) 'p_src' is actually something a user can control, or whether it is just something read back from hardware. Quite a few of them seem to be in things like video BIOS parsers, firmware loaders etc. Even handling of packets from a joystick involves byte-swapping.
Even though the hardware (should) limit the range of values these routines see, we do seem to do range checking in the kernel in most cases. The cases where we don't that I've seen so far, I'm not sure we care, as the code looks like it would still work correctly.
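As a rough sketch of the pattern being discussed, the following combines two bytes from a buffer with an open-coded swap and then range-checks the result before it is used. Everything here is hypothetical: the function name read_block_count, the MAX_BLOCKS limit, and the buffer layout are assumptions for illustration, not code from any of the drivers Coverity flagged.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed upper bound; a real driver would take this from the
 * hardware's documented limits. */
#define MAX_BLOCKS 64u

/* Hypothetical parser: read a 16-bit big-endian count from a buffer
 * that may be attacker-controlled, returning -1 if it is out of range. */
static int read_block_count(const unsigned char *buf, size_t len)
{
    unsigned int n_blocks;

    assert(buf != NULL);
    if (len < 2)
        return -1;      /* not enough data to hold the count */

    /* Open-coded byte swap: the sort of expression the new checks
     * treat as a potential source of tainted data. */
    n_blocks = ((unsigned int)buf[0] << 8) | buf[1];

    /* Range check before the value is used as a loop bound or index. */
    if (n_blocks > MAX_BLOCKS)
        return -1;

    return (int)n_blocks;
}
```

With a check like this in place, it matters much less whether the bytes came from well-behaved hardware or from something a user can influence.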
Somewhat surprising was how many of these new warnings turned up in mostly unmaintained or otherwise crappy parts of the kernel. ISDN. Staging wireless drivers. Megaraid. The list goes on..
Regardless of whether or not these new checks find bugs, re-reviewing some of this code can't really be a bad thing, as a lot of it probably never had a lot of review when it was merged 10-15 years ago.
The big thing that stands out this cycle is that the defect ratio was going down until we hit around 3.14-rc7, and then we got a few hundred new issues. What happened?
Nothing in the kernel, thankfully. This was due to a server-side upgrade to a new version of Coverity, which has some new checkers. Some of the existing ones got improved too, so a bunch of false positives we had sitting around in the database are no longer reported. Unfortunately, the number of new issues was greater than the number of known false positives that went away. In the days following, I did a first sweep through these and closed out the easy ones, bringing the defect density back down.
note: I stopped logging the ‘dismissed’ totals. With Coverity 7.0, the number can go backwards.
If a file gets deleted, the dismissed issues against that file also disappear.
Given this happens fairly frequently, the number isn’t really indicative of anything useful.
With the 3.15 merge window now open, I’m hoping a bunch of the queued fixes I sent over the last few weeks get merged, but I’m fully expecting to need to do some resending.
It was actually worse than this: the ratio went back up to 0.57 right before rc7.