Weekly Fedora kernel bug statistics – May 11th 2012

                          F15  F16  F17  rawhide  (total)
Open:                     311  583   76      141   (1111)
Opened since 2012-05-04     3   37   14        0     (54)
Closed since 2012-05-04     3   43   12        1     (59)
Changed since 2012-05-04   10   61   15        4     (90)

The F16 bug count is out of control at this point. We’re still managing to close bugs faster than they’re coming in, but the sheer number still open is overwhelming. A lot of older bugs may already have been fixed without the reporter updating the report. As we come across bugs that have sat in needinfo state for a while, we’re closing them out if we have a reasonable expectation that the bug is fixed.

There’s also still a huge number of duplicate bugs, where it’s not immediately obvious that they are duplicates. The end frame in the call chain may be the same, for example, but how we got there differs. Or we see multiple instances of a particular trace from various hardware. (A good example is the numerous open irqpoll bugs, for which we still have no real explanation. Likewise the softlockup reports.)

A curious observation: We’ve had a slight uptick lately in the number of bugs that seem to be caused by bad memory or other hardware problems. Sometimes these stick out like a sore thumb (a single bit flip changing an address from ffffffff81c8bca0 to ffefffff81c8bca0, for example). Other times, they aren’t as obvious.
I have been wondering though if we see more of these in Fedora just due to the extra debug options we leave enabled even in the production builds (like linked list debug, which is pretty much free).
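
As a rough illustration of how the single-bit case above can be spotted, XORing the expected and observed addresses leaves exactly one bit set. This is only a sketch: the function names are made up for this example and are not part of any Fedora or kernel tooling.

```python
def flipped_bits(expected: int, observed: int) -> list[int]:
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = expected ^ observed
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def looks_like_single_bit_flip(expected: int, observed: int) -> bool:
    """True when the two values differ in exactly one bit."""
    return len(flipped_bits(expected, observed)) == 1


# The two addresses quoted in the post.
good = 0xffffffff81c8bca0
bad = 0xffefffff81c8bca0

print(flipped_bits(good, bad))                # [52]
print(looks_like_single_bit_flip(good, bad))  # True
```

A value that differs from a plausible kernel address in just one bit is a strong hint of flaky RAM rather than a software bug, although it is not proof on its own.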

Something that has been interesting, though, when we’ve found a user with a bad RAM report, is searching for other bugs that user filed. Suddenly “weird” oopses make a lot more sense.

Weekly Fedora kernel bug statistics – April 13 2012

                          F15  F16  F17  rawhide  (total)
Open:                     313  549   63      141   (1066)
Opened since 2012-04-06     1   33   11        2
Closed since 2012-04-06    33   43    7        1
Changed since 2012-04-06   48   67   19        5

Slowly putting a dent in the f15/f16 backlog, now that we have 3.3.1 updates available for both with the i915 memory corruption fix. As can be seen from the ‘closed’ links above, lots of the “weird shit happened” bugs are now closed out. Still a huge number of outstanding bugs though.

Note for Fedora 15 users: Kernel updates are languishing so long in updates-testing that they keep getting obsoleted by new builds before the old one receives enough Karma.

resolution on the i915/hibernate memory corruption bug

It seems Dave Airlie managed to track down and fix the bug that has been plaguing hibernate/i915 users for over a year.
My recent posts here showed some reports of memory corruption where we occasionally saw a strip of eight pixels. Sometimes 0x00000000, and sometimes 0x00aaaaaa.
In hindsight, it seems obvious. It’s a framebuffer cursor.
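
For anyone wanting to hunt for this signature themselves, here is a minimal sketch that scans a raw memory dump for runs of identical 32-bit words. The function name, the little-endian assumption (x86) and the idea of having the dump as a `bytes` object are all assumptions made for this example, not the method anyone actually used on these bugs.

```python
import struct


def find_word_runs(data: bytes, min_run: int = 8, byteorder: str = "<"):
    """Return (byte_offset, word, run_length) for each run of identical
    32-bit words that is at least min_run words long."""
    count = len(data) // 4
    words = struct.unpack(f"{byteorder}{count}I", data[:count * 4])
    runs = []
    start = 0
    for i in range(1, count + 1):
        # A run ends at the end of the data or when the word value changes.
        if i == count or words[i] != words[start]:
            if i - start >= min_run:
                runs.append((start * 4, words[start], i - start))
            start = i
    return runs


# Synthetic dump: 16 unrelated bytes, a strip of eight 0x00aaaaaa words,
# then 16 more unrelated bytes.
dump = bytes(range(16)) + struct.pack("<8I", *[0x00aaaaaa] * 8) + bytes(range(16, 32))

for offset, word, length in find_word_runs(dump):
    print(f"offset {offset}: {length} x {word:#010x}")
```

Long runs of zeroes are common in ordinary memory, so in practice only hits sitting in the middle of an otherwise sane structure are interesting.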

https://bugzilla.kernel.org/attachment.cgi?id=72739 should fix it.

There are still a whole bunch of other hibernate bugs that need fixing, but this at least is a huge step forward.

Weekly Fedora kernel bug statistics – March 23 2012

                          F15  F16  F17  rawhide  (total)
Open:                     346  505   49      142   (1042)
Closed since 2012-03-16     5   68   11        5
Changed since 2012-03-16   13  407   21        4

This week saw a lot of activity in Fedora 16. We pushed out a rebase to 3.3. Yesterday I did a mass-update to bugzilla requesting people retest. (Which screwed up when I got a bugzilla proxy error, causing everything to be posted 3 times. Apologies to those who got all those mails).

The good news is we seem to be closing a lot of bugs quickly from this rebase.
Around 50 or so bugs got closed just since yesterday.
As with any rebase, some new bugs are showing up, but some of the more common ones (like the bluetooth oopses) I think we’ve got a handle on, and will get fixed in an update out next week.

Next week, 3.3.1 should also land, and that will likely coincide with Fedora 15 also getting a rebase to this kernel. It’s been a while since we’ve had three releases all on the same version. (I think last time was F12/F13/F14).

Despite all this activity, the overall bug count still remains very high. We still have no real answer for all the memory corruption problems caused by hibernate (which accounts for dozens of open bugs now, plus probably a bunch that we haven’t yet attributed to this problem).

Weekly Fedora kernel bug statistics – March 16 2012

                          F15  F16  F17  rawhide  (total)
Open:                     349  509   31      144   (1033)
Closed since 2012-03-09     2   53   16        4
Changed since 2012-03-09   13  115    9        4

It’s pretty clear to see that f16 has been seeing most of the attention this week. (As it has been for most of the month).

Overall bug count is creeping up, but some of the long-standing bugs, like the various corruption bugs mentioned in last week’s posts, are slowly edging their way to closure. Andrea Arcangeli fixed a bug in the transparent huge page code which explains a bunch of the page table corruption issues we saw reported. The i915/hibernate/memory corruption bug has at least some theories on a potential solution now, and we’ve been able to attribute another bunch of the ‘weird’ bugs to post-hibernate corruption now that we know what patterns to look for.

With 3.3 likely being released next week, we’ll be moving F16 to it (and F15 a week later). Hopefully we get to close out a bunch more of the longer standing bugs that haven’t been backported to -stable.

i915, hibernate, and memory corruption.

While going through some of the stranger unexplained bugs last week, I noticed that the hex pattern 0x00aaaaaa showed up in a few of them. Sometimes, these patterns aren’t obvious when you don’t know what you’re looking for.

After a discussion with Keith Packard about the ongoing memory corruption problems we’re seeing with i915 & hibernation, he pointed out that 0x00aaaaaa is likely a strip of grey pixels in ARGB format. Given that the pattern repeated through memory 8 times, it’s even more likely to be a strip of pixels.
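
To make Keith’s point concrete, here is a small sketch that splits a 32-bit value into its ARGB channels, assuming the usual 8-bits-per-channel layout with alpha in the top byte. The helper names are invented for this example.

```python
def decode_argb(word: int) -> tuple[int, int, int, int]:
    """Split a 32-bit ARGB value into (alpha, red, green, blue) bytes."""
    return (
        (word >> 24) & 0xff,  # alpha
        (word >> 16) & 0xff,  # red
        (word >> 8) & 0xff,   # green
        word & 0xff,          # blue
    )


def is_grey(word: int) -> bool:
    """A pixel is a shade of grey when red, green and blue are equal."""
    _, red, green, blue = decode_argb(word)
    return red == green == blue


print(decode_argb(0x00aaaaaa))  # (0, 170, 170, 170)
print(is_grey(0x00aaaaaa))      # True
```

Equal red, green and blue values of 0xaa (170) give a mid-light grey, which is why the value reads as a pixel rather than as a pointer or a counter.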

The question still remains as to why these pixels are being scribbled into random memory. Keith’s current theory is that the GTT contains stale entries from before the hibernate, which on resume don’t point to the same memory they did before. Sounds plausible, and would explain a whole bunch of weird bugs that we’ve seen over the last 9 months or so.

Petr Tesarik did some debugging on probably the same problem in June 2011.
There are also numerous other instances of this corruption reported on other mailing lists and bug trackers.