linked list debugging

After looking through all the general protection bugs yesterday, I spent some time today trying to correlate the various reports we have of the linked list debugging being triggered.

I added this code upstream in 2006. It gets called whenever a node is added to or removed from a linked list, and it checks that the prev->next and next->prev pointers around the node point to the right places. It trips up surprisingly often, so much so that we leave it permanently enabled in Fedora. The overhead is negligible, and it finds genuine bugs.
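As a rough userspace sketch of the kind of check described above (this is not the upstream code; the function names are made up for illustration, and the messages only imitate the style of the reports below), the idea is to validate the neighbouring pointers before touching them:

```c
#include <stdbool.h>
#include <stdio.h>

/* Minimal doubly linked list node, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical helper: before linking a node between prev and next,
 * check that prev and next still point at each other. */
static bool sketch_list_add_valid(struct list_head *prev, struct list_head *next)
{
    if (next->prev != prev) {
        fprintf(stderr, "list_add corruption. next->prev should be prev (%p), but was %p. (next=%p).\n",
                (void *)prev, (void *)next->prev, (void *)next);
        return false;
    }
    if (prev->next != next) {
        fprintf(stderr, "list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
                (void *)next, (void *)prev->next, (void *)prev);
        return false;
    }
    return true;
}

/* Hypothetical helper: before unlinking entry, check that both
 * neighbours still point back at it. */
static bool sketch_list_del_valid(struct list_head *entry)
{
    if (entry->prev->next != entry) {
        fprintf(stderr, "list_del corruption. prev->next should be %p, but was %p\n",
                (void *)entry, (void *)entry->prev->next);
        return false;
    }
    if (entry->next->prev != entry) {
        fprintf(stderr, "list_del corruption. next->prev should be %p, but was %p\n",
                (void *)entry, (void *)entry->next->prev);
        return false;
    }
    return true;
}

/* Insert node right after head, refusing to touch a corrupted list. */
static bool checked_list_add(struct list_head *node, struct list_head *head)
{
    struct list_head *prev = head, *next = head->next;

    if (!sketch_list_add_valid(prev, next))
        return false;
    next->prev = node;
    node->next = next;
    node->prev = prev;
    prev->next = node;
    return true;
}

/* Remove entry from its list, refusing to touch a corrupted list. */
static bool checked_list_del(struct list_head *entry)
{
    if (!sketch_list_del_valid(entry))
        return false;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    return true;
}
```

Each check is only a couple of pointer comparisons per list operation, which fits with the overhead being negligible.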

Sifting through the current reports, a few things jumped out at me.

784741,799229 & 799692: list_del corruption. prev->next should be ffff8801c2f41b18, but was (null)
These all look to be the same bug: the inode writeback list got corrupted. Upstream discussion points at a commit that split the lock up as a potential cause. Investigation is ongoing.

787044: list_del corruption, ffff88013008e520->next is LIST_POISON1 (dead000000100100)
Like the previous bugs, this is in the inode eviction path. This looks to be unrelated, however, as it’s a different linked list (this time one private to jbd2).
Of note is that this happened on return from hibernate, where we know we have a number of memory corruption bugs. LIST_POISON showing up in objects on lists shouldn’t happen, because we take them off the list and fix up the neighbours’ pointers before we do the poisoning. Re-adding a poisoned object can’t explain it either, as list_add would overwrite the poisoned pointers with correct ones. So it’s not clear to me how this got into this state.
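To make the ordering argument concrete, here is a simplified userspace sketch (not the kernel source). The sketch_ function names are hypothetical, and the LIST_POISON2 value is an assumption, since only LIST_POISON1 appears in the report above. It also assumes 64-bit pointers.

```c
#include <stdbool.h>
#include <stdint.h>

struct list_head {
    struct list_head *next, *prev;
};

/* LIST_POISON1 is the value seen in the report; LIST_POISON2 is an
 * assumed companion value for the prev pointer. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000100100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000200200ULL)

/* Insert node right after head. Both of node's pointers are overwritten,
 * so any poison left over from an earlier delete disappears here. */
static void sketch_list_add(struct list_head *node, struct list_head *head)
{
    struct list_head *next = head->next;

    next->prev = node;
    node->next = next;
    node->prev = head;
    head->next = node;
}

/* Unlink entry: the neighbours are fixed up first, and only then is the
 * removed entry poisoned. Nothing left on the list points at poison. */
static void sketch_list_del(struct list_head *entry)
{
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/* True if either pointer of the entry holds a poison value. */
static bool is_poisoned(const struct list_head *entry)
{
    return entry->next == LIST_POISON1 || entry->prev == LIST_POISON2;
}
```

With this ordering, poison should only ever be visible through a stale pointer to an already removed entry, which is why seeing it in an object reached by walking the list is so odd.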

722723: list_del corruption. prev->next should be ffffea0003f8d4f0, but was dead000000100100
Another case of the same “dead object still on the list” problem, but in a different code path: this time the VM LRU page list.
The next Fedora 15/16 updates have some additional VM debugging enabled, which will hopefully yield some clues to this sort of problem.

760642: list_add corruption. prev->next should be next (ffff880036835630), but was ffff880036875630. (prev=ffff880036835630).
769576: list_del corruption. next->prev should be ffff8801b068a6e0, but was fdff8801b068a6e0
788280: list_del corruption. prev->next should be ffff88003b0431a8, but was ffff8800390431a8
In all three of these cases, a single bit in the address is different from what was expected. This is usually indicative of a hardware fault of some kind (bad memory, poor cooling, insufficient power, etc.). It’s possible that some software randomly set or cleared a bit in memory, but random memory corruption usually happens at byte granularity or larger.
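One quick way to spot this pattern is to XOR the expected and actual values and count the bits that remain set. The helper below is a small illustrative sketch (the function name is made up), not something taken from the kernel:

```c
#include <stdint.h>

/* Count how many bits differ between the pointer value we expected
 * and the one we actually found. */
static int bits_different(uint64_t expected, uint64_t actual)
{
    uint64_t diff = expected ^ actual;
    int count = 0;

    /* Clear the lowest set bit on each pass until none remain. */
    while (diff) {
        diff &= diff - 1;
        count++;
    }
    return count;
}
```

For example, bits_different(0xffff880036835630ULL, 0xffff880036875630ULL) returns 1 for the values in 760642. A result of 1 suggests a flipped bit, while a large result points at something overwriting whole bytes.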

796109: list_del corruption. next->prev should be ffff88007f411168, but was 00ffffff00ffffff
This one jumped out at me because it reminded me of bug 746482 from yesterday. It’s a completely different bug, but the pattern looks similar: it looks like something wrote a zero byte to every fourth byte. It’s almost as if some code somewhere has its pointer arithmetic wrong and, while zeroing an array, stepped by sizeof(int) instead of sizeof(char), scribbling past the end of it. Finding out what’s doing that writing, however, is the hard part.
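Purely to illustrate that guess (nobody has found the offending code, so both functions here are invented), this is what such a stride mistake could look like. It assumes sizeof(int) is 4, as it is on x86-64.

```c
#include <stddef.h>

/* What the code presumably meant to do: zero count consecutive bytes. */
static void clear_bytes(unsigned char *buf, size_t count)
{
    for (size_t i = 0; i < count; i++)
        buf[i] = 0;
}

/* Hypothetical bug: the same number of single-byte writes, but the
 * index is scaled by sizeof(int). Only every fourth byte is zeroed,
 * and the writes reach four times further than the array is long. */
static void clear_bytes_buggy(unsigned char *buf, size_t count)
{
    for (size_t i = 0; i < count; i++)
        buf[i * sizeof(int)] = 0;
}
```

If the memory after the array happened to hold 0xff bytes, the buggy version would leave a repeating pattern of one zero byte and three ff bytes behind it. Which byte of each word ends up zeroed depends on how the start of the array lines up with the victim.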

As with all commonly reported bugs, there’s a bunch of other list_debug reports that don’t fit any particular profile.