Over a month ago, I posted about some pthreads work I was experimenting with in Trinity, and how that wasn’t really working out. After taking a short vacation, I came back with no real epiphanies, and decided to back-burner that work for now, and instead refocus on fixing up some other annoying problems that I’d stumbled across while doing that experimenting. Some of these problems were actually long-standing bugs in trinity. So that’s pretty much all I’ve been working on for the last month, and I’m now pretty happy with how long it runs for (providing you don’t hit a kernel bug first).
The primary motivation was to fix a problem where trinity’s internal data structures would get corrupted. After a series of debugging patches, I found a number of places where a child process would overrun a buffer it had allocated.
First up: the code that takes syscalls arguments and renders them into a human-readable string. In some cases this would write huge strings past the end of the buffer. One example of this was the instance where trinity would generate a random pathname. It would sometimes generate complete garbage, which was fine until it came to printing it out. Fixed by deleting lots of code in the pathname generator. Stressing the negative dentry case was never that interesting anyway. After fixing up a few other cases in the argument generator I looked at the code that performs rendering to buffers. None of this code took length parameters, or took into account the remaining space in the buffers. Fairly quick rewrite took care of that.
After these bugs were fixed trinity would (on a good kernel) run for a really long time without incident. With longer runtimes, a few more obscure corner cases turned up.
There were 2-3 cases where the watchdog process would hang waiting for a condition that would never be met (due to losing track of how many running child processes there were). I’m still not happy that this can even occur but it is at least a little less likely to hang when it happens now. I’ll investigate the actual cause for this later.
Another fun watchdog bug: we keep track of the time stamp a child performed its last syscall at, and check to make sure 1 second later that it has increased by some small amount. To make sure we haven’t corrupted our own state, there’s also a sanity check that we haven’t jumped into the future. But we also have to compensate for the possibility that adjtimex was the random syscall we did. That takes a maximum offset of 2145. The code checked for that but forgot to also add the one second since the last time we checked.
There’s been a bunch of small 1-2 fixes like this lately, but I’m sitting on a larger set of changes that I’ll start to trickle into git next week, which moves towards cleaning up the “create a random page to pass to syscalls” code, which has been another fun source of corruption bugs.
In kernel news: The only interesting bugs this week that Trinity has shown up, have been two ext4 bugs. Diagnosing those has pointed out some more enhancements that are needed to the post-mortem code in trinity. Once I’ve cleared the current backlog of patches, I’ll work on adding better tracking of fd’s in the logging code. In other news, the btrfs bug trinity hit in August is now fixed in 3.17+ git.