With the success that afl has been having on fuzzing userspace, I’ve been revisiting an idea that Andi Kleen gave me years ago for trinity, which was pretty much the same thing but for kernel space. I.e., a genetic algorithm that rates how successful the last fuzz attempt was, and makes a decision on whether to mutate that last run, or do something completely new.
It’s something I’ve struggled to get my head around for a few years. The mutation part would be fairly easy. We would need to store the parameters from the last run, and extrapolate out a set of ->mutate functions from the existing ->sanitize functions that currently generate arguments.
The difficult part is the “how successful” measurement. Typically, we don’t really get anything useful back from a syscall other than “we didn’t crash”, which isn’t particularly useful in this case. What we really want is “did we execute code that we’ve not previously tested”. I’ve done some experiments with code coverage in the past. Explorations of the GCOV feature in the kernel didn’t really get very far however for a few reasons (primarily that it really slowed things down too much, and also I was looking into this last summer, when the initial cracks were showing that I was going to be leaving Red Hat, so my time investment for starting large new projecs was limited).
After recent discussions at work surrounding code coverage, I got thinking about this stuff again, and trying to come up with workable alternatives. I started wondering if I could use the x86 performance counters for this. Basically counting the number of instructions executed between system call enter/exit. The example code that Vince Weaver wrote for perf_event_open looked like a good starting point. I compiled it and ran it a few times.
$ ./a.out Measuring instruction count for this printf Used 3212 instructions $ ./a.out Measuring instruction count for this printf Used 3214 instructions
Ok, so there’s some loss of precision there, but we can mask off the bottom few bits. A collision isn’t the end of the world for what we’re using this for. That’s just measuring userspace however. What happens if we tell it to measure the kernel, and measure say.. getpid().
$ ./a.out Used 9283 instructions $ ./a.out Used 9367 instructions
Ok, that’s a lot more precision we’ve lost. What the hell.
Given how much time he’s spent on this stuff, I emailed Vince, and asked if he had insight as to why the counters weren’t deterministic across different runs. He had actually written a paper on the subject. Turns out we’re also getting event counts here for page faults, hardware interrupts, timers, etc.
x86 counters lack the ability to say “only generate events if RIP is within this range” or anything similar, so it doesn’t look like this is going to be particularly useful.
That’s kind of where I’ve stopped with this for now. I don’t have a huge amount of time to work on this, but had hoped that I could hack up something basic using the perf counters, but it looks like even if it’s possible, it’s going to be a fair bit more work than I had anticipated.
It occurred to me after posting this that measuring instructions isn’t going to work regardless of the amount of precision the counters offer. Consider a syscall that operates on vma’s for example. Over the lifetime of a process, the number of executed instructions of a call to such a syscall will vary even with the same input parameters, as the lengths of various linked lists that have to be walked will change. Number of instructions, or number of branches taken/untaken etc just isn’t a good match for this idea. Approximating “have we been here before” isn’t really achievable with this approach afaics, so I’m starting to think something like the initial gcov idea is the only way this could be done.