Trinity 1.6

As alluded to in my last post, a few days ago I released a new version of Trinity.
The bulk of the work in this release happened prior to my burnout back in July. The combination of everything described in that post, general unhappiness in my last job, etc. led to me just wanting to walk away from everything for an indeterminate amount of time.

Distance is good. I’ve continued to poke at Trinity in small amounts since then. At last week’s kernel summit, a number of people expressed just how useful they find Trinity, and how bummed they were to find out I wasn’t working on it any more. With that feedback, I felt motivated to clear the decks and get 1.6 out. There’s a short description of most of the bigger changes below; there are probably a bunch more in the shortlog that I forgot to highlight.

With that release wrapped up, and with the fresh perspective of having been ‘away’ from the project for a while, I started work on some new features while travelling last week, beginning with a generic object cache to replace the hard-coded “remember this” functionality for every single object type a syscall could return. It’s a relatively small amount of code, and it should make it easier to recycle syscall results for syscalls other than mmap (which is all that’s implemented right now).
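The rough shape of the idea is something like this (a minimal sketch; the names are illustrative rather than exactly what’s in Trinity’s tree):

#include <stdlib.h>

/* one remembered object; the payload is type-specific (an mmap'd
 * address, an fd, a sysv shm id, ...) so store an opaque pointer */
struct object {
        struct object *next;
        void *payload;
};

/* a per-type list of objects that previous syscalls gave back to us */
struct objhead {
        struct object *list;
        unsigned int num_objects;
};

enum objecttype { OBJ_MMAP, OBJ_FD, MAX_OBJECT_TYPES };

static struct objhead objects[MAX_OBJECT_TYPES];

/* remember a result so later syscalls can reuse it as an argument */
static void add_object(enum objecttype type, void *payload)
{
        struct object *obj = malloc(sizeof(*obj));

        if (obj == NULL)
                return;
        obj->payload = payload;
        obj->next = objects[type].list;
        objects[type].list = obj;
        objects[type].num_objects++;
}

/* hand back a random previously seen object of the given type */
static void *get_random_object(enum objecttype type)
{
        struct object *obj = objects[type].list;
        unsigned int n;

        if (objects[type].num_objects == 0)
                return NULL;

        n = rand() % objects[type].num_objects;
        while (n--)
                obj = obj->next;
        return obj->payload;
}

The per-type lists replace the one-off mmap bookkeeping: any syscall that returns a reusable object calls add_object(), and argument generation can then pull from get_random_object() instead of conjuring something from scratch.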

So,.. while I’m working on this stuff again, it’s not the comeback many would like. I don’t know just how much time I’m going to have to devote to working on Trinity. From time to time, I suspect I’ll find some intersection between my work at Facebook and the sort of targeted testing that Trinity is useful for, but it’s not my primary focus, and probably won’t be again. Additionally, I’ve got a bunch of ideas for new projects I’m itching to work on that spawned from discussions last week, so “spare time” hacking effort might be devoted more to them in future.

tl;dr: Don’t send me feature requests. I’ve got more than enough ideas for stuff *I* want to implement. Diffs speak louder than words.

A summary of some of the bigger changes to Trinity since the last (1.5) tarball release:

  • Assorted improvements to the tuned random number generation.
    (Including one particularly stupid bug where sometimes all child processes would get the same seed, and end up doing the same syscalls. oops)
  • Various networking related improvements/fixes:
    • tcp: add TCP_TIMESTAMP, TCP_NOTSENT_LOWAT & TCP_CC_INFO socket options.
    • ipv6: Improved generation of random addresses. (No longer just localhost)
    • ipv6: Added 14 missing socket options.
    • ipv6: Now passes correct lengths for socket options. (Note: This change may break older glibcs: See this patch.)
    • Beginnings of some better proto-alg sockaddr generation.
    • Recognise PF_IB and PF_MPLS network protocols
    • Socket generation improvements. (Picks right socket type to go with protocol)
    • Now supports an ARG_SOCKETINFO for syscalls that operate primarily on sockets. (Still occasionally passes random fd’s)
    • accept,accept4,bind,connect,getpeername,getsockname,recv,setsockopt,send converted to use ARG_SOCKETINFO.
    • setsockopt now matches the setsockopt args to the protocol of the socket passed in.
    • netlink socket generation fix (pid is a portid, not a process id)
    • The -P parameter no longer accepts the incomprehensible numeric form of arguments, just names.
    • The PF_ prefix to the -P parameter is now optional, so you can just say ‘UNIX’ instead of ‘PF_UNIX’.
  • Updates to keep up with new upstream kernel changes.
    • Updated perf_event_open syscall to include 4.1 changes
    • Updated syscall lists
      alpha: execveat, getrandom, memfd_create
      s390[x]: execveat, NUMA related syscalls
      parisc: execveat
    • mips: add new prctls for PR_SET_FP_MODE / PR_GET_FP_MODE
    • Support for new fallocate flags (FALLOC_FL_INSERT_RANGE)
  • Watchdog:
    • Remove some false-positive triggering checks from the watchdog.
    • Watchdog process is now nice’d to -19
    • Monitor how many processes are currently stalled.
    • If all child processes are stalled, send SIGKILLs to 50% of them.
  • Misc:
    • New fd generators for drm dumb buffers & inotify watches.
    • blacklist /dev/sd* from the fd list, so we can be a bit safer when running as root with --dropprivs
    • Fixed the ‘bind process to CPU’ code to only pick online CPUs.
    • Self-corruption checks added to child processes, similar to what the watchdog code already did.
    • Remove guard pages around shm.
    • In debug mode, write protect the shm before making syscalls.
    • Refactoring of logging code.
    • Various code cleanups as usual.
    • No longer tries to mmap 1GB pages if running with less than 8GB free.

kernel summit 2015 wrap-up

Exhausting travel aside, kernel summit in Seoul was a good use of time.
Most of the sessions didn’t feel as interactive as in prior years, in part I think because there really wasn’t a lot of objection, even to some of the more controversial things. Kees’ security talk went over pretty well, even if it did depress most of the people in the room. Hopefully something good will come of it. The restartable sequences feature got talked about, but didn’t get much (if any) real pushback.

There were a few hallway discussions surrounding various upcoming kernel functionality that didn’t get ‘airtime’ in the sessions. The kernel TLS stuff was probably discussed more in depth at netconf, and assorted VM features were covered more at LSFMM earlier this year. Quite a few people were talking excitedly about eBPF, both from a networking point of view and, soon, tracing. Quite a few people still seem (rightly) concerned about the upcoming unprivileged bpf syscall.

It seems that by fracturing the kernel summit into lots of smaller events, the deep dives into new features and problems happen at those smaller venues, leaving the kernel summit itself for executive-summary type talks and, as has been the general push over the last decade, more and more process-related discussions.

On process, Sasha’s discussion on stable was probably the most interesting to me personally. GregKH agreed to make 4.4 the next LTS, starting a new tradition of “the next LTS is the one after the kernel summit”. We’ll see how that works out.

Chris Mason gave a “what went good/bad when Facebook moved to 4.0” talk, which for the most part was all good. There are a few small things still being shaken out, but it’s by no means awful.

I had a lot of hallway conversations that began “so, trinity..”
The short answer there is that I’m still working on it, though at a much reduced pace compared to a year ago. It was good to hear from pretty much everyone I talked to that it’s something people value, which was a good motivator. More on that later.

I also had a lot of people asking a lot of questions about my Facebook bootcamp experience. I’ll do a longer write-up of that soon.

The case of the mysterious disappearing I211

Day one of unemployed life saw me finally getting around to the first of several hardware-related maintenance items that I’d been putting off until I had the time.

I got a lot of life out of the desktop machine I had been using since 2007. Earlier this year, I decided it was long overdue an upgrade, and ended up building a ridiculously over-specced machine in the hope that it too would last me a while. After some research, I ended up with a 6-core Haswell-E i7-5820K, and a frankly ridiculously over-featured motherboard.
Once I had delved through the absurd number of BIOS options to convince it that I *really* didn’t want to overclock my CPU or my RAM, or anything else, it was very stable.

It has exceeded all my expectations. In the time it took my old desktop to build one kernel, I can build kernel .deb’s for every machine I own, and still have time to spare. It’s an absolute beast.

One of the features that sold me on this board was the two onboard ethernet ports. I had been wanting to do a bunch of networking experiments, and the possibility of using bonding, without having to screw around with add-in cards was appealing.

So I was a little irked one evening after updating its BIOS, to notice that the bond only had one interface active. After some investigation, I noticed that the PCI ID of one of the onboard NICs had changed.

What was once

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V (rev 05)
08:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

Was now

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V (rev 05)
08:00.0 Ethernet controller: Intel Corporation Device 1532 (rev 03)

My I211 had changed its PCI ID, and the igb driver wouldn’t bind to this new device.

At first I thought “Cool, some kind of NIC firmware update”, and assumed that e1000 hadn’t been updated yet to support this new feature. Googling for “i211 1532” told a much sadder story however.

If you read the spec update for the i211, you find this interesting table:

Device ID Code                              Vendor ID   Device ID   Revision ID
WGI211AT (not programmed/factory default)   0x8086      0x1532      0x3
WGI211AT (programmed)                       0x8086      0x1539      0x3

Uh, not cool. Somehow the BIOS update procedure had wiped the NVRAM on the NIC.

A long protracted conversation with ASUS support followed, including such gems as “I understand you’re seeing blue screens” and “Have you tried removing the DIMMs, rubbing the contacts with an eraser and replacing them”. Eventually I think they got to the end of their script, and agreed to RMA the board. Somewhat annoying, given there’s probably a tool somewhere that can rewrite the flash, but Intel only seems to make that available to integrators, not end-users, and the ASUS representatives denied all knowledge.

It was gone for about two weeks, and finally returned yesterday. Its PCI ID is 0x1539 again, and it has its old MAC address once more. (I’m now hesitant to ever upgrade the BIOS on this machine again.) So what happened? Anyone’s guess, but this isn’t the first time I’ve seen it happen: we had a bunch of these NICs at Akamai too that occasionally had the same thing happen to them.

The whole thing is reminiscent of a painful old bug where ftrace would corrupt the e1000e ROM. Hopefully Linux isn’t to blame this time.

So, long story short: If you see an i211 with a PCI ID of 1532, you’re looking at an RMA.

Moving on from Akamai.

Today was my last day at Akamai. It’s been brief (just over seven months), but things weren’t really working out for me there, for a number of reasons. As I’ve mentioned to the people who have known about my decision for a while, it’s not that it’s a bad place to work; it just never felt like a good fit for me, and I came to realize that I’d spent most of this last year in denial about just how unhappy I was, in the hope “things would get better”.

There are a lot of smart people working there, working on really difficult problems, but a lot of those problems just don’t align with my interests, especially when they don’t always involve contributing code back upstream. [clarification: There is some upstream work going on there, just not as much as I’d like].

Add to this my disdain for some of the proprietary tooling that’s prevalent there, and it was becoming clear it was not a matter of “if” but “when” I was going to leave. As an example: I joked a few months ago to co-workers, “next time I’m looking for a job, the first question I ask is ‘do you use perforce?’”. Only it wasn’t really a joke; I was dead serious. User-hostile software has no place in my life.
Even little things like “let’s use git” translating to “let’s license Atlassian stash” rather than “run a git-daemon somewhere” started getting me down.

The final project I worked on there was a continuous rebase strategy for the kernel, moving away from perforce to git. It’s a move in the right direction, but ultimately, not the sort of work that gets me excited, and it’s going to be a multi-year project before it starts really bearing fruit. Given how perforce is ingrained in so many of Akamai’s systems, it would also have been extremely unlikely I’d have been able to purge all knowledge of ever having used it.

The rebase work also made it hard to ignore that many of the kernel changes we made had no chance of ever even being submitted, let alone accepted upstream. (In part because many of them are very specific to Akamai’s CDN: you won’t find any of the trickery employed there described in a Richard Stevens book, and they’re unlikely to ever become official RFCs, given the competitive edge Akamai gains from those changes.)
There are exceptions to all of this, and the kernel team is trying to do a better job of upstreaming the newer changes, but many of the older legacy patches are under-documented and well understood by few people, with the original authors no longer around. That makes getting up to speed a frustrating exercise, especially when you’re trying to learn what the upstream code is doing at the same time.

Someone who has spent less of their career dealing exclusively with open source would probably find many of my reasons for leaving trivial. Those same people would probably find Akamai a great place to work; there are a lot of opportunities there if you have a higher tolerance for such things than I did. It was eye-opening recently, mentoring some of the interns there. Optimism. The unjaded outlook that comes with youth. Not getting bent out of shape at crappy tooling because they don’t know any different. It made me realize I was never going to be like that there.

On a particularly bad day a few weeks back, a recruiter reached out to me, to find out if I was interested in a second chance at an offer I received last time I was looking for a new job. It worked. Enduring an unhappy situation in the hopes things will get better isn’t a great strategy when there are other options.

So, I start at Facebook in September.

I have no delusions that things are going to be perfect there, but at least from the outside right now, the grass looks greener. I feel bad walking away from unfinished problems, but going home miserable or angry or some other negative emotion every day was really starting to take its toll. It’s not a healthy way to live.

When I was interviewing last December, I read Being Geek to death, so it’s fitting that I’ve picked it up again recently. One paragraph in particular jumps out at me.

My single worst gig was one where I got everything I wanted out of the offer letter, but in my exuberance for being highly valued, I totally forgot that my gut read on the gig was "meh". Ninety days later, I couldn't care less that I got a 15% raise and a sign-on bonus. I couldn't stand the mundanity of the daily work, and I happily resigned a few months later, taking both a pay cut and returning my sign-on bonus for the opportunity to work at Netscape.

Anachronisms and minor details aside, that paragraph played through my head this afternoon as I wrote the check to pay back the remainder of my sign-on bonus. I wasn’t quite thinking “meh”, but I knew I was making compromises on what I really valued from day one.

Walking away from unvested RSUs, giving up this month’s paycheck, and writing that check stings a little, but when I did my exit interview this morning, I knew that I too was “happily resigning” for a great opportunity.

I’m feeling uncharacteristically optimistic right now. Hopefully it’ll last.

I’ll be in Seattle next week, but due to complications with my registration being transferred to another Akamai employee, I won’t actually be at the Linux plumbers conf. If you’re also going to be there and want to catch up, drop me a mail, or <ahem> hit me up on facebook.

Future development of Trinity.

It’s been an odd few weeks regarding Trinity based things.

First an email from a higher-up at my former employer asking (paraphrased)..

"That thing we asked you to stop working on when you worked here, any chance now you've left you'll implement these features."

I’m still trying to get my head around the thought process that led to that being a reasonable thing to ask. I’ve made the occasional commit over the last six months, but it’s mostly been code motion, clean-up work, and things like syscall table updates. New feature development came to a halt long ago.

It’s no coincidence that the number of bugs found with Trinity has dropped off sharply since the beginning of the year, and I don’t think it’s because the Linux kernel suddenly got a lot better. Rather, it’s due to the lack of real ongoing development to “try something else” when one approach dries up. Sadly, we now live in a world where it’s easier to get paid to run someone else’s fuzzer than it is to develop one.

Then earlier this week, came the revelation that the only people prepared to fund that kind of new feature development are pretty much the worst people.

Apparently Hacking Team modified Trinity to fuzz ioctl() on Android, which yielded some results. I’ve done no analysis on whether those crashes are exploitable/fixed/only relevant to Android etc. (Frankly, I’m past caring.) I’m not convinced their approach is particularly sound even if it was finding results Trinity wasn’t, so it looks unlikely there are even ideas worth borrowing here. (We all already knew that ioctl was rife with bugs, and had practically zero coverage testing.)

It bothers me that my work was used as a foundation for their hack-job. Then again, maybe if I hadn’t released Trinity, they’d have based it on iknowthis, or some other less useful fuzzer. None of this should really surprise me. I’ve known for some time that there are some “security” people who have their own modifications they have no intention of sending my way. Thanks to the way that people who release 0-days are revered in this circus, there’s no incentive to share modifications if it means someone else might beat them to finding their precious bugs.

It’s unfortunate that this project has attracted so many awful people. When I began it, the motivation had nothing to do with security. Back in 2010 we were inundated with weird oopses that we couldn’t reproduce, many of them triggered by JVMs. I came up with the idea that maybe a fuzzer could create a realistic enough workload to tickle some of those same bugs. It turned out I was right, and so began a series of huge page and other VM related bug fixes.

In the five years that I’ve made Trinity available, I’ve received notable contributions from perhaps a half dozen people. In return I’ve made my changes available before I’d even given them runtime myself.

It’s a project everyone wants to take from, but no-one wants to give back to.

And that’s why for the foreseeable future, I’m unlikely to make public any further feature work I do on it.
I’m done enabling assholes.

Thoughts on a feedback loop for Trinity.

With the success that afl has been having on fuzzing userspace, I’ve been revisiting an idea that Andi Kleen gave me years ago for trinity, which was pretty much the same thing but for kernel space. I.e., a genetic algorithm that rates how successful the last fuzz attempt was, and makes a decision on whether to mutate that last run, or do something completely new.

It’s something I’ve struggled to get my head around for a few years. The mutation part would be fairly easy: store the parameters from the last run, and extrapolate a set of ->mutate functions from the existing ->sanitize functions that currently generate arguments.
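To make that concrete, here’s a rough sketch of the shape this could take (loosely modelled on Trinity’s per-syscall descriptor; the field names are illustrative, not necessarily what would land in the tree):

struct syscallrecord {
        /* the arguments (and result) from the previous invocation */
        unsigned long a1, a2, a3, a4, a5, a6;
        long retval;
};

struct syscallentry {
        const char *name;
        unsigned int num_args;

        /* existing: conjure up a fresh set of (semi-)sane arguments */
        void (*sanitize)(struct syscallrecord *rec);

        /* new: perturb the previous run's arguments slightly,
         * rather than starting over from scratch */
        void (*mutate)(struct syscallrecord *rec);
};

The main loop would then flip a (weighted) coin each iteration: call ->mutate on the stored record if the last run looked “interesting”, or fall back to ->sanitize for a completely new attempt.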

The difficult part is the “how successful” measurement. Typically, we don’t really get anything useful back from a syscall other than “we didn’t crash”, which isn’t particularly useful in this case. What we really want is “did we execute code that we’ve not previously tested?”. I’ve done some experiments with code coverage in the past, but explorations of the GCOV feature in the kernel didn’t get very far, for a few reasons: primarily that it slowed things down too much, and also that I was looking into this last summer, when the initial cracks were showing that I was going to be leaving Red Hat, so my time for starting large new projects was limited.

After recent discussions at work surrounding code coverage, I got thinking about this stuff again, and tried to come up with workable alternatives. I started wondering if I could use the x86 performance counters for this: basically, counting the number of instructions executed between syscall enter and exit. The example code that Vince Weaver wrote for perf_event_open looked like a good starting point.
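The interesting part of that example boils down to something like the following (a from-memory paraphrase of the man-page sample, not Vince’s verbatim code):

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no wrapper for perf_event_open, so roll our own */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
        struct perf_event_attr pe;
        long long count;
        int fd;

        memset(&pe, 0, sizeof(pe));
        pe.type = PERF_TYPE_HARDWARE;
        pe.size = sizeof(pe);
        pe.config = PERF_COUNT_HW_INSTRUCTIONS;
        pe.disabled = 1;
        pe.exclude_kernel = 1;  /* flip to 0 to count kernel instructions too */
        pe.exclude_hv = 1;

        fd = perf_event_open(&pe, 0, -1, -1, 0);
        if (fd == -1) {
                perror("perf_event_open");
                return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        printf("Measuring instruction count for this printf\n");

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        read(fd, &count, sizeof(count));
        printf("Used %lld instructions\n", count);

        close(fd);
        return 0;
}

I compiled it and ran it a few times.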

$ ./a.out 
Measuring instruction count for this printf
Used 3212 instructions
$ ./a.out 
Measuring instruction count for this printf
Used 3214 instructions

Ok, so there’s some loss of precision there, but we can mask off the bottom few bits. A collision isn’t the end of the world for what we’re using this for. That’s just measuring userspace, however. What happens if we tell it to measure the kernel, and measure, say, getpid()?

$ ./a.out 
Used 9283 instructions
$ ./a.out 
Used 9367 instructions

Ok, that’s a lot more precision we’ve lost. What the hell.
Given how much time he’s spent on this stuff, I emailed Vince and asked if he had any insight as to why the counters weren’t deterministic across different runs. It turns out he had actually written a paper on the subject. We’re also picking up event counts here from page faults, hardware interrupts, timers, etc.
The x86 counters lack the ability to say “only generate events if RIP is within this range” or anything similar, so it doesn’t look like this is going to be particularly useful.

That’s kind of where I’ve stopped with this for now. I don’t have a huge amount of time to work on this, but had hoped that I could hack up something basic using the perf counters, but it looks like even if it’s possible, it’s going to be a fair bit more work than I had anticipated.

It occurred to me after posting this that measuring instructions isn’t going to work regardless of how much precision the counters offer. Consider a syscall that operates on vma’s, for example: over the lifetime of a process, the number of instructions executed by a call to such a syscall will vary even with the same input parameters, as the lengths of the various linked lists that have to be walked change. Number of instructions, number of branches taken/not taken, etc. just isn’t a good match for this idea. Approximating “have we been here before?” isn’t really achievable with this approach afaics, so I’m starting to think something like the initial gcov idea is the only way this could be done.

kernel code coverage brain dump.

Someone at work recently asked me about code coverage tooling for the kernel. I played with this a little last year, when I was trying to figure out just how much of certain syscalls Trinity was exercising. I ended up a little disappointed at the state of the post-processing tools for the information presented, and added “hack up something better” to my TODO list, where it quickly bubbled its way to the bottom.

As I’d already done a write-up based on past experiences with this stuff, I figured I’d share it.

It requires a kernel built with CONFIG_GCOV_KERNEL=y and CONFIG_GCOV_PROFILE_ALL=y.
Note: setting GCOV_PROFILE_ALL incurs some performance penalty, so any kernel built with this option should _never_ be used for any kind of performance tests.
I can’t stress this enough: it’s miserably slow. Disk operations that took minutes for me now took hours. As an example, here’s a dd, first on a kernel built without gcov:


# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 0.409712 s, 1.3 GB/s
0.00user 0.40system 0:00.41elapsed 99%CPU (0avgtext+0avgdata 2980maxresident)k
136inputs+1024000outputs (1major+340minor)pagefaults 0swaps

And the same on a kernel built with GCOV_PROFILE_ALL enabled:

# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 6.17212 s, 84.9 MB/s
0.00user 7.17system 0:07.22elapsed 99%CPU (0avgtext+0avgdata 2940maxresident)k
0inputs+1024000outputs (0major+338minor)pagefaults 0swaps

From under half a second to over seven seconds, and from 1.3 GB/s down to 85 MB/s. Ugh.

If we *didn’t* set GCOV_PROFILE_ALL, we’d have to recompile just the files we cared about with the relevant gcc profiling switches. It’s kind of a pain.
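(If I’m remembering the kbuild convention correctly, those per-file switches are set with GCOV_PROFILE overrides in the relevant Makefile, along these lines:)

# in e.g. mm/Makefile: profile just this one object..
GCOV_PROFILE_mmap.o := y

# ..or everything built in that directory
GCOV_PROFILE := y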

For all this to work, gcov expects to see a source tree, with:

  • .o objects
  • source files
  • .gcno files (these are generated during the kernel build)
  • .gcda files containing the runtime counters. These come from sysfs on the running kernel.

After booting the kernel, a subtree appears in debugfs at /sys/kernel/debug/gcov/
These directories mirror the kernel source tree, but instead of source files they contain files that can be fed to the gcov tool: a .gcda file, and a .gcno symlink back to the source tree (with the complete path). E.g., /sys/kernel/debug/gcov/home/davej/build/linux-dj/mm contains (among others..)

-rw------- 1 root root 0 Mar 24 11:46 readahead.gcda
lrwxrwxrwx 1 root root 0 Mar 24 11:46 readahead.gcno -> /home/davej/build/linux-dj/mm/readahead.gcno

The symlink will likely be broken on the test machine, because the path doesn’t exist there, unless you nfs-mount the source tree from the build machine, for example.

I hacked up the script below, which may or may not be useful for anyone else (honestly, it’s way easier to just use nfs).
Run it from within a kernel source tree, and it will populate the source tree with the relevant .gcda files, and generate the .gcov output files.

#!/bin/sh
# Copy gcov data out of debugfs and run gcov on one .c file, from the root
# of the kernel source tree. Saved here as gcov-extract.sh (name is arbitrary).

pwd=$(pwd)
obj=$(echo "$1" | sed 's/\.c/\.o/')
if [ ! -f "$obj" ]; then
  exit 0
fi

dirname=$(dirname "$1")
gcovfn=$(echo "$(basename "$1")" | sed 's/\.c/\.gcda/')
if [ -f "/sys/kernel/debug/gcov$pwd/$dirname/$gcovfn" ]; then
  cp "/sys/kernel/debug/gcov$pwd/$dirname/$gcovfn" "$dirname"
  gcov -f -r -o "$1" "$obj"
  if [ -f "$(basename "$1").gcov" ]; then
    mv "$(basename "$1").gcov" "$dirname"
  fi
else
  echo "no gcov data for /sys/kernel/debug/gcov$pwd/$dirname/$gcovfn"
fi

Take that script, and run it like so..

$ cd kernel-source-tree
$ find . -type f -name "*.c" -exec ./gcov-extract.sh "{}" \;

Running it on, for example, mm/mmap.c will cause gcov to spit out an mmap.c.gcov file (in the current directory) with coverage information that looks like..

   135684:  269:static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
        -:  270:{
   135684:  271:        struct vm_area_struct *next = vma->vm_next;
        -:  272:
   135684:  273:        might_sleep();
   135686:  274:        if (vma->vm_ops && vma->vm_ops->close)
     5080:  275:                vma->vm_ops->close(vma);
   135686:  276:        if (vma->vm_file)
    90302:  277:                fput(vma->vm_file);
        -:  278:        mpol_put(vma_policy(vma));
   135686:  279:        kmem_cache_free(vm_area_cachep, vma);
   135686:  280:        return next;
        -:  281:}

The numbers on the left are the number of times each line of code was executed.
Lines beginning with ‘-’ have no coverage information, for whatever reason.
Lines that were never executed (a branch that was never taken, say) get prefixed with ‘#####’, like so..

  4815374:  391:                if (vma->vm_start < pend) {
    #####:  392:                        pr_emerg("vm_start %lx < pend %lx\n",
        -:  393:                                  vma->vm_start, pend);
        -:  394:                        bug = 1;
        -:  395:                }

There are some cases that need a little more digging to explain. eg:

    88105:  237:static void __remove_shared_vm_struct(struct vm_area_struct *vma,
        -:  238:                struct file *file, struct address_space *mapping)
        -:  239:{
    88105:  240:        if (vma->vm_flags & VM_DENYWRITE)
    15108:  241:                atomic_inc(&file_inode(file)->i_writecount);
    88105:  242:        if (vma->vm_flags & VM_SHARED)
        -:  243:                mapping_unmap_writable(mapping);
        -:  244:
        -:  245:        flush_dcache_mmap_lock(mapping);
    88105:  246:        vma_interval_tree_remove(vma, &mapping->i_mmap);
        -:  247:        flush_dcache_mmap_unlock(mapping);
    88104:  248:}

In this example, lines 245 & 247 have no hitcount, even though there’s no way they could have been skipped.
If we look at the definition of flush_dcache_mmap_(un)lock, we see..
#define flush_dcache_mmap_lock(mapping) do { } while (0)
So the compiler never emitted any code, and hence, it gets treated the same way as the blank lines.

There is a /sys/kernel/debug/gcov/reset file that can be written to in order to reset the counters between tests, if desired.
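For example, to zero everything between runs:

# echo 0 > /sys/kernel/debug/gcov/reset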

Additional thoughts

  • Not sure how inlining affects things.
  • There needs to be some element of post-processing, to work out percentages of code coverage etc, which may involve things like stripping out comments/preprocessor defines.
  • Debug kernels differ in behavior in various low-level features. For example, LOCKDEP will fundamentally change the way spinlocks work. For coverage purposes though, we can choose not to care, and stop drilling down at certain levels.
  • Whatever does the post-processing of results may need to aggregate results from multiple test machines. Think of the situation where we’re running a client/server test: Both machines will be running different code paths.
  • ggcov has some interesting looking tooling for visually displaying results.

Trinity socket improvements

I’ve been wanting to get back to working on the networking related code in trinity for a long time. I recently carved out some time in the evenings to make a start on some of the lower hanging fruit.

Something that has bugged me for a while is that we create a bunch of sockets on startup, and then when we call, for example, setsockopt() on one of those sockets, the options we pass more than likely don’t match the protocol the socket was created with. This isn’t always a bad thing: one of the oldest kernel bugs Trinity found was triggered by setting TCP options on a non-TCP socket. But doing this the majority of the time is wasteful, as we’ll just get -EINVAL most of the time.

We actually have the necessary information in Trinity to know what kind of socket we’re dealing with, in a socketinfo struct.

struct socket_triplet {
        unsigned int family;
        unsigned int type;
        unsigned int protocol;
};

struct socketinfo {
        struct socket_triplet triplet;
        int fd;
};

We just had it at the wrong level of abstraction: setsockopt only ever saw a file descriptor. We could have searched through the fd arrays looking for the matching socketinfo, but that seemed like a lame solution. So I changed the various networking syscalls to take an ARG_SOCKETINFO instead of an ARG_FD. As a side effect, we now pass sockets to those syscalls more often than, say, a perf fd, or an epoll fd, or ..

There is still a small chance we pass some crazy fd, just to cover the crazy cases, though those cases don’t tend to trip things up much any more.

After passing down the triplet, it was a simple case of annotating the structures containing the various setsockopt function pointers to indicate which family they belong to. AF_INET was the only complication, which needed special casing due to the multiple protocols for which we have setsockopt() functions; creating a second table, keyed on protocol instead of family, was enough for the matching code.
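Roughly, the annotated-table idea looks like this (a simplified sketch with made-up names, not Trinity’s literal code):

#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* what a generator fills in for one setsockopt() call (simplified) */
struct sockopt {
        unsigned int level;
        unsigned int optname;
};

typedef void (*sockopt_fn)(struct sockopt *so);

/* per-family generators; bodies stubbed out here */
static void unix_setsockopt(struct sockopt *so)  { so->level = SOL_SOCKET; }
static void inet_setsockopt(struct sockopt *so)  { so->level = SOL_IP; }
static void inet6_setsockopt(struct sockopt *so) { so->level = SOL_IPV6; }

/* each entry is annotated with the family it applies to */
static const struct {
        unsigned int family;
        sockopt_fn fn;
} sockopt_funcs[] = {
        { AF_UNIX,  unix_setsockopt  },
        { AF_INET,  inet_setsockopt  },  /* dispatches again on ->protocol */
        { AF_INET6, inet6_setsockopt },
};

struct socket_triplet {
        unsigned int family;
        unsigned int type;
        unsigned int protocol;
};

/* pick a generator matching the socket we're about to operate on */
static sockopt_fn lookup_sockopt_fn(const struct socket_triplet *t)
{
        size_t i;

        for (i = 0; i < sizeof(sockopt_funcs) / sizeof(sockopt_funcs[0]); i++) {
                if (sockopt_funcs[i].family == t->family)
                        return sockopt_funcs[i].fn;
        }
        return NULL;    /* no match: fall back to fully random options */
}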

There are still a ton of improvements I want to make to this code, but it’s going to take a while, so it’s good when some mostly trivial changes like the above come together quickly.

the more things change.. 4.0

$ ping gelk
PING gelk (…) 56(84) bytes of data.
WARNING: kernel is not very fresh, upgrade is recommended.
$ uname -r
4.0.0

Remember that one time the kernel versioning changed and nothing in userspace broke? Me either.

Why people insist on thinking they can get this stuff right is beyond me.


update: this was already fixed, almost exactly a year ago, in the ping git tree. The (now removed) commentary kind of explains why they cared. Sigh.

LSF/MM 2015 recap.

It’s been a long week.
Spent Monday/Tuesday at LSFMM. This year it was in Boston, which was convenient in that I didn’t have to travel anywhere, but less convenient in that I had to get up early and do a rush-hour commute to get to the conference location in time. At least the weather got considerably better this week compared to the frankly stupid amount of snow we’ve had over the last month.
LWN did their usual great write-up which covers everything that was talked about in a lot more detail than my feeble mind can remember.

A lot of things from last year’s event seem to still be getting a lot of discussion, SMR drives & persistent memory being the obvious stand-outs. There was lots of discussion surrounding various things related to huge pages (so much so that one session overran and replaced a slot I was supposed to share with Sasha; not that I complained, it was interesting stuff, and I learned a few new reasons to dislike the way we handle hugepages & forking), and I lost track of how many times the GFP_NOFAIL discussion came up.

In a passing comment in one session, one of the people Intel sent (Dave Hansen iirc) mentioned that Intel is now shipping an 18-core/36-thread CPU. A bargain at just $4642. Especially when compared to this madness.

A few days before the event, I had been asked if I wanted to do a “how Akamai uses Linux” type talk at LSFMM, akin to what Chris Mason did re: Facebook at last year’s event. I declined, given I’m still trying to figure that out myself. Perhaps another time.

Wednesday/Thursday, I attended Vault at the same location.
My take-aways:

  • There were generally a lot more positive vibes around btrfs this year. Even with Josef playing bad cop to Chris’ good cop, things generally seemed to be moving away from “everything is awful” toward “this actually works…”, though with the qualifier “.. for Facebook’s workload”. Josef did touch on one area where btrfs does still suck: apparently database workloads (iirc, due to the copy-on-write nature of btrfs). The spurious ENOSPC failures of the past should hopefully stay in the past. Things are generally on the up and up. (Though this does include the line count, which has now passed 100KLOC, more than double that of XFS or ext*. Scary.)
  • Equally positive vibes surrounding XFS. We celebrated the 20 year anniversary at one evening event, making us all feel just that little bit more like an old fart club. Interesting talk toward the end by Dave Chinner about the future of XFS, and how the current surge of development in XFS is probably its last for various scaling reasons as disks continue to get bigger and bigger. Predicting the future is always hard, but if what Dave said was true, things will start to get ‘interesting’ in about 5 years time, given every other filesystem we support in Linux has the same issues (or worse).
  • People still care a lot about NFS. Especially pNFS. Surprising amount of activity still happening.
  • Even when I worked there, I never really got Red Hat’s “big picture” wrt the several distributed filesystems they supported. Now that I’m not there, I feel even more out of the loop. “ceph is the way forward” “except when it’s glusterfs” or something. Oh, and GFS2 is still a thing apparently, for some reason.
  • As entertaining as Jeremy Allison might be, don’t go to a talk on Samba internals unless you work on it (in which case it’s too late for you). The horrors will likely keep you up at night.
  • Ted’s ext4 talk drew a decent crowd. As fancy as btrfs/xfs etc might be, a *lot* of people still give a crap about extN. Somehow I missed the addition of the ‘lazytime’ option to ext4. Seems neat. Played with it (and also the super-secret ‘dioread_nolock’ mount option). Saw another talk on orphan list scalability in ext4, which was interesting, but didn’t draw as big a crowd.

I got asked “What are you doing at Akamai ?” a lot. (answer right now: trying to bring some coherence to our multiple test infrastructures).
Second most popular question: “What are you going to do after that?” (answer: unknown, but likely something more related to digging into networking problems than fighting shell scripts, perl and Makefiles).

All that, plus a lot of hallway conversations, long lunches, and evening activities that went on possibly a little later than they should have, led to me almost losing my voice today.
Really good use of time though. I had fun, and it’s always good to catch up with various people.