Testing the leap second code. Again.

Judging by the number of searches hitting my blog these last few weeks, some people seem to be concerned about the leap second event at the end of the month. The common search pattern is some variation of “how do I test for leap second bug”.

The code in question has been given some scrutiny over recent weeks, and some fixes got merged..

fad0c66c4bb836d57a5f125ecd38bed653ca863a timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
dd48d708ff3e917f6d6b6c2b696c3f18c019feed ntp: Correct TAI offset during leap second
6b43ae8a619d17c4935c3320d2ef9e92bdeed05d ntp: Fix leap-second hrtimer livelock

Just to put my mind at rest, I booted a kernel with this patch..

diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 70b33ab..5f4a1f0 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -396,6 +396,13 @@ int second_overflow(unsigned long secs)

 	spin_lock_irqsave(&ntp_lock, flags);
 
+	if (time_state == TIME_OK) {
+		if (secs % 86400 == 0) {
+			printk("Faking leap second.\n");
+			time_state = TIME_INS;
+		}
+	}
+
 	/*
 	 * Leap second processing. If in leap-insert state at the end of the
 	 * day, the system clock is set back one second; if in leap-delete

Which futzes with the leap second state machine to make it think it’s Jun 30th, and it’s just been told to insert a leap second.
At midnight UTC, things were pretty boring..

Jun 14 19:59:59 kernel: [68242.572516] Faking leap second.
Jun 14 19:59:59 kernel: [68242.573379] Clock: inserting leap second 23:59:60 UTC

And then things just kept running as normal. As they should. So hopefully, this time we’ll have an uneventful leap second, unlike the past few.
(as long as you’re running a recent-ish kernel that is).
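
For anyone who wants to poke at that state machine without booting a patched kernel, here is a rough userspace model of it. This is a sketch under assumptions, not the kernel code: the struct, the function name model_second_overflow and the fake_leap flag are made up for illustration, the state names only mirror the kernel's, and the transitions are a simplified reading of the leap-insert path.

```c
#include <stdio.h>

/* Toy model only: names mirror the kernel's, values are local to this sketch. */
enum leap_state { TIME_OK, TIME_INS, TIME_OOP, TIME_WAIT };

struct leap_model {
	enum leap_state state;
	int fake_leap;	/* nonzero: apply the debugging hack from the patch */
};

/*
 * Call once per second with UTC seconds since the epoch.
 * Returns -1 when the clock should be set back one second
 * (the inserted 23:59:60), 0 otherwise.
 */
int model_second_overflow(struct leap_model *m, unsigned long secs)
{
	int leap = 0;

	/* The hack: on any UTC day boundary, pretend an insert was requested. */
	if (m->fake_leap && m->state == TIME_OK && secs % 86400 == 0)
		m->state = TIME_INS;

	switch (m->state) {
	case TIME_INS:
		if (secs % 86400 == 0) {
			leap = -1;
			m->state = TIME_OOP;
			printf("Clock: inserting leap second 23:59:60 UTC\n");
		}
		break;
	case TIME_OOP:
		/* The leap second itself has passed. */
		m->state = TIME_WAIT;
		break;
	case TIME_WAIT:
		/* Simplification: the model has no pending request to wait on. */
		m->state = TIME_OK;
		break;
	case TIME_OK:
		break;
	}
	return leap;
}
```

Feeding it 86399, 86400 and 86401 with fake_leap set walks it through TIME_OK, TIME_OOP and TIME_WAIT, with the -1 adjustment returned only at the day boundary.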

Update: Richard Cochran has a test program here, which doesn’t involve hacking the kernel code.

Last post on leap seconds.

I thought I was done with this. Then, today I saw this. To the best of my knowledge, Fedora 8 didn’t suffer from the bug I originally described several posts ago. I think this one happening at nearly midnight UTC is coincidence.

There’s a “me too” in the comments, but it seems odd that two people on Slashdot saw it, yet we never heard a peep on the Fedora mailing lists, in bugzilla, or even upstream at kernel.org. It could just be coincidence; the story is unsurprisingly short on details. I guess Slashdot stories are easier to write than bug reports. But without additional debugging info we won’t ever know. Bear in mind that the last time we saw a crash of this nature, it didn’t affect everyone either.

It was only by chance that I managed to catch the backtrace in the ’06 crash. I actually had two locked up machines, but one had its screen blanked, and wouldn’t unblank. The other machine had blanking disabled (setterm -blank 0) and, thankfully, had also been set up to use a VGA screen resolution, so it had plenty of lines to display the whole backtrace.

Update: a problem has been found, and fixed.

More on leap seconds.

Jesse Keating made a comment in my previous post on leap seconds, which I thought was worth highlighting in another post, for the benefit of those who don’t read the comments.

This is why rarely executed codepaths suck. Whilst it is tempting to gloat over another Microsoft failure, this could easily have been any other OS. I already mentioned that Linux had suffered something similar once. A bug like this in consumer devices is nightmarish, but imagine if such a bug ended up in something more critical? “Sorry, your life support system went offline because there was a leap second”. In safety critical systems, rare codepaths are kind of terrifying.

Writing test cases for bugs like this is also not particularly fun. You’d have to have a fake ntp server for testing the rare case.
Now think about all the other potential ‘only runs once every blue moon’ codepaths in your apps, and imagine the effort required to write test plans for all of them. Not impossible, but certainly a lot of potential job security there for QA folks. Just like fuzz-testing, traditional coverage-testing by just running common workloads isn’t the panacea of testing when there are variables outside your control.

What’s still puzzling to me though.. The Zunes died several hours before 00:00:00 UTC.
Quirk of MSFT’s ntp implementation I guess. *shrug*

Leap seconds.

Tonight, a leap second will occur. After 23:59:59, we have 23:59:60 before rolling over to 00:00:00. Most people won’t even notice. Most electronic devices won’t notice. Those unaware of the event (like the clock on my microwave oven) will end up a second fast. (Not that it really matters: it doesn’t display seconds in its clock, and I surely wasn’t second-accurate when I set it.)
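
To make that rollover concrete, here is a small sketch of how a leap-aware clock might label the seconds of such a day. The function name format_utc_second and the convention of passing second-of-day 86400 for the inserted second are assumptions made for this example, not anything taken from a real library.

```c
#include <stdio.h>

/*
 * Format a second-of-day count as HH:MM:SS into buf.
 * 0..86399 are the ordinary seconds of the day; 86400 is the inserted
 * leap second, shown as 23:59:60. Only meaningful on a day that
 * actually has a leap second. Returns 0 on success, -1 if out of range.
 */
int format_utc_second(unsigned int sec_of_day, char *buf, size_t len)
{
	if (sec_of_day > 86400)
		return -1;

	if (sec_of_day == 86400) {
		snprintf(buf, len, "23:59:60");
		return 0;
	}

	snprintf(buf, len, "%02u:%02u:%02u",
		 sec_of_day / 3600, (sec_of_day / 60) % 60, sec_of_day % 60);
	return 0;
}
```

A clock that doesn’t know about the event never sees 86400: it wraps straight from 86399 to 0, which is where the one second difference comes from.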

Of slightly more concern are the more clever devices. The devices that are aware of leap seconds know when to insert one. On these internet connected devices, ntpd tells the kernel “insert or deduct a second” as necessary.
This all sounds fairly benign, but it has been known to be problematic. For reasons I’m not entirely sure of, ntp still calls into the kernel twice a year, regardless of whether a leap second is inserted or not. So, twice a year, we end up in different code paths that we don’t execute the rest of the year.
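
On Linux, the “insert or deduct a second” request described above goes through the adjtimex(2) system call (ntp_adjtime() is the portable spelling). What follows is a minimal sketch of that interface, not what ntpd itself does: the helper names query_leap_state, leap_request_status and request_leap are invented for this example. Reading the state needs no privileges, but setting it needs root (CAP_SYS_TIME) and really will schedule a leap second, so only try that on a box you don’t care about.

```c
#include <string.h>
#include <sys/timex.h>

/*
 * Read the kernel's clock state without changing anything.
 * Returns TIME_OK, TIME_INS, TIME_DEL, TIME_OOP, TIME_WAIT or
 * TIME_ERROR, or -1 on failure. Status bits go to *status_out if non-NULL.
 */
int query_leap_state(int *status_out)
{
	struct timex tx;

	memset(&tx, 0, sizeof tx);	/* modes == 0 means read-only */
	int state = adjtimex(&tx);
	if (state >= 0 && status_out)
		*status_out = tx.status;
	return state;
}

/* Compute new status bits: exactly one of STA_INS / STA_DEL set. */
int leap_request_status(int status, int insert)
{
	status &= ~(STA_INS | STA_DEL);
	return status | (insert ? STA_INS : STA_DEL);
}

/*
 * Ask the kernel to insert (insert != 0) or delete a leap second at the
 * end of the current UTC day. Returns the adjtimex() result, -1 on error.
 */
int request_leap(int insert)
{
	struct timex tx;

	memset(&tx, 0, sizeof tx);
	if (adjtimex(&tx) < 0)		/* fetch the current status bits first */
		return -1;

	int status = leap_request_status(tx.status, insert);

	memset(&tx, 0, sizeof tx);
	tx.modes = ADJ_STATUS;
	tx.status = status;
	return adjtimex(&tx);
}
```

Keeping the bit manipulation in its own function means that part can be checked without touching the real clock.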

Whilst I was travelling in June 2006, I noticed I couldn’t get at my email. A week passed before I found out on returning home that the kernel had oopsed in that code path. There was no leap second in June that year. Nor has there been in any year this decade. Thankfully, that particular oops was only fatal if you were running a build with certain debugging CONFIG options turned on (I was), so the vast majority of users never saw a problem. Here’s the fix that went into 2.6.22 for this bug.

The very few that did see the problem (I don’t recall anyone else mentioning it when I posted to lkml) likely just rebooted, with the “if it happens again, I’ll report it” mindset, which of course, it didn’t..

Hopefully at midnight, all will be well and that code will just do its thing with no dire consequences :-)

Update: anti-climax, just as we like it :-)

Dec 31 18:59:59 localhost kernel: Clock: inserting leap second 23:59:60 UTC