Friday, April 18, 2008

Measuring performance of Linux Kernel likely() and unlikely()

A little while back I wrote about how prominent likely() and unlikely() are in the Linux kernel, and yet I could not find any performance measurements linked to them.

Today I made some measurements myself.

But first a quick review: likely() and unlikely() are just macros for gcc's __builtin_expect(), which tells the compiler which way a branch is expected to go so that it can generate code that suits the target architecture's branch prediction scheme, typically by keeping the expected path on the straight-line route. The GCC documentation warns against using this by hand too often:

"You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect."

The kernel certainly makes liberal use of it. According to LXR, 2.6.24 had 1608 uses of likely and 2075 uses of unlikely in the code. LXR didn't have an index of the just-released 2.6.25 yet, but I'd bet it is likely to be more now.
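
For reference, here is a minimal sketch of how the two macros wrap __builtin_expect(). The definitions follow the style of the kernel's include/linux/compiler.h as I understand it; the function check_len() and its return values are hypothetical, made up purely to show what a typical call site looks like.

```c
#include <stddef.h>

/* Hint macros in the style of the kernel's include/linux/compiler.h.
 * The !!(x) collapses any truthy value to exactly 1 so that it can be
 * compared against the expected value given to __builtin_expect(). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical call site: the error path is marked as the rare case,
 * so the compiler is free to move it off the straight-line path. */
int check_len(size_t len, size_t max)
{
    if (unlikely(len > max))
        return -1;   /* rare: reject oversized input */
    return 0;        /* common: fall through */
}
```

The hint only influences how the code is laid out, never what it computes: check_len(4, 8) returns 0 and check_len(9, 8) returns -1 with or without it.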

My methodology was simple: I chose several benchmarks commonly used in kernel land and ran them against vanilla 2.6.25 and also against a copy I called "notlikely", which simply had the macros nullified using this piece of programming genius:
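
A minimal sketch of what that kind of nullification could look like, assuming the stock definitions are the __builtin_expect() wrappers from include/linux/compiler.h. Treat it as an illustration of the idea rather than the exact change applied to the notlikely tree; is_error() is a hypothetical function added only so the snippet compiles on its own in user space.

```c
/* Stock-style definitions, shown for comparison only:
 *   #define likely(x)   __builtin_expect(!!(x), 1)
 *   #define unlikely(x) __builtin_expect(!!(x), 0)
 */

/* Nullified versions: the condition is still evaluated, but no
 * prediction hint is passed on to the compiler. */
#define likely(x)   (x)
#define unlikely(x) (x)

/* Hypothetical call site; it behaves the same as with the hinted macros. */
int is_error(int rc)
{
    if (unlikely(rc < 0))
        return 1;
    return 0;
}
```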




The tests I ran were lmbench, netperf, bonnie++, and the famous "how fast can I compile the kernel?" test.

The test hardware was an all 64-bit setup on a 2.6GHz Core 2 Duo with 2GB of RAM and a SATA disk. Pretty standard desktop hardware.

The Core 2 architecture has a pretty fine internal branch prediction engine without the help of these external hints. But with such extensive use of the macros (3500+ times!), I expected to see some difference shown by the numbers.

But I didn't see any measurable difference. Not at all.

Not a single one of those tests showed anything that I wouldn't consider overlapping noise. I had 3 data points for each test on each kernel (6 points per test), and each test had several different facets. Out of dozens of different facets, there wasn't a single criterion where the measurement was always better or worse on one kernel.
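
The "overlapping noise" criterion can be written down as a small check: take the min-to-max range of the runs on each kernel and see whether the two ranges intersect. The sketch below is one simple way to express that and is not a proper statistical test; the names range_of() and ranges_overlap() are made up here for illustration.

```c
#include <stddef.h>

/* Smallest and largest value seen across a set of benchmark runs. */
struct range {
    double lo;
    double hi;
};

/* Compute the min/max range of n samples (n must be at least 1). */
struct range range_of(const double *samples, size_t n)
{
    struct range r = { samples[0], samples[0] };
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < r.lo)
            r.lo = samples[i];
        if (samples[i] > r.hi)
            r.hi = samples[i];
    }
    return r;
}

/* Returns 1 if the two ranges intersect, meaning neither kernel's runs
 * are always better or worse than the other's; returns 0 otherwise. */
int ranges_overlap(struct range a, struct range b)
{
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

Under a check like this, a facet is only interesting when ranges_overlap() returns 0 for the two kernels, since that is the case where one kernel was always ahead.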

And this disappoints me. Because I like micro-optimizations, damn it! And in general this one seems to be a waste of time, other than the nice self-documenting code it produces. Perhaps the gcc advice is correct. Perhaps the Core 2 is so good that this doesn't matter. Perhaps there is a really compelling benchmark that I'm just not running.

I say it is a waste in general because I am sure there are specific circumstances and code paths where this makes a measurable difference. There certainly must be a benchmark that can show it, but none of these broad-based benchmarks were able to show anything useful. That doesn't mean the macros are overused, and they seem harmless enough, but it probably isn't worth thinking too hard about them either.

hmm.