Friday, April 18, 2008

Measuring performance of Linux Kernel likely() and unlikely()

A little while back I wrote about how prominent likely() and unlikely() are in the Linux kernel, and yet I could not find any performance measurements linked to them.

Today I made some measurements myself.

But first a quick review: likely() and unlikely() are just macros for gcc's __builtin_expect(), which in turn lets the compiler arrange the generated code to suit the target architecture's branch prediction scheme. The GCC documentation warns against using this manually too often:

You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
The kernel certainly makes liberal use of it. According to LXR, 2.6.24 had 1608 uses of likely() and 2075 uses of unlikely() in the code. LXR didn't have an index of the just-released 2.6.25 yet - but I'd bet the count is likely to be higher now.

My methodology was simple: I chose several benchmarks commonly used in kernel land and ran them against vanilla 2.6.25 and against a copy I called "notlikely", which simply had the macros nullified using this piece of programming genius:
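
Roughly the following - the stock definitions live in include/linux/compiler.h; this is a sketch of the override rather than the exact patch:

    /* stock kernel definitions (include/linux/compiler.h) */
    #define likely(x)    __builtin_expect(!!(x), 1)
    #define unlikely(x)  __builtin_expect(!!(x), 0)

    /* "notlikely" kernel: same macros with the hint stripped out */
    #define likely(x)    (x)
    #define unlikely(x)  (x)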

The tests I ran were lmbench, netperf, bonnie++, and the famous "how fast can I compile the kernel?" test.

The test hardware was an all-64-bit setup: a 2.6GHz Core 2 Duo with 2GB of RAM and a SATA disk. Pretty standard desktop hardware.

The Core 2 architecture has a pretty fine internal branch prediction engine even without the help of these external hints. But with such extensive use of the macros (3500+ times!), I expected the numbers to show some difference.

But I didn't see any measurable difference. Not at all.

Not a single one of those tests showed anything that I wouldn't consider overlapping noise. I had 3 data points for each test on each kernel (6 points per test), and each test had several different facets. Out of dozens of facets, there wasn't a single criterion where the measurement was consistently better or worse on one kernel.

And this disappoints me. Because I like micro-optimizations, damn it! And in general this one seems to be a waste of time, other than the nice self-documenting code it produces. Perhaps the gcc advice is correct. Perhaps the Core 2 is so good that this doesn't matter. Perhaps there is a really compelling benchmark that I'm just not running.

I say it is a waste only in general, because I am sure there are specific circumstances and code paths where this makes a measurable difference. There must be a benchmark that can show it - but none of these broad-based benchmarks showed anything useful. That doesn't mean the macros are overused - they seem harmless enough - but it probably isn't worth thinking too hard about them either.

hmm.

Monday, April 14, 2008

Monitoring IP changes with NETLINK ROUTE sockets

Yesterday I read a mailing list query asking how to get event-driven notification of Linux IP address changes (i.e. without having to poll for them).

I agreed with the attitude of the post. The most important thing about scaling is to make sure the work your code does is proportional to the real event stream. Seems obvious enough, but lots of algorithms screw up that basic premise.

Any time-based polling algorithm betrays this scaling philosophy, because work is done every tick regardless of whether there are events to process. You're always either doing unnecessary work or adding latency to real work by waiting for the next tick. The select() and poll() APIs betray it too, as their cost is proportional to the amount of potential work (the number of file descriptors) instead of the amount of real work (the number of active descriptors) - epoll is a better answer there.
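
As a sketch of the difference (illustrative only - listen_fd is assumed to be a listening socket set up elsewhere):

    /* each wake-up costs work proportional to the number of *ready*
     * descriptors, not the total number being watched */
    #include <sys/epoll.h>

    void event_loop(int listen_fd)
    {
        struct epoll_event ev, ready[64];
        int epfd = epoll_create(64);   /* the size argument is just a hint */

        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            /* blocks until there is real work; returns only the active fds */
            int n = epoll_wait(epfd, ready, 64, -1);
            for (int i = 0; i < n; i++) {
                /* accept / read / dispatch ready[i].data.fd here */
            }
        }
    }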

Event driven is the way to go.

Anyhow, back to the original poster. I knew netlink route sockets could do this - and I had used them in the past for similar purposes. I had to get "man 7 netlink" and Google going to cobble together an example, and only then did I realize how difficult it is to get started with netlink - it just has not been very widely used or documented.

So the point of this post is to provide a little Google-juice documentation for event-driven monitoring of new IPv4 addresses using netlink. At least this post has a full example - the code is below.

If you need to use this functionality, I recommend sections 3 and 7 of the netlink and rtnetlink man pages, then go read the included header files and use my sample as a guide. In this basic way you can get address adds and removals, link state changes, route changes, interface changes, etc.. lots of good stuff. It is at the heart of the iproute tools (ip, ss, etc..) as well as most of the userspace routing software (zebra, xorp, vyatta, etc..).
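
Here is a minimal sketch of the essential plumbing - bind a NETLINK_ROUTE socket to the RTMGRP_IPV4_IFADDR multicast group and decode RTM_NEWADDR messages as they arrive (a sketch of the idea, trimmed down to address additions only):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/rtnetlink.h>

    int main(void)
    {
        char buf[8192];
        struct sockaddr_nl addr;
        int sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

        if (sock < 0) {
            perror("socket");
            return 1;
        }

        memset(&addr, 0, sizeof(addr));
        addr.nl_family = AF_NETLINK;
        addr.nl_groups = RTMGRP_IPV4_IFADDR;   /* IPv4 address change events */
        if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }

        for (;;) {
            int len = recv(sock, buf, sizeof(buf), 0);
            struct nlmsghdr *nlh;

            if (len <= 0)
                break;

            for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len);
                 nlh = NLMSG_NEXT(nlh, len)) {
                if (nlh->nlmsg_type != RTM_NEWADDR)
                    continue;

                struct ifaddrmsg *ifa = NLMSG_DATA(nlh);
                struct rtattr *rta = IFA_RTA(ifa);
                int rtl = IFA_PAYLOAD(nlh);

                for (; RTA_OK(rta, rtl); rta = RTA_NEXT(rta, rtl)) {
                    if (rta->rta_type == IFA_LOCAL) {
                        char ip[INET_ADDRSTRLEN];
                        inet_ntop(AF_INET, RTA_DATA(rta), ip, sizeof(ip));
                        printf("new address %s on ifindex %u\n",
                               ip, ifa->ifa_index);
                    }
                }
            }
        }
        close(sock);
        return 0;
    }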

Tuesday, April 8, 2008

IP Georeferencing

IP Georeferencing is a pretty cool toolbox item on today's web. Simply put, it's the process of taking an IP address and converting it to some geographical (city, lat, long, whatever) information.

It gets commonly used for security filters, log analysis, ad targeting, community building, etc..

Maxmind deserves a shout-out for making databases to do this available. There is nothing inherent about an IP number that gives you this information, so they need to build out-of-band databases to get the job done. They offer two versions of the databases: one free, and one a bit more accurate that is priced very reasonably.

The databases come with a number of different libraries for using them: C, Java, PHP, etc.. The libraries are released under the LGPL.

Recently I was doing a project that needed to look up scads and scads of addresses, so I put a little muscle into improving the lookup routines in the C code.

I'm happy to say I was able to improve overall lookup throughput anywhere from 2x to 8x, depending on what exactly was being looked up. There were a bunch of changes, but the primary one was the addition of a variable-length radix lookup mechanism that cut the average number of comparisons from 27 to 4 - not rocket science, but the right tool for the job.

I'm even more happy to say I sent those back as LGPL contributions of my own. The code is on my open source contribution web page.
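
To give a flavor of the idea, here is a toy sketch of a wide-first-stride lookup - not the contributed code, just the shape of the technique (a leading 16-bit stride resolves most of the tree in one step, then individual bits finish the walk):

    #include <stdint.h>

    struct node {
        struct node *child[2];
        int leaf_value;                 /* >= 0 when this node holds a record */
    };

    struct radix_db {
        /* the first 16 bits of the IP index directly into subtrees */
        struct node *top[1 << 16];
    };

    static int lookup(const struct radix_db *db, uint32_t ip)
    {
        const struct node *n = db->top[ip >> 16];
        int bit = 15;                   /* bits 15..0 remain to be walked */

        while (n) {
            if (n->leaf_value >= 0)
                return n->leaf_value;   /* found a record */
            if (bit < 0)
                break;
            n = n->child[(ip >> bit) & 1];
            bit--;
        }
        return -1;                      /* no match */
    }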

Saturday, April 5, 2008

Linux Selective Acknowledgment (SACK) CPU Overhead

Last year I tossed an e-mail from the Linux kernel networking list in my "projtodo" folder.

The mail talked about how the Linux TCP stack in particular, and probably all TCP stacks in general, had an excessive-CPU attack exposure when confronted with malicious SACK options. I found the mail intriguing but unsatisfying. It was well-informed speculation, but it didn't have any hard data, nor was there any simple way to gather some. Readers of the other posts on this blog will know I really dig measurements. The issue at hand was pretty obviously a problem - but how much of one?

A few weeks ago I had the chance to develop some testing code and find out for myself - and IBM DeveloperWorks has published the summary of my little project. The executive summary is "it's kinda bad, but not a disaster, and hope is on the way". There is hard data and some pretty pictures in the article itself.

The coolest part of the whole endeavor, other than scratching the "I wonder" itch, was getting to conjure up a userspace TCP stack from raw sockets. It was, of course, woefully incomplete as it was just meant to trigger a certain behavior in its peer instead of being generally useful or reliable - but nonetheless entertaining.
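
For the curious, the starting point is nothing more exotic than a raw socket - a sketch of the opening move, not the actual harness:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int on = 1;
        /* raw socket carrying hand-built TCP segments (requires root) */
        int s = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);

        if (s < 0) {
            perror("socket");
            return 1;
        }
        /* we supply the IP header ourselves as well */
        if (setsockopt(s, IPPROTO_IP, IP_HDRINCL, &on, sizeof(on)) < 0) {
            perror("setsockopt");
            return 1;
        }
        /* from here the harness crafts its own segments, including
         * whatever SACK option blocks it wants the peer to chew on */
        return 0;
    }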

Thursday, April 3, 2008

Linux Kernel - likely() not measured?

The other day on kernelnewbies, the able Robert Day wondered whether or not anyone had quantified the effects of the likely() and unlikely() macros scattered all over the Linux kernel.

He received no fewer than 5 replies telling him how the macros worked (if you are curious, this is the best explanation) - but not a single measurement or other characterization.

LXR shows over 3500 uses of those macros in the kernel, and nobody has any data for any scenario? Wowzers.

Doing before/after benchmarks with those macros changed to no-ops would be an interesting project. One could use the usual suspects of Linux kernel performance testing (lmbench, a kernel compile, some kind of network load generator, etc..).

Comments with pointers to data would be cool.

Monday, March 10, 2008

The Everything Old Is New Again Meme

I've been struck lately by the dismissive comments in technology reviews along the lines of "of course this is really just a new skin on a really old idea". They are particularly striking because the reviews are essentially positive overall about the technology they just dismissed as a rehash! If these things really were just recycled implementations of worn old ideas, then why are we so excited now - why waste the bits talking about them?

I'm left thinking that there just isn't enough respect for a quality implementation that takes into account the real needs of the current market. To some, it is all about the idea. I'm the last to diss a great idea, but let's face it - folks who think the idea is everything tend to overlook various little bits of reality that don't mesh well with their idea (we call those implementation details) and in general just commoditize the implementation. Ideas like this make great conversations but crappy products.

The truth is these next-generation technologies are usually quality implementations with some of the substantial "implementation details" overcome. To trivialize those details is a mistake - the difficulty of a quality implementation is often overlooked by folks who think the main idea is everything, and often the details require some real innovative thinking of their own. Anybody who has taken a tour of duty in one (or two or three) startups will tell you that neither idea nor execution is to be taken for granted - this is hard stuff when you're blazing new trails.

A couple common examples:
Thinking about just these two examples, it isn't hard to see why they are breaking through now when their earlier incarnations did not. While market conditions and external factors have changed, it isn't simply that their train has arrived and they were dusted off for the occasion - real work has made them better.

Tuesday, March 4, 2008

Calgary IOMMU - At What Price?

The Calgary IOMMU is a feature of most IBM X-Series (i.e. x86_64) blades and motherboards. If you aren't familiar with IOMMUs, they are strongly analogous to a regular MMU but applied to DMA. Their original primary use was integrating 32-bit hardware into 64-bit systems, but another promising use is enforcing safe access in the same way an MMU can.

In normal userspace, if a program goes off the rails and accesses memory it does not have permission for, a simple exception can be raised. This keeps the carnage restricted to the application that made the mistake. But if a hardware device does the same thing through DMA, whole-system disaster is sure to follow, as nothing prevents the accesses from happening. The IOMMU can provide that safety.

An IOMMU lets the kernel set up mappings much like normal memory page tables. Normal RAM mappings are cached in TLB entries, and IOMMU mappings are cached in TCEs (translation control entries) that play largely the same role.

By now, I've probably succeeded in rehashing what you already knew. At least it was just three paragraphs (well, now four).

The pertinent bit from a characterization standpoint is a paper from the 2007 Ottawa Linux Symposium. In The Price of Safety: Evaluating IOMMU Performance, Muli Ben-Yehuda of IBM and co-authors from Intel and AMD make measurements using the Calgary IOMMU as well as the DART (which generally comes on Power-based planars).

I love measurements! And it takes guts to post measurements like this - in its current incarnation on Linux the cost of safety from the IOMMU is a 30% to 60% increase in CPU! Gah!

Some drill-down is required, and it turns out this is among the worst cases to measure. But still - 30 to 60%! The paper is short and well written; you should read it for yourself, but I will summarize the test more or less as "measure the CPU utilization while doing 1 Gbps of netperf network traffic - once with the IOMMU and once without". The tests are also done with and without Xen, as IOMMU techniques are especially interesting for virtualization, but the basic takeaways are the same in virtualized and bare-metal environments.

The "why so bad" conclusion is management of the TCEs. The IOMMU, unlike the TLB cache of an MMU, only allows software to remove entries via a "flush it all" instruction. I have certainly measured that clearing TLBs during process switching can be a very measurable event for overall system performance - it is one reason why an application broken into N threads runs faster than the same application broken into N processes.
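
To put that in driver terms, here is a sketch against the generic Linux DMA mapping API (not Calgary-specific code) of where those entries get created and torn down:

    #include <linux/dma-mapping.h>

    /* with an IOMMU enabled, mapping installs a translation (a TCE on
     * Calgary) so the device may only touch [handle, handle + len) */
    static dma_addr_t map_rx_buffer(struct device *dev, void *buf, size_t len)
    {
        return dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    }

    /* tearing the translation back down is where the costly global
     * TCE flush comes into play */
    static void unmap_rx_buffer(struct device *dev, dma_addr_t handle, size_t len)
    {
        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
    }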

But overall, this is actually an encouraging conclusion - hardware may certainly evolve to give more granular access to the TCE tables. And there are games that can be played on the management side in software that can reduce the number of flushes in return for giving up some of the safety guarantees.

Something to be watched.

Thursday, February 28, 2008

Appliances and Hybrids

In my last post, which was about the fab book Network Algorithmics, I mentioned intelligent in-network appliances. I also mentioned that "the world doesn't need to be all hardware or all software".

I believe this to be an essential point - blending rapidly improving commodity hardware with just a touch of custom ASIC/FPGA components, glued together by a low-level, architecture-aware systems programming approach, makes unbelievably good products that can be built and produced relatively inexpensively.

These appliances and hybrids are rapidly showing up everywhere - Tony Bourke of Load Balancing Digest makes a similar point (why choose between ASIC and software?) in a post today. Read it.

Wednesday, February 27, 2008

Network Algorithmics

In late 2004 George Varghese published an amazing book on the design and implementation of networking projects: Network Algorithmics

You won't find the boilerplate "this is IP, this is TCP, this is layer blah" crapola here. What makes this book special is that it describes the way the industry really builds high-end switches, routers, high-end server software, and an ever-growing array of intelligent in-network appliances. The way that is really done is very different from what a classic networking textbook presents.

The material is not algorithms so much as a way of looking at things and breaking down abstraction barriers through techniques like zero copy, tag hinting, lazy evaluation, etc.. It makes the point that layers are how you describe protocols - not how you want to implement them. The world doesn't have to be all hardware or all software; this book helps train your mind to write systems that harmonize the two by taking the time to really understand the architecture.

To me, this book is a fundamental bookshelf item. You don't hear it often mentioned with the likes of Stevens, Tanenbaum, and Cormen - but amongst folks in the know, it is always there. More folks ought to know.

Friday, February 15, 2008

SLA as medians, percentiles, or averages - as told by Amazon's Dynamo

I am going to recommend another published paper because of how it talks about characterizing its own performance. That's not the point of the paper, but I found it really interesting anyhow.

The paper is Dynamo: Amazon’s Highly Available Key-value Store, by a number of good folks over at Amazon.com (including the CTO, Werner Vogels). It was published in the peer-reviewed SOSP '07.

The bulk of the paper is about Dynamo, which apparently is a home-grown key-value storage system with a lot of inherent replication, scalability, and reliability. A good read.

Instead of the big picture, I wanted to focus on a detail. The paper rejects the notion of defining SLAs around median expectations. Instead, Dynamo uses the 99.9th percentile. (It also rejects using means, but that would be pretty nonsensical anyhow, wouldn't it?) The central idea is that the SLA should define acceptable service for essentially all users - not just for half of them (the median).

This matters in real life in a very real way. There is normally one vanilla version of an operation and then an a la carte menu of choices users might layer on top of it. Lots of folks use the vanilla version, but if 10, 20, 30, or even 40 percent of users pick features that require extra processing, they are totally left out of a median-based SLA. The canonical Amazon case is a user with a very long purchasing history - an important case but not the average one. A 99.9th percentile installs a threshold for a reasonable definition of "everybody" while still leaving room for the occasional pathological case. The authors point out that for more money you can have more nines, but the law of diminishing returns certainly applies.

You have to wonder if after today's S3 outage they wish they had bought another nine or two ;) (I have no reason to think that had anything to do with Dynamo - I'm just poking fun - what Amazon has built for S3/EC2 is very impressive)

I was happily nodding along as I read the paper when this came up:
Dynamo is built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds
My initial reaction was not charitable: Whoa - that's not really setting the bar very high, is it guys? Hundreds of millis to read and write a key/value pair? You say yourself in the introductory pages that these services are often layered on top of each other! Sure, it is more latency sensitive than an overnight data warehouse operation, but that's hardly an impressive responsiveness threshold.

But, I was too harsh. I hadn't internalized what shifting from the 50th percentile to the 99.9th really meant. The value isn't representative of what the typical user will see - it is representative of the worst you can stomach. In effect, it loses its marketing value - which is how a lot of SLAs are used in the real world.

The Dynamo paper backs this up. Figure 4 shows both the average and the 99.9th percentile for both reads and writes. Average reads ran around 15ms and writes around 25ms; the 99.9th percentiles were in the 150ms and 250ms neighborhoods respectively. This all makes much more sense, especially when dealing with disk drives.

In the end, I think you need more than one data point in order to make an effective characterization. Amazon clearly wouldn't be happy if everybody were seeing 150ms latencies - though they can stomach it if it is literally a one-in-a-thousand occurrence.

Maybe SLAs should be expressed as 4-tuples of the 25th/50th/75th/99.9th percentiles. I've developed benchmarks that way and felt it helped keep the subsequent optimizations honest.

Even 90/99.9 is mostly what you need to keep a cap on the outliers while still getting a feel for what somebody is likely to see.
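
Computing such a tuple from a pile of latency samples is trivial - a toy sketch (the sample values here are made up purely for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* nearest-rank style percentile over a sorted array */
    static double percentile(const double *sorted, size_t n, double pct)
    {
        size_t idx = (size_t)(pct / 100.0 * (double)n);
        if (idx >= n)
            idx = n - 1;
        return sorted[idx];
    }

    int main(void)
    {
        double ms[] = { 12, 14, 15, 16, 18, 22, 25, 40, 150, 260 };
        size_t n = sizeof(ms) / sizeof(ms[0]);

        qsort(ms, n, sizeof(ms[0]), cmp_double);
        printf("25th=%.0fms 50th=%.0fms 75th=%.0fms 99.9th=%.0fms\n",
               percentile(ms, n, 25), percentile(ms, n, 50),
               percentile(ms, n, 75), percentile(ms, n, 99.9));
        return 0;
    }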

Thursday, January 31, 2008

AjaxScope Paper Provides Javascript Characterization

Emre Kıcıman and Benjamin Livshits from Microsoft Research present some interesting data in their SOSP paper - AjaxScope: A Platform for Remotely Monitoring the Client-Side Behavior of Web 2.0 Applications.

The paper is mostly about AjaxScope, a neat instrumentation and profiling tool for JavaScript.

What I want to highlight, though, are some of the measurements they present for both IE (6 and 7) and Firefox (2). This is real data in refereed ACM proceedings, not a Gartner-style whitepaper.

Among the interesting nuggets are IE's 35x slower performance on string concatenation operations and Firefox's 4x slower Array join execution time. The authors also put the intrinsics into context by measuring the performance of common portal pages - IE beats Firefox on msn.com, but Firefox turns the tables on Yahoo!.

Lots more interesting data, and a useful tool, in the paper. Read it.

Saturday, January 19, 2008

Web Syndication Format Market Share

For quite a while my todo list has had an instruction to find and characterize the popularity of RSS vs Atom. Which syndication format is more popular?

Atom seems on the face of it to be a better format than RSS, but some of what it addresses are not really widespread problems in practice. Market share will tell whether it was a solution looking for a real problem or not. Atom is about 2 years old, and it is pretty common to see Atom feeds available around the net now.

Measuring the breakdown among the feeds I read isn't terribly useful. I have a bias in my selections - it isn't like measuring my connectivity or transport properties, where I am a representative sample.

For the record: I have 112 feeds, 54 of them in Atom and 58 in some kind of RSS.

The best information I could find was from syndic8.com. But frankly, it wasn't very satisfying. The site didn't feel very complete, and in the end only showed essentially the ratio between RSS and Atom offerings. They listed about 1/2 a million feeds - 82% of which were some flavor of RSS.

What I want to know is the ratio between active usages (i.e. fetches) of the two formats. Lots of sites offer both formats - but which do users actually consume?

Feedburner clearly has this info - but I couldn't find it published anywhere.

Does anybody have more information?