Monday, March 10, 2008

The Everything Old Is New Again Meme

I've been struck lately by the dismissive comments in technology reviews along the lines of "of course this is really just a new skin on a really old idea". They are particularly striking, because the reviews are essentially overall positive about the technology they just dismissed as a rehash! If they really were just recycled implementations of old worn ideas, then why are we so excited now - why waste the bits talking about them?

I'm left thinking that there just isn't enough respect for a quality implementation that takes into account the real needs of the current market. To some, it is all about the idea. I'm the last to diss a great idea, but let's face it - folks who think the idea is everything tend to overlook various little bits of reality that don't mesh well with their idea (we call those implementation details) and in general just commoditize the implementation. Ideas like this make great conversations but crappy products.

The truth is these next generation technologies are usually quality implementation with some of the substantial "implementaion details" overcome. To trivialize those details is a mistake - the difficulty of a quality implementation is often overlooked by folks who think the main idea is everything. Often they require some real innovative thinking on their own. Anybody who has taken a tour of duty in one (or two or three) startups will tell you that neither idea nor execution are to be taken for granted - this is hard stuff when you're blazing new trails.

A couple common examples:
Thinking about just these two examples, it isn't hard to see why they are breaking through now instead of the earlier incarnations made reference to. While market conditions and external factors have changed, it isn't simply that their train has arrived and they were dusted off for the occasion - real work has made them better.

Tuesday, March 4, 2008

Calgary IOMMU - At What Price?

The Calgary IOMMU is a feature of most IBM X-Series (i.e. X86_64) blades and motherboards. If you aren't familiar with an IOMMU, it is strongly analogous to a regular MMU but applied to a DMA context. Their original primary use was for integrating 32 bit hardware into 64 bit systems. But another promising use for them is enforcing safe access in the same way an MMU can.

In normal userspace, if a program goes off the rails and accesses some memory it does not have permissions for a simple exception can be raised. This keeps the carnage restricted to the application that made the mistake. But if a hardware device does the same thing through DMA, whole system disaster is sure to follow as nothing prevents the accesses from happening. The IOMMU can provide that safety.

An IOMMU unit lets the kernel setup mappings much like normal memory page tables. Normal RAM mappings are cached with TLB entries, and IOMMU maps are cached TCE entries that play largely the same role.

By now, I've probably succeeded in rehashing what you already knew. At least it was just three paragraphs (well, now four).

The pertinent bit from a characterization standpoint is a paper from the 2007 Ottawa Linux Symposium. In The Price of Safety: Evaluating IOMMU Performance Muli Ben-Yehuda of IBM and some co-authors from Intel and AMD do some measurements using the Calgary IOMMU, as well as the DART (which generally comes on Power based planers).

I love measurements! And it takes guts to post measurements like this - in its current incarnation on Linux the cost of safety from the IOMMU is a 30% to 60% increase in CPU! Gah!

Some drill down is required, and it turns out this is among the worst cases to measure. But still - 30 to 60%! The paper is short and well written, you should read it for yourself - but I will summarize the test more or less as "measure the CPU utilization while doing 1 Gbps of netperf network traffic - measure with and without iommu". The tests are also done with and without Xen, as IOMMU techniques are especially interesting to virtualization, but the basic takeaways are the same in virtualized or bare metal environments.

The "Why so Bad" conclusion is management of the TCE. The IOMMU, unlike the TLB cache of an MMU, only allows software to remove entries via a "flush it all" instruction. I have certainly measured that when TLBs need to be cleared during process switching that can be a very measurable event on overall system performance - it is one reason while an application broken into N threads runs faster than the same application broken into N processes.

But overall, this is actually an encouraging conclusion - hardware may certainly evolve to give more granular access to the TCE tables. And there are games that can be played on the management side in software that can reduce the number of flushes in return for giving up some of the safety guarantees.

Something to be watched.