The paper is Dynamo: Amazon’s Highly Available Key-value Store, by a number of good folks over at Amazon.com (including the CTO Werner Vogels). It was published in the peer reviewed SOPS '07.
The bulk of the paper is about Dynamo, which apparently is a home grown key-value storage system that has a lot of inherent replication, scalability, and reliability. A good read.
Instead of the big picture I wanted to focus on a detail. The paper rejects the notion of defining SLAs with median expectations. Instead, Dynamo uses 99.9 percentiles. (It also rejects using means, but that is pretty nonsensical anyhow, isn't it?). The central idea is that the SLA defines acceptable usage for all users - not just for half of them (the median).
This matters in real life in a very real way. There is normally one version of an operation and then an ala carte menu of choices they might layer on that.. lots of folks use the vanilla version, but if 10, 20, 30, or even 40 percent of users are using some features that require extra processing - they are totally left out of the SLA. The canonical Amazon case is a user with a very long purchasing history - an important case but not the average case. 99.9 installs a threshold for a reasonable definition of "everybody" while still leaving room for the occasional pathological case. The authors point out that for more money you can have more nines, but the law of diminishing returns certainly applies.
You have to wonder if after today's S3 outage they wish they had bought another nine or two ;) (I have no reason to think that had anything to do with Dynamo - I'm just poking fun - what Amazon has built for S3/EC2 is very impressive)
I was happily nodding along as I read the paper when this came up:
Dynamo is built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred millisecondsMy initial reaction was not charitable: Whoa - that's not really setting the bar very high, is it guys? Hundreds of millis to read and write a key/value pair? You say yourself in the introductory pages that these services are often layered on top of each other! Sure, it is more latency sensitive than an overnight data warehouse operation, but that's hardly an impressive responsiveness threshold.
But, I was too harsh. I hadn't internalized what shifting from the 50th percentile to the 99.9th really meant. The value isn't representative of what the typical user will see - it is representative of the worst you can stomach. In effect, it loses its marketing value - which is how a lot of SLAs are used in the real world.
The Dynamo paper backs this up. Figure 4 shows both the average and 99.9 percentiles for both reads and writes. average reads ran around 15ms, writes around 25. 99.9 percentiles were respectively in the 150 and 250 neighborhoods. This all makes much more sense, especially when dealing with disk drives.
In the end, I think you need more than one datapoint in order to make an effective characterization. Amazon clearly wouldn't be happy if everybody was seeing 150ms latencies - though they can stomach it if literally it is a one in a thousand kind of occurrence.
Maybe SLAs should be expressed as 4 tuples of 25/50/75/99.9 .. I've developed benchmarks that way and felt that helped keep the subsequent optimizations honest.
Even 90/99.9 is mostly what you need to keep a cap on the outliers while still getting a feel for what somebody is likely to see.