Friday, September 23, 2011

SPDY: What I Like About You.

I've been working on implementing SPDY as an experiment in Firefox lately. We'll have to see how it plays out, but so far I really like it.

Development and benchmarking is still a work in progress, though interop seems to be complete. There are several significant to-do items left that have the potential to improve things even further. The couple of anecdotal benchmarks I have collected are broadly similar to the page load time based reports Google has shared at the IETF and velocity conf over the last few months.

tl;dr; Faster is all well and good (and I mean that!) but I'm going to make a long argument that SPDY is good for the Internet beyond faster page load times. Compared to HTTP, it is more scalable, plays nicer with other Internet traffic and brings web security forward.

SPDY: What I Like About You

#1: Infinite Parallelism with Shared Congestion Control.

You probably know that SPDY allows multiplexing of multiple HTTP resources inside one TCP stream. Unlike the related HTTP mechanisms of pipelining concurrent requests on one TCP stream, the SPDY resources can be returned in any order and even mixed together in small chunks so that head of line blocking is never an issue and you never need more than one connection to each real server. This is great for high latency environments because a resource never needs to be queued on either the client or the server for any reason other than network congestion limits.

Normal HTTP achieves transaction parallelism through parallel TCP connections. Browsers limit you to 6 parallel connections per host. Servers achieve greater parallelism by sharding their resources across a variety of host names. Often these host names are aliases for the same host, implemented explicitly to bypass the 6 connection limitation. For example, and are actually DNS CNAMEs for the same server. It is not uncommon to see performance oriented sites, like the Google properties, shard things over as many as 6 host names in order to allow 36 parallel HTTP sessions.

Parallelism is a must-have for performance. I'm looking at a trace right now that uses the afore mentioned 36 parallel HTTP sessions and its page load completes in 16.5 seconds. If I restrict it to just 1 connection per host (i.e. 6 overall), the same page takes 27.7 seconds to load. If I restrict that even further to just 1 connection in total in takes a mind numbing 94 seconds to load. And this is on 40ms RTT broadband - high latency environments such as mobile would suffer much much worse! Keep this in mind when I start saying bad things about parallel connections below, they really do great things and the web we have with them enables much more impressive applications than a single connection HTTP web ever could.

Of course using multiple parallel HTTP connections is not perfect - if they were perfect we wouldn't try to limit them to 6 at a time. There are two main problems. The first is that each connection requires a TCP handshake which incurs an extra RTT (or maybe 3 if you are using SSL) before the connection can be used. The TCP handshake is also relatively computationally hard compared to moving data (servers easily move millions of packets per second, while connection termination is generally measured in the tens of thousands), the SSL handshake even harder. Reducing the number of connections reduces this burden. But in all honesty this is becoming less of a problem over time - the cost of maintaining persistent connections is going down (which amortizes the handshake cost) and servers are getting pretty good at executing the handshakes (both SSL and vanilla) sometimes by employing the help of multi-tiered architectures for busy deployments.

The architectural problem lies in HTTP's interaction with TCP congestion control. HTTP flows are generally pretty short (a few packets per transaction), tend to stop and start a lot, and more or less play poorly with the congestion control model. The model works really well for long flows like a FTP download - that TCP stream will automatically adapt to the available bandwidth of the network and transfer at a fairly steady rate for its duration after a little bit of acclimation time. HTTP flows are generally too short to ever acclimate properly.

A SPDY flow, being the aggregation of all the parallel HTTP connections, looks to be a lot longer, busier, and more consistent than any of the individual parallel HTTP flows would be. Simply put - that makes it work better because all of that TCP congestion logic is applied to one flow instead of being repeated independently across all the parallel HTTP mini flows.

Less simply, when an idle HTTP session begins to send a response it has to guess at how much data should be put onto the wire. It does this without awareness of all the other flows. Let's say it guesses "4 packets" but there are no other active flows. In this case 4 packets is way too few and the network is under utilized and the page loads poorly. But what if 35 other flows are activated at the same time - this means 140 packets get injected into the network at the same time which is way too many. Under that scenario one of two things happen - both of them are bad:
  1. Packet Loss. TCP reacts poorly to packet loss, especially on short flows. While 140 packets in aggregate is a pretty decent flow, remember that total transmission is made up of 35 different congestion control blocks - each one covering a packet flow of only 4 packets. A loss is devastating to performance because most of the TCP recovery strategies don't work well in that environment.
  2. Over Buffering. This is what Jim Gettys calls bufferbloat. The giant fast moving 140 packet burst arrives at your cable modem head where the bandwidth is stepped down and most of those packets get in a long buffer to wait for their turn on your LAN. That works OK, certainly better than packet loss recovery does in practice, but the deep queue creates a giant problem for any interactive traffic that is sharing that link. Packets for those other applications (such as VOIP, gaming, video chat, etc..) now have to sit in this long queue resulting in interactive lag. Lag sucks for the Internet. The HTTP streams themselves also become non-responsive to cancel events because the only way to clear those queues is to wait them out - so clicking on a link to a new page is significantly delayed while the old page that you have already abandoned continues to consume your bandwidth.
This describes a real dilemma - if you guess more aggressive send windows then you will have a better chance of filling the network but you will also have a better chance of packet loss or over buffering. If you guess more conservative windows then loss and buffering happens less often but nothing ever runs very quickly. In the face of all those flows with independent congestion control blocks, there just isn't enough information available. (This of course is related to the famous Initial Window 10 proposal, which I support, but that's another non SPDY story.)

I'm sure you can see where this is going now. SPDY's parallelism, by virtue of being on a single TCP stream, leverages one busy shared congestion control block instead of dealing with 36 independent tiny ones. Because the stream is much busier it rarely has to guess at how much to send (you only need to guess when you're idle, SPDY is more likely to be getting active feedback), if it should drop a packet it reacts to that loss much better via the various fast recovery mechanisms of TCP, and when it is competing for bandwidth at a choke point it is much more responsive to the signals of other streams - reducing the over buffering problem.

It is for these reasons that SPDY is really exciting to me. Parallel connections work great - they work so great that it is hard to have SPDY significantly improve on the page load time of highly sharded site unless there is a very high amount of latency present.  But the structural advantages of SPDY enable important efforts like RTCWeb as well as provide better network utilization and help servers scale when compared to HTTP. Even if page load times only stay at par, those other good for the Internet attributes make it worth deploying.

#2: SPDY is over SSL every time.

I greatly lament that I am late to the school of SSL-all-the-time. I spent many years trying to eek the greatest amount of server responses per watt that was possible. I looked at SSL and saw impediments. That stayed with me.

I was right about the impediments, and I've learned a lot about dealing with them, but what I didn't get was that it is simply worth the cost. As we have all seen lately, SSL isn't perfect - but having a layer of protection against an entire class of eavesdropping attacks is a property that should be able to be relied upon in every protocol as generic as HTTP. HTTP does not provide that guarantee - but SPDY does.


As a incentive to make the transition to SSL all the time, this makes it worth deploying by itself.

#3:Header compression. 

SPDY compresses all the HTTP-equivalent headers using a specialized dictionary and a compression context that is reserved only for the headers so it does not get diluted with non-header references. This specialized scheme performs very well.

At first I really didn't think that would matter very much - but it is really a significant savings. HTTP's statelessness had its upsides, but the resulting on the wire redundancy was really over the top.

As a sample, I am looking right now at a trace of 1900 resources (that's about 40 pages). 760KB of total downstream plain text header bytes were received as 88KB compressed bytes, and upstream 949KB of plain text headers were compressed as just 65KB. I'll take 1.56MB (82%) in total overhead savings!  I even have a todo item that will make this slightly better.