I've made a living dealing with fast networks and servers that run at really impressive transaction rates using all manner of nifty interconnects and parallelism. Sometimes I forget that the day to day web isn't all that fast in comparison.
My local copy of Firefox is annotated to dump a bunch of network stats when shutting down. One of them is a CDF of HTTP handshake times. This is from my desktop, which is connected to a premium cable broadband consumer Internet service. It's not as awesome as FIOS, but its still at the top portion of what a home consumer will have in the US, which in turn has certain geographic advantages when connection to many hosting companies. Its fair to say my performance is going to be at least a bit better than the average Internet user. And it is still slow. (We of course need to work to be able to better characterize what the real spectrum of experience is.)
This isn't scientific. It is where I happen to browse, and its just one datapoint although I can tell you my gut says it is pretty typical output - gathered over 15,000 connections.
Only about half of my handshakes are where I want them to be: < 100ms. Most of the rest fall in the next 300ms. To be fair there is a little skew in here because the code doesn't separate https from http, and SSL has an extra RTT in there. But SSL is a small fraction of the overall sample.
And this is the desktop. Think mobile and wireless.
Latency matters.
Real data and musings on the performance of networks, servers, protocols, and their related folks.
Tuesday, February 15, 2011
Monday, February 14, 2011
The Apex of Pipelines
Every once in a while I'm still surprised at the potential upside of pipelines.
I stumbled across a great example recently: Women In Technology International. That home page is setup in a pretty typical newsletter format. It has 159 resources, 145 of which are images along with about a half dozen pieces of js and css. Most of the images are small, with over 2/3 of them loading in less than 20ms of transfer time (time to first byte removed).
What is striking about this page is how large of an advantage pipelining can give even on a well connected broadband desktop with a 100ms RTT to the witi hosting facility. The average latency to receive the first byte of a resource dropped from 1697ms to 626ms, and the average elapsed time per transaction overall dropped from 1719ms to 652ms. Aggregate that over 159 different resources and you have some serious gains!
But why stop there? The pipeline sweet spot is in high latency situations such as mobile, or trans continental data transfer. This is what happens when we add 200 ms of latency to the connection:
That's right - 3300ms of improvement on each transaction! That seems absurdly good if we only added 200ms of latency, but what you're seeing is the aggregate queueing effect - Firefox wants 150 resources more or less simultaneously and can only parallelize it on 6 connections. If you are 25 positions deep on that queue you will have to wait at least 7500ms just for the back and forth of each transaction in front of you to complete.. obviously not everyone is queued that deeply so the average effect is somewhat less, but still overwhelming.
I stumbled across a great example recently: Women In Technology International. That home page is setup in a pretty typical newsletter format. It has 159 resources, 145 of which are images along with about a half dozen pieces of js and css. Most of the images are small, with over 2/3 of them loading in less than 20ms of transfer time (time to first byte removed).
What is striking about this page is how large of an advantage pipelining can give even on a well connected broadband desktop with a 100ms RTT to the witi hosting facility. The average latency to receive the first byte of a resource dropped from 1697ms to 626ms, and the average elapsed time per transaction overall dropped from 1719ms to 652ms. Aggregate that over 159 different resources and you have some serious gains!
But why stop there? The pipeline sweet spot is in high latency situations such as mobile, or trans continental data transfer. This is what happens when we add 200 ms of latency to the connection:
That's right - 3300ms of improvement on each transaction! That seems absurdly good if we only added 200ms of latency, but what you're seeing is the aggregate queueing effect - Firefox wants 150 resources more or less simultaneously and can only parallelize it on 6 connections. If you are 25 positions deep on that queue you will have to wait at least 7500ms just for the back and forth of each transaction in front of you to complete.. obviously not everyone is queued that deeply so the average effect is somewhat less, but still overwhelming.
Wednesday, February 2, 2011
HTTP Parallel Connections (Firefox edition!)
Parallelism helps when
Parallelism hurts when
I wonder if we can't do better in Firefox.. First, lets deal with the case of a low latency request. Right now all we do with them is to put them at the top of the waiting queue if the request cannot be dispatched immediately (because the limit of 6 has already been reached). But there are really two cases to consider:
In the case of a saturated network some of the existing parallel connections should be stalled while the low latency request is satisfied in order to provide the most bandwidth for that important transaction. We can do this by temporarily slamming their recv windows to something close to 1 packet of data which will slow them down to a trickle. This can be done commensurately with the transmission of the prioritized request as it should take 1/2 RTT for the window change to reach the sender.
But what about the more common case where all transactions are of equal priority - how do we make the decision then about opening a new connection vs queueing a new transaction? Assuming we aren't concerned about head of line blocking issues (which we should be able to wrap up in a definition of priorty somehow), then we want to do this only when there is network idleness that can be covered up by parallelism. This approach is radically different than "open up to N" connections.
It isn't obvious exactly how to determine that in Necko. But then again, you are looking for data bursts followed by idleness - and its pretty obvious when you see it graphed out. This is the transfer pattern of a single http response I looked at a couple of weeks ago - it could happily overlap with another flow in order to more effectively utilize the whole pipe. (of course, if the server used a larger initial CWND, the problem would be massively reduced.)
- It hides network idleness during TCP Handshakes though persistent connections help with this too.
- It hides network idleness during the first byte phase transactions, though pipelining can address this too.
- It hides network idleness during TCP slow start wait-for-ack periods. This is a big one.
- It provides a mechanism to prioritize and avoid head of line blocking problems.
- It steals bandwidth from competing "tcp friendly" flows by simply increasing the number of flows in one application. That's an arms race that most people think should be avoided.
Parallelism hurts when
- It increases the number of TCP Handshakes which are both slow and CPU intensive (at least compared to regular data packets) to execute - this assumes persistent connections are an alternative.
- It increases the overhead of normal data processing because more flows have to be considered typically via longer hash chains
- It increases the impact of memory overhead and processor cache pollution by increasing the number of simultaneous TCP control blocks that have to managed on both the client and the server.
- The resulting reduced amount of data per flow makes it harder to fully open sender congestion windows.
- Packet loss is increased due to the non correlated fluctuations of data to be sent between the parallel connections. Two competing flows that are both sending from infinite data sources will quickly adapt to share the bandwidth, but two flows that have a fluctuating demand (e.g. parallel persistent HTTP connections that periodically go idle and alive) will inherently have patterns of underutilizing and overutilizing the path. Overutilization results in either packet loss or excess buffering in the network, which leads to poor interactive response times.
- when I don't have an idle connection and I need the answer with minimum latency
- when I expect existing connections are experiencing idleness and therefore not using all of the available bandwidth
I wonder if we can't do better in Firefox.. First, lets deal with the case of a low latency request. Right now all we do with them is to put them at the top of the waiting queue if the request cannot be dispatched immediately (because the limit of 6 has already been reached). But there are really two cases to consider:
- What to do when the network is not already saturated
- What to do when the network is saturated
In the case of a saturated network some of the existing parallel connections should be stalled while the low latency request is satisfied in order to provide the most bandwidth for that important transaction. We can do this by temporarily slamming their recv windows to something close to 1 packet of data which will slow them down to a trickle. This can be done commensurately with the transmission of the prioritized request as it should take 1/2 RTT for the window change to reach the sender.
But what about the more common case where all transactions are of equal priority - how do we make the decision then about opening a new connection vs queueing a new transaction? Assuming we aren't concerned about head of line blocking issues (which we should be able to wrap up in a definition of priorty somehow), then we want to do this only when there is network idleness that can be covered up by parallelism. This approach is radically different than "open up to N" connections.
It isn't obvious exactly how to determine that in Necko. But then again, you are looking for data bursts followed by idleness - and its pretty obvious when you see it graphed out. This is the transfer pattern of a single http response I looked at a couple of weeks ago - it could happily overlap with another flow in order to more effectively utilize the whole pipe. (of course, if the server used a larger initial CWND, the problem would be massively reduced.)
Separating HTTP Connections from TCP Connections
The firefox http connection implementation, and most others I have seen, binds the http connection and the tcp connection together 1 to 1 something like this:
It turns out this happens, anecdotally, a whole freaking lot. My test network has a delay of about 250ms between Firefox and each server.
Loading the NY Times required 219 TCP connections of which 141 (64%) were able to be served on a different persistent connection that became available between the time we started to open the connection and the time the handshake actually completed.
Loading the wall of one of my facebook pals saw this behavior on 9 of 27 connections (33%), and the cnn home page performed similarly to NYT - 76/117 (65%).
This makes some intuitive sense - if you have a piece of HTML that includes an img it is entirely likely that we will start the request for the img before we have finished transferring the HTML, but the HTML will finish transferring before the new handshake (which will take a full RTT) completes. This algorithm just moves the request for the img over to the HTML persistent connection which can proceed as soon as the HTML is done.
The amount of latency saved is variable - indeed it is probably at least somewhat uniformly random with 0 and RTT as its bounds. I was seeing a mean of around 80ms on each one. Of course this doesn't apply to all transactions - just ones that are opening TCP connections to servers that also do persistent connections.
but its still cool.
- Check pconn pool for idle http connection
- if that succeeds, dispatch
- if limits allow, open a tcp connection and when that completes dispatch on it
- otherwise queue it and goto 1 when an existing transaction completes
- Check pconn pool for idle http connection
- if that succeeds, dispatch
- queue transaction
- if limits allow, start a tcp connection which will be added to the pconn pool when it completes
- whenever a pconn is available or a transaction completes check queue for dispatch
It turns out this happens, anecdotally, a whole freaking lot. My test network has a delay of about 250ms between Firefox and each server.
Loading the NY Times required 219 TCP connections of which 141 (64%) were able to be served on a different persistent connection that became available between the time we started to open the connection and the time the handshake actually completed.
Loading the wall of one of my facebook pals saw this behavior on 9 of 27 connections (33%), and the cnn home page performed similarly to NYT - 76/117 (65%).
This makes some intuitive sense - if you have a piece of HTML that includes an img it is entirely likely that we will start the request for the img before we have finished transferring the HTML, but the HTML will finish transferring before the new handshake (which will take a full RTT) completes. This algorithm just moves the request for the img over to the HTML persistent connection which can proceed as soon as the HTML is done.
The amount of latency saved is variable - indeed it is probably at least somewhat uniformly random with 0 and RTT as its bounds. I was seeing a mean of around 80ms on each one. Of course this doesn't apply to all transactions - just ones that are opening TCP connections to servers that also do persistent connections.
but its still cool.