Tuesday, February 15, 2011

Note to Self: The Web is Slow

I've made a living dealing with fast networks and servers that run at really impressive transaction rates using all manner of nifty interconnects and parallelism. Sometimes I forget that the day to day web isn't all that fast in comparison.

My local copy of Firefox is annotated to dump a bunch of network stats when shutting down. One of them is a CDF of HTTP handshake times. This is from my desktop, which is connected to a premium cable broadband consumer Internet service. It's not as awesome as FIOS, but it's still at the top end of what a home consumer can get in the US, which in turn has certain geographic advantages when connecting to many hosting companies. It's fair to say my performance is going to be at least a bit better than the average Internet user's. And it is still slow. (We of course need to work on better characterizing what the real spectrum of experience is.)

This isn't scientific. It is where I happen to browse, and it's just one data point, though my gut says it is pretty typical. The output was gathered over 15,000 connections.


Only about half of my handshakes are where I want them to be: < 100ms. Most of the rest fall within the next 300ms. To be fair there is a little skew in here because the code doesn't separate HTTPS from HTTP, and SSL adds an extra RTT. But SSL is a small fraction of the overall sample.
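If you want to build this kind of summary yourself without instrumenting a browser, a CDF over a list of handshake times is only a few lines of work. This is just an illustrative sketch with made-up sample values, not the instrumentation described above.

```python
# Illustrative sketch: turn a list of handshake times into CDF points.
# The sample values below are fabricated; real data would come from whatever
# instrumentation logs connection setup times.
def cdf_points(samples_ms):
    ordered = sorted(samples_ms)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]

handshake_ms = [42, 55, 61, 88, 95, 120, 180, 240, 310, 400]  # fabricated examples

for value, fraction in cdf_points(handshake_ms):
    print(f"{fraction:4.0%} of handshakes completed within {value} ms")
```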

And this is the desktop. Think mobile and wireless.

Latency matters.

Monday, February 14, 2011

The Apex of Pipelines

Every once in a while I'm still surprised at the potential upside of pipelines.

I stumbled across a great example recently: Women In Technology International. That home page is set up in a pretty typical newsletter format. It has 159 resources, 145 of which are images, along with about a half dozen pieces of JS and CSS. Most of the images are small, with over 2/3 of them loading in less than 20ms of transfer time (time to first byte removed).

What is striking about this page is how large an advantage pipelining can give even on a well connected broadband desktop with a 100ms RTT to the WITI hosting facility. The average latency to receive the first byte of a resource dropped from 1697ms to 626ms, and the average elapsed time per transaction overall dropped from 1719ms to 652ms. Aggregate that over 159 different resources and you have some serious gains!

But why stop there? The pipeline sweet spot is in high latency situations such as mobile, or transcontinental data transfer. This is what happens when we add 200ms of latency to the connection:

That's right - 3300ms of improvement on each transaction! That seems absurdly good if we only added 200ms of latency, but what you're seeing is the aggregate queueing effect - Firefox wants 150 resources more or less simultaneously and can only parallelize them over 6 connections. If you are 25 positions deep in that queue you will have to wait at least 7500ms just for the back and forth of each transaction in front of you to complete. Obviously not everyone is queued that deeply, so the average effect is somewhat less, but still overwhelming.
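A rough back-of-the-envelope model shows where numbers like that come from. This is only a sketch: it charges one RTT per transaction ahead of you on a connection and ignores transfer time, which is nearly true for the page's many tiny images.

```python
# Rough model of head-of-line queueing without pipelining: N resources are
# requested at once over C parallel connections, and every non-pipelined
# transaction ahead of you on a connection costs roughly one RTT.
def average_queue_wait_ms(resources, connections, rtt_ms):
    waits = [(i // connections) * rtt_ms for i in range(resources)]
    return sum(waits) / len(waits)

# ~150 resources, 6 connections, ~300ms RTT (100ms base + 200ms added)
print(average_queue_wait_ms(150, 6, 300))   # ~3600ms average extra wait
print(25 * 300)                             # 25 transactions ahead of you ~= 7500ms
```

The average this crude model produces lands in the same ballpark as the measured per-transaction improvement above, which is about all you can ask of it.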

Wednesday, February 2, 2011

HTTP Parallel Connections (Firefox edition!)

Parallelism helps when
    • It hides network idleness during TCP handshakes, though persistent connections help with this too.
    • It hides network idleness during the first-byte phase of transactions, though pipelining can address this too.
    • It hides network idleness during TCP slow start wait-for-ack periods. This is a big one.
    • It provides a mechanism to prioritize and avoid head of line blocking problems. 
    • It steals bandwidth from competing "tcp friendly" flows by simply increasing the number of flows in one application. That's an arms race that most people think should be avoided.

 Parallelism hurts when
    • It increases the number of TCP handshakes, which are both slow and CPU intensive to execute (at least compared to regular data packets) - this assumes persistent connections are an alternative.
    • It increases the overhead of normal data processing because more flows have to be considered, typically via longer hash chains.
    • It increases the impact of memory overhead and processor cache pollution by increasing the number of simultaneous TCP control blocks that have to be managed on both the client and the server.
    • The resulting reduced amount of data per flow makes it harder to fully open sender congestion windows.
    • Packet loss is increased due to the non-correlated fluctuations of data to be sent between the parallel connections. Two competing flows that are both sending from infinite data sources will quickly adapt to share the bandwidth, but two flows that have a fluctuating demand (e.g. parallel persistent HTTP connections that periodically go idle and then become active again) will inherently have patterns of underutilizing and overutilizing the path. Overutilization results in either packet loss or excess buffering in the network, which leads to poor interactive response times.
When should we open a new parallel connection?
  1. when I don't have an idle connection and I need the answer with minimum latency
  2. when I expect existing connections are experiencing idleness and therefore not using all of the available bandwidth
The approach HTTP implementations, including Firefox, take to solving this quandary? They crudely enforce a constant number of connections per host and open new ones until they hit that limit. Over time that limit has commonly been 2, 4, and 6. Years ago, as server technology evolved to the point where the impact of idleness was a bigger deal than the CPU overhead on the server, we saw servers publish their resources under several virtual host names, even though it was all the same server, for the exclusive purpose of circumventing that per-host limit in the client.
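To make the baseline concrete, here is a sketch of that "crudely enforce a constant limit" policy. The limit of 6 and the data structures are placeholders, not Firefox's actual code.

```python
# Illustrative sketch of the classic "open up to N per host" dispatch policy.
MAX_PER_HOST = 6

class HostEntry:
    def __init__(self):
        self.idle = []       # idle persistent connections to this host
        self.active = 0      # connections in use or currently being opened
        self.pending = []    # transactions waiting for a connection

    def dispatch(self, transaction):
        if self.idle:
            return ("reuse", self.idle.pop(), transaction)
        if self.active < MAX_PER_HOST:
            self.active += 1
            return ("open-new", None, transaction)
        self.pending.append(transaction)     # blocked until something finishes
        return ("queued", None, transaction)
```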

I wonder if we can't do better in Firefox. First, let's deal with the case of a low latency request. Right now all we do with such a request is put it at the top of the waiting queue if it cannot be dispatched immediately (because the limit of 6 has already been reached). But there are really two cases to consider:
  1. What to do when the network is not already saturated
  2. What to do when the network is saturated
In both cases the first step for a truly low latency request is the same - open a new connection, assuming there isn't an idle one available. However, note that establishing that connection is going to take at least 1 RTT for normal HTTP and 2 RTTs for HTTPS - so we should actually watch for any existing HTTP transaction to complete on a different reusable persistent connection between the time we start opening the new connection and the time the handshake is complete. If that happens the persistent connection should be used instead - that will require a change in the current logic where nsHttpConnection opens the sockets after it has been assigned a transaction. Instead nsHttpConnectionMgr should be opening the sockets as well as receiving the returned persistent connections, and then should dispatch to them as they become available.

In the case of a saturated network some of the existing parallel connections should be stalled while the low latency request is satisfied, in order to provide the most bandwidth for that important transaction. We can do this by temporarily slamming their recv windows down to something close to 1 packet of data, which will slow them to a trickle. This can be done at the same time as the transmission of the prioritized request, as it should take 1/2 RTT for the window change to reach the sender.

But what about the more common case where all transactions are of equal priority - how do we make the decision then about opening a new connection vs queueing a new transaction? Assuming we aren't concerned about head of line blocking issues (which we should be able to wrap up in a definition of priority somehow), then we want to do this only when there is network idleness that can be covered up by parallelism. This approach is radically different from "open up to N connections".

It isn't obvious exactly how to determine that in Necko. But then again, you are looking for data bursts followed by idleness - and it's pretty obvious when you see it graphed out. This is the transfer pattern of a single HTTP response I looked at a couple of weeks ago - it could happily overlap with another flow in order to more effectively utilize the whole pipe. (Of course, if the server used a larger initial CWND, the problem would be massively reduced.)

Separating HTTP Connections from TCP Connections

The Firefox HTTP connection implementation, and most others I have seen, binds the HTTP connection and the TCP connection together one to one, something like this:
  1. Check pconn pool for idle http connection
  2. if that succeeds, dispatch
  3. if limits allow, open a tcp connection and when that completes dispatch on it
  4. otherwise queue it and goto 1 when an existing transaction completes
As part of the refactoring of bugzilla 623948 I have separated this logic out a little bit, down to its TCP roots:

  1. Check pconn pool for idle http connection
  2. if that succeeds, dispatch
  3. queue transaction
  4. if limits allow, start a tcp connection which will be added to the pconn pool when it completes
  5. whenever a pconn is available or a transaction completes check queue for dispatch
The major difference is that when a transaction in the first scenario decides to open an HTTP connection it always waits for that connection to complete and then dispatches on it. In the second scenario the two actions are taken independently: if another connection frees up before the newly demanded TCP connection is ready we can use that instead (and then cache the new connection in the pconn pool when its handshake does complete).
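Here is a sketch of that decoupled model: transactions wait in one queue, TCP connections are opened speculatively, and whichever connection becomes usable first (a freed persistent connection or a finished handshake) gets the next queued transaction. The names and structure are placeholders, not the actual nsHttpConnectionMgr code.

```python
# Illustrative sketch of decoupling HTTP transactions from TCP connections.
from collections import deque

class ConnectionMgr:
    def __init__(self, per_host_limit=6):
        self.limit = per_host_limit
        self.idle = []           # persistent connections ready for reuse
        self.half_open = 0       # handshakes still in flight
        self.in_use = 0
        self.queue = deque()     # pending transactions

    def add_transaction(self, txn):
        self.queue.append(txn)
        if not self.idle and self.in_use + self.half_open < self.limit:
            self.half_open += 1  # start a handshake, but don't bind it to txn
        self._dispatch()

    def handshake_complete(self, conn):
        self.half_open -= 1
        self.idle.append(conn)   # a new connection simply joins the reusable pool
        self._dispatch()

    def transaction_complete(self, conn, reusable=True):
        self.in_use -= 1
        if reusable:
            self.idle.append(conn)   # a freed pconn can beat a pending handshake
        self._dispatch()

    def _dispatch(self):
        while self.queue and self.idle:
            txn, conn = self.queue.popleft(), self.idle.pop()
            self.in_use += 1
            print(f"dispatching {txn} on {conn}")
```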

It turns out this happens, anecdotally, a whole freaking lot. My test network has a delay of about 250ms between Firefox and each server.

Loading the NY Times required 219 TCP connections of which 141 (64%) were able to be served on a different persistent connection that became available between the time we started to open the connection and the time the handshake actually completed.

Loading the wall of one of my Facebook pals saw this behavior on 9 of 27 connections (33%), and the CNN home page performed similarly to the NYT - 76/117 (65%).

This makes some intuitive sense - if you have a piece of HTML that includes an img it is entirely likely that we will start the request for the img before we have finished transferring the HTML, but the HTML will finish transferring before the new handshake (which will take a full RTT) completes. This algorithm just moves the request for the img over to the HTML persistent connection which can proceed as soon as the HTML is done.

The amount of latency saved is variable - indeed it is probably at least somewhat uniformly random with 0 and RTT as its bounds. I was seeing a mean of around 80ms on each one. Of course this doesn't apply to all transactions - just ones that are opening TCP connections to servers that also do persistent connections.

But it's still cool.

Tuesday, January 18, 2011

HTTP PSA: beware unpadded content-md5

You don't see a lot of HTTP Content-MD5 response headers, but I just discovered some piece of code that generates unpadded base64 versions, i.e. a 22 byte value:

Content-MD5: 6Cxy6QbruJs0hrT/P8exaA

I figured HTTP followed MIME rules and required a multiple of 4 characters, i.e.:

Content-MD5: 6Cxy6QbruJs0hrT/P8exaA==

Weirdly, after checking the relevant specs it isn't actually clear to me whether the = padding is required. I'm probably missing something obvious. But as this topic generates absolutely zero Google juice, this post is a public service announcement - expect both versions in your clients.
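If you want to be liberal in what you accept, re-padding before decoding is easy. A small sketch (the header value is the one from above):

```python
# Accept both padded and unpadded base64 Content-MD5 values by restoring
# the '=' padding before decoding.
import base64
import hashlib

def decode_content_md5(value):
    value = value.strip()
    value += "=" * (-len(value) % 4)      # pad out to a multiple of 4 characters
    return base64.b64decode(value)

assert decode_content_md5("6Cxy6QbruJs0hrT/P8exaA") == \
       decode_content_md5("6Cxy6QbruJs0hrT/P8exaA==")

def md5_matches(body_bytes, header_value):
    return hashlib.md5(body_bytes).digest() == decode_content_md5(header_value)
```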

Monday, January 17, 2011

Firefox Pipeline Patches - Try them out

As you may know, I've been on a little odyssey to make pipelining safe, fast, and aggressive. I've got a dozen patches for Firefox 4 that do that.

This post is your chance to try them out. Download a build for your favorite OS and try them:

http://www.ducksong.com/misc/pipeline-builds/based-4.0-beta9-1/

You will still need to set network.http.pipelining to true in about:config in order to enable the code. You can also set network.http.pipelining.aggressive to true if you want to disable the "gently test the waters" code. I do that when I want to measure the pipelines, but leaving aggressive mode off is probably a good idea if you want the smoothest experience possible. That said, the various recovery mechanisms are working well enough (and are used rarely enough) that I am considering making normal mode considerably more like aggressive mode.

Consider this a beta level pre-test. I want your feedback and any bug reports.

Tuesday, January 11, 2011

Firefox Idle Connection Selection via CWND

When choosing between more than one idle persistent connection, FF <= 4.0 goes with a FIFO approach. I was thinking about ways to tune this.

I was going to experiment with making that a LIFO. A LIFO, after all, should have better cache behavior and would allow the pool to shrink to a more natural size as the older members remain untouched and time out. The FIFO will basically keep the size pinned at its maximum while it cycles through all the connections, which wastes system RAM with both the client and server maintaining extra idle TCP sessions. It is also the worst possible processor cache behavior. The possible argument in favor of FIFO is that connections are expensive to open and we've already opened these, so maybe we want to keep the pool pinned at its maximum size just in case we need it again - it isn't obvious what to do or whether it matters much.

Thinking a little further I realized that the major differentiator between these sockets is not a timestamp at all - it is the CWND associated with them on the server. Many web servers at least partially ignore the TCP suggestion to return to slow start after an RTO's worth of idleness, so it is reasonable to assume that some of the connections have larger CWNDs than others, as long as we aren't in an environment where the previous transfers have actually been bottlenecked by bandwidth - and on the web that almost never happens; RTT and document size bottleneck most transfers. Even if available bandwidth is the real bottleneck, that just moots our metric; it doesn't provide any information that steers us the wrong way.

When choosing which connection to use we want to choose the one that has the largest CWND on the server. Unfortunately we cannot see directly into the congestion information on the server, but assuming that previous transfers have been bottlenecked by RTT (approximately a constant for all connections to the same server) and transfer size then we can just use the one with the history of moving the largest amount of data in one burst as a proxy for it because that burst is what opens the congestion window.

I say one burst instead of one document because we want to include pipelines of multiple HTTP transactions as long as there wasn't an RTT between them. This is another reason to want pipelines - they will more effectively open up the CWND for us. We also want to use the maximum burst seen on the connection, not necessarily the last burst - the servers we are interested in don't shrink the CWND just because it isn't all being used for each burst.

The implementation is easy - the idle connection list is changed from being sorted as a FIFO to being sorted according to this "maxburst" metric. The code and bugzilla are here.
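The idea is small enough to sketch in a few lines. This is illustrative only, assuming each connection records the largest burst (in bytes) it has ever delivered without an RTT-sized gap:

```python
# Illustrative sketch: choose the idle persistent connection whose largest
# previously observed burst (a proxy for the server's CWND) is biggest.
class IdleConn:
    def __init__(self, name, max_burst_bytes):
        self.name = name
        self.max_burst = max_burst_bytes   # biggest burst ever seen, not the last one

def take_best_idle(pool):
    if not pool:
        return None
    best = max(pool, key=lambda c: c.max_burst)
    pool.remove(best)
    return best

pool = [IdleConn("carried-a-small-icon", 4_000),
        IdleConn("carried-a-2MB-image", 300_000)]
print(take_best_idle(pool).name)    # "carried-a-2MB-image"
```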

Using an experiment designed to show the best case, the results are better than I expected for such a minor tweak. This was my process:
  • construct a base page mixed with several small and several large images plus a link to a 25KB image. There are 6 objects on the base page.
  • load the base page - FF4 will use six parallel connections to do it
  • click on the link to the 25KB image - this will use an idle persistent connection. Measure the performance of this load.
There is 250ms of RTT and several megabits of bandwidth between the client and server in my test.

As expected, vanilla FF 4.0 (beta 9) loads the target image on an idle persistent connection that was used to transfer one of the smallest images on the base page. It is FIFO after all, and the small image load was going to finish first and be put into the persistent connection pool first.



My patched browser, using the sort by CWND algorithm, loads the target image on the idle persistent connection that was used to transfer the largest image (2MB+) from the base page.


Their history is the only difference between the two connections - at the time of requesting the 25KB image they are both connected and idle. There is a 250ms RTT between my host and the server.
 

The difference is huge. The default algorithm takes 3 round trips to transfer the document as it works its way through growing the congestion window. That adds up to 793ms. The sort-by-cwnd algorithm inherits a congestion window large enough for the task at hand and moves it all in one stream - just 363ms.
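The 3-round-trip number falls straight out of slow start arithmetic. A sketch, assuming ~1460-byte segments, an initial window of 3 segments, and a window that doubles each round trip (all assumptions; servers vary):

```python
# How many round trips does slow start need to deliver `size_bytes`,
# if the congestion window starts at `initial_segments` and doubles each RTT?
MSS = 1460

def rounds_to_deliver(size_bytes, initial_segments):
    cwnd, delivered, rounds = initial_segments, 0, 0
    while delivered < size_bytes:
        delivered += cwnd * MSS
        cwnd *= 2
        rounds += 1
    return rounds

print(rounds_to_deliver(25_000, initial_segments=3))    # 3 rounds (3 + 6 + 12 segments)
print(rounds_to_deliver(25_000, initial_segments=18))   # 1 round with a big inherited window
```

With a 250ms RTT, three rounds versus one is roughly the 793ms-versus-363ms difference observed above.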

This is a nice tweak, but it has its limitations - by definition you cannot meaningfully multiply the gain across a large number of transactions. If you have a large number of transactions then you probably are using all your idle connections in a burst and there is no point in discriminating between them if you are just going to use them all.

I would argue if you have that much pressure on the connection pool then you are probably queueing requests and should be using pipelining. If you don't have that much pressure, then this probably helps you.

Thursday, December 16, 2010

Accelerated Connection Retry for HTTP and Firefox

Not all packet loss is created equal. In particular, losing a SYN can really ruin your day - or at least the next 3 seconds, which can feel like all day. Most operating systems wait 3 seconds before retrying a lost SYN. Most other timeouts are dynamically scaled to the network conditions, but not the SYN timeout - it is generally hardcoded. And on most of today's networks 3 seconds is an eternity.

So, in FF we took a page from Chrome's book and said if Firefox has been waiting for 250ms (configurable via network.http.connection-retry-timeout) then start a second connection in parallel with the first one. Assuming you've got .25% packet loss and general independence between packets spaced that far apart, the approach turns the 3 second pause from a 1 in 400 event into a 1 in 160,000 event. That is pretty much the difference between "kinda annoying" and "didn't notice". It's a good idea - if you hate it for whatever reason just set the pref to 0 to disable it.
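The arithmetic behind that claim, as a quick sketch (assuming the two SYNs are lost independently):

```python
# With loss rate p and independent losses, the 3 second SYN stall now
# requires losing both the original SYN and the 250ms backup SYN.
p = 0.0025                       # 0.25% packet loss
print(f"without retry: 1 in {1 / p:,.0f}")        # 1 in 400
print(f"with retry:    1 in {1 / (p * p):,.0f}")  # 1 in 160,000
```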

Taking the idea one step further: if we create two connections because of this timer and they both end up completing, obviously only one can be used immediately. But we cache the other one as if it were a persistent connection - and then when you need it (which you probably will) you don't have to wait for the handshake at all. It is essentially a prefetched TCP connection. On my desktop, I run with an especially low timer so that any site with a > 100ms RTT benefits from this, and it's great!

You can see this effect below, using mnot's cool htracr, on the second connection. Note how there is no request placed on it as soon as it is live (the request is the red dot at the top of the grey rectangle - the rectangle represents the connection), but one follows shortly thereafter without having to do a handshake. That's an RTT saved!


You will be able to enjoy this feature in FF 4.0 Beta 9. A buggy version of it is actually included in Beta 8 but disabled behind the pref mentioned above. Feel free to enable and play with it before Beta 9 if you don't mind a connection freeze once in a while.

Wednesday, December 1, 2010

Performance of Pipelining in HTTP Firefox

This post provides some performance measurements of my HTTP pipeline patches for Firefox.

The key benefits of pipelining are reduced transaction latency and potentially the use of fewer connections, so those are generally the metrics I will focus on here. For each of my 5 test cases we will look at the following statistics:
  • The percentage of requests that are delayed in the request queue inside Firefox waiting for a connection to be available.
  • The average amount of queue latency for each transaction. This is measured as the time the first byte of the request is given to the kernel minus the time at which the request was presented to Necko/Gecko. It is generally very low but greater than 0 even if the transaction can be placed directly on a pipeline, because it takes a moment to construct the request and perhaps schedule the socket thread if other things are on the CPU. But a high value is an opportunity lost - that is time the request could be in flight and the server could be processing it. It includes the time needed to create a new connection if one is required - that includes the three way TCP handshake but not a DNS lookup (which is cached in the test).
  • The type of connection used for each transaction (new connection, an idle persistent connection, or a pipelined connection)
  • The average amount of transaction latency for each transaction. This is measured as the time the first byte of the response is received by Firefox minus the time at which the request was presented to Necko/Gecko. It is possible for the average improvement to be greater than 1 RTT because of cumulative queueing delays in the non-pipelined case.
  • The cumulative fraction of transactions completed before 3 different elapsed times in order to show improved execution time for the test case. The benchmark times are sized appropriately for each test case.

There are 4 data points for each criterion in each test case. Because pipelining is aimed at environments with significant latency, and my broadband test connectivity has below average latency compared to much of the world and every mobile environment, the first two data points have 200ms of latency added through a traffic shaper. Those two data points compare pipelining on vs pipelining off. The other two data points measure the same things but without the induced latency.

All tests are run with both a disk and memory cache enabled but empty at the beginning of the run. In order to measure the effectiveness of the pipelining, each of these sites has been put in the "green - pipelining ok" state which is normally auto discovered.

Facebook


The first test is on Facebook. It starts by logging in and then selecting a particular user profile and navigating from that page via lists of friends, occasionally pulling up individual profiles, generating lists of recent updates, and pressing the More link on busy Facebook walls. There are approximately 1400 HTTP transactions in each test run.

The first thing to consider is the percent of requests that are delayed (i.e. queued) within firefox. I think queuing is particularly bad because if the request hasn't been passed to the network there is no way for any advances in server technology to ever operate on it. For instance - servers are prevented from returning responses out of order but nothing would prevent them from processing requests out of order in order to overlap latencies in DB queries and disk I/O if the requests were not queued on the browser side.

Percent of Requests Queued
                                 With Pipeline   Without Pipeline
  Moderate Latency               0               79.1
  Low Latency                    0               78.6

That is a stark contrast. It is possible by the way to see a request being queued with pipelining enabled - a default configuration limit of 32 governs the maximum depth of the pipeline and not all request types are pipeline eligible.

Average Queue Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               29              1630
  Low Latency                    6               285

You might expect to see 0ms for the pipelining case as we just illustrated above that no requests were delayed. But the queue latency covers the time from request submission to the time of putting the first byte of the request on the wire, so that includes any connection setup time when establishing a new connection. That is the primary source of the latency seen here for the pipeline enabled case.

That begs the question, when pipelining is enabled how many of the requests are pipelined?
Connection Type Pct
                 Moderate Latency   Moderate Latency   Low Latency   Low Latency
                 w/Pipeline         wo/Pipeline        w/Pipeline    wo/Pipeline
  New            2                  4                  2             3
  Reused Idle    13                 96                 13            97
  Pipeline       85                 0                  84            0

We see here a moderate reduction in the number of connections used when pipelining, but most of the effect is a transfer from idle persistent connections over to pipelines. While the percentage of new connections as a portion of the overall request stream has gone down just a tick with pipelining, the impact on the actual number of raw connections is significant - going from roughly 60 without pipelining to 30 with it enabled. That boils down to a 50% reduction in the number of connections created, which provides a very busy site like Facebook a significant scalability boost.

The final criteria deal with transaction latency.

Average Transaction Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               702             1906
  Low Latency                    341             346

Yowza! Now there is a result. Under conditions with moderate latency the average transaction waits 1200ms less from the time the request is submitted to Necko to the time the first byte of the response header is received. The net effect is so much more than the approximately 250ms RTT because of aggregated queueing delays - without pipelining enabled you are placed in a deep queue which has to be totally cleared, with a 1 RTT overhead on each entry, before your request is executed. The impact under low latency conditions is probably close to being noise.

Pct of Responses Rcvd in < Xms
                                  x=1500   x=1200   x=900
  Moderate Latency w/Pipeline     97       91       75
  Moderate Latency wo/Pipeline    45       38       32

Pct of Responses Rcvd in < Xms
                                  x=1000   x=700    x=400
  Low Latency w/Pipeline          99       92       61
  Low Latency wo/Pipeline         99       91       61

Facebook is a big success - probably the biggest success of any of the tests. 200+ms latency situations have performance significantly improved, and low latency scenarios perform similarly while using a few fewer TCP connections.

Amazon.com


The Amazon.com test walks through a basic window shopping experience at Amazon.com. The home page is loaded, the kindle link is clicked, a few more categories are clicked and the lists of products are generally browsed and sorted by "hot and new" and other similar things. This boils down to about 800 HTTP transactions.

Percent of Requests Queued
                                 With Pipeline   Without Pipeline
  Moderate Latency               0               54.4
  Low Latency                    0               39.8

Right away you can see that amazon queues fewer requests than Facebook, so the potential improvement is less.

Average Queue Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               116             791
  Low Latency                    12              136

Connection Type Pct
                 Moderate Latency   Moderate Latency   Low Latency   Low Latency
                 w/Pipeline         wo/Pipeline        w/Pipeline    wo/Pipeline
  New            12                 13                 6             13
  Reused Idle    20                 96                 28            87
  Pipeline       68                 0                  66            0

The first thing to note is that less pipelining is going on than with Facebook, so again there is less potential for improvement. How the pages are constructed has a lot to do with this (perhaps fewer images, etc.). But almost as interesting is the fact that the number of TCP connections (i.e. new connections) is halved in the low latency case. If the page can be transferred in the same amount of time using fewer connections, that is still a win for the web overall.

Average Transaction Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               635             1083
  Low Latency                    266             204

An interesting result - 400ms off the average transaction in the ~250ms RTT environment, but a notable loss in the low latency scenario. All of the numbers here are averages across two test runs, but just inspecting the Amazon test case on some other ad-hoc runs showed quite a bit of variability. My suspicion is that server load occasionally results in a single resource taking a long time to return. In the past I have seen this disable pipelining for HTML pages while leaving it enabled for images.

Pct of Responses Rcvd in < Xms
                                  x=1200   x=900    x=600
  Moderate Latency w/Pipeline     91       79       62
  Moderate Latency wo/Pipeline    77       69       61

Pct of Responses Rcvd in < Xms
                                  x=1000   x=700    x=400
  Low Latency w/Pipeline          93       90       84
  Low Latency wo/Pipeline         99       94       85

Flickr


The flickr test is probably the simplest of the cases. It simply loads several galleries based on set names and tags. There are roughly 350 HTTP transactions in the test. Under normal conditions Flickr has a high variability in server response time.

Percent of Requests Queued
                                 With Pipeline   Without Pipeline
  Moderate Latency               0               57
  Low Latency                    0               57

Average Queue Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               43              814
  Low Latency                    7               211

Connection Type Pct
                 Moderate Latency   Moderate Latency   Low Latency   Low Latency
                 w/Pipeline         wo/Pipeline        w/Pipeline    wo/Pipeline
  New            10                 27                 12            27
  Reused Idle    19                 73                 19            73
  Pipeline       71                 0                  69            0

As with the other tests, more than half of the new connections have been replaced when pipelining is enabled.

Average Transaction Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               859             1091
  Low Latency                    291             366

This result is more modest, but still positive, when compared to our other tests.

Pct of Responses Rcvd in < Xms
                                  x=2000   x=1500   x=1000
  Moderate Latency w/Pipeline     95       87       67
  Moderate Latency wo/Pipeline    88       74       60

Pct of Responses Rcvd in < Xms
                                  x=1000   x=700    x=400
  Low Latency w/Pipeline          99       97       72
  Low Latency wo/Pipeline         98       93       71

www.AsiaNewsPhoto.com


The test site is a photojournalism clearinghouse located overseas, so the broadband low latency case has a starting RTT closer to 100ms, while the moderate delay case adds 200ms to that. This is the smallest test case - just 175 transactions in each run.

Percent of Requests Queued
                                 With Pipeline   Without Pipeline
  Moderate Latency               0               44
  Low Latency                    0               38

Average Queue Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               21              726
  Low Latency                    31              400

I am not yet certain how to explain the very modest rise in queue time for the pipeline case when the added 200ms delay is removed. It must involve an aberrant TCP connection, as that is really the only component of queue time when the requests themselves are not delayed due to connection limits.

Connection Type Pct
                 Moderate Latency   Moderate Latency   Low Latency   Low Latency
                 w/Pipeline         wo/Pipeline        w/Pipeline    wo/Pipeline
  New            16                 7                  11            8
  Reused Idle    33                 93                 33            92
  Pipeline       51                 0                  56            0

This is the first time we actually see the new connection numbers moving in the wrong direction. In this case I believe the type scheduling restrictions placed on the connection manager are generating new connections that may have been unnecessary in the non-pipelining scenario. I'm curious whether the effect would fade in a test with more transactions.

Average Transaction Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               1310            1248
  Low Latency                    752             689

This seems to track the changes in connection types and maybe the test bears further examination to see if an adjustment can be made. The scheduling algorithm seems to be getting in the way of itself and has made performance just a tick worse than before, though not by very much. And certainly not by enough to discount the gains made in some other scenarios.

Pct of Responses Rcvd in < Xms
                                  x=2000   x=1500   x=1000
  Moderate Latency w/Pipeline     84       82       52
  Moderate Latency wo/Pipeline    82       76       54

Pct of Responses Rcvd in < Xms
                                  x=1500   x=1000   x=500
  Low Latency w/Pipeline          88       82       46
  Low Latency wo/Pipeline         84       82       57

MapQuest


This test is different in that it is driven almost exclusively through JS and XMLHttpRequest. Those elements are present in the Facebook and Amazon tests as well, but they dominate the MapQuest test. In this scenario a map is brought up on the screen and it is manipulated in the usual ways - panning in 4 directions, zooming in and out, and toggling between satellite and map mode. By the time it is done, 711 HTTP transactions have been made.

Percent of Requests Queued
                                 With Pipeline   Without Pipeline
  Moderate Latency               0               42
  Low Latency                    0               44

Average Queue Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               52              381
  Low Latency                    10              198

For all cases, the queue latency is pretty low for this test. That means the number of documents requested in one burst is relatively modest.

Connection Type Pct
                 Moderate Latency   Moderate Latency   Low Latency   Low Latency
                 w/Pipeline         wo/Pipeline        w/Pipeline    wo/Pipeline
  New            12                 16                 11            14
  Reused Idle    30                 84                 32            86
  Pipeline       58                 0                  57            0

Marginally fewer new connections are used with pipelining. Hurrah.

Average Transaction Latency (ms)
                                 With Pipeline   Without Pipeline
  Moderate Latency               677             732
  Low Latency                    260             500

Pct of Responses Rcvd in < Xms
                                  x=1500   x=1200   x=900
  Moderate Latency w/Pipeline     96       87       76
  Moderate Latency wo/Pipeline    96       83       70

Pct of Responses Rcvd in < Xms
                                  x=900    x=600    x=300
  Low Latency w/Pipeline          99       95       65
  Low Latency wo/Pipeline         87       75       40

Tuesday, November 30, 2010

Implementing a Pipelined Client for Firefox

I have been working on a set of patches to revamp the Firefox HTTP pipeline implementation. The objective is to provide a fast and robust implementation of pipelines. There are lots of reasons to think this is a good thing.

The lowest level of the current implementation, which is disabled by default, receives a few minor tweaks. The primary change is to implement a continually fillable pipeline instead of the existing one where a pipeline is loaded up to whatever depth the request queue supports and then is not refilled until it has been emptied.

There are much more serious changes on top of that. The most significant is probably captured in the "Select Connection Based on Type and State" bug and patch. This code does four things: 1) it classifies each transaction into a particular type, 2) it keeps track of the history of each server with respect to pipelining, 3) it determines whether or not pipelining is appropriate based on the type of the transaction and the history of the server with that type in the past, and 4) if it decides to pipeline it places it on a pipeline filled with only transactions of the same classification.

That's a lot to think about. But the basic idea is to sort requests into control traffic (i.e. js and css), images, revalidations, htmlish things, and things known to be pipeline inappropriate such as video, most XMLHttpRequest, and non-idempotent transactions. I call the latter classification "solo".
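A rough sketch of what that sorting might look like, loosely following the buckets above. The real logic keys off far more than the URL suffix; everything here is a placeholder.

```python
# Illustrative transaction classification into pipeline classes.
def classify(url, method="GET", is_revalidation=False, is_xhr=False):
    if method not in ("GET", "HEAD") or is_xhr:
        return "solo"                      # non-idempotent and long-poll-prone requests
    if is_revalidation:
        return "revalidation"
    if url.endswith((".js", ".css")):
        return "control"
    if url.endswith((".png", ".jpg", ".jpeg", ".gif", ".ico")):
        return "image"
    if url.endswith((".mp4", ".webm", ".flv", ".ogg")):
        return "solo"                      # video streams are pipeline inappropriate
    return "html"

print(classify("/style/main.css"))          # control
print(classify("/icons/smile.gif"))         # image
print(classify("/ajax/poll", is_xhr=True))  # solo
```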

Classes of data such as CSS and revalidations that generally have very short responses, and thus benefit the most from pipelining, favor pipelines over parallel connections even when we have less than the maximum number of connections already open.

A variety of things can plague a successful pipelining session - for instance a site may have some documents with a large processing time on the server, and that "think time" can really gum up the pipeline. By segregating the types of documents we can turn off pipelining for whatever is DB driven (e.g. XHTML) while still chasing down deep pipelines for images. Facebook is a terrific example of this: the contents of your wall or your friend list probably have to be scooped out of a database computation and composed in real time, but the dozens of icons they reference are whipped right out when you have their direct key.

This concept of negative feedback (in the example above it would be a read of the HTML page with a latency well in excess of the RTT to the server) is what drives the server state. Very large responses, sub HTTP/1.1 responses, cancelled pipelines, closed connections, and server headers known to be associated with broken pipeline implementations all trigger negative feedback too.

Most feedback is tied to the classification of the particular request, but some is applied to all classes on the server. Corrupted data is one such case - if the response contains an invalid MD5 sum, or uses the proposed Assoc-Req header but fails to provide the correct information, or appears to have non-sensical framing information, all requests to that server are prevented from using pipelines for a long time. In the past such corruptions have happened due to buggy or compromised servers and intermediaries.

Much like congestion control, the determination of whether or not to proceed with a pipeline is based on both negative and positive feedback. Initially pipelines are not sent. We must first see a HTTP/1.1 response header from the server. After that has been accomplished we try and send a pipeline of only depth 2 and only on a single connection. If both of those responses are received OK then we transition from that tentative state (known internally as yellow) to a position where each connection can send pipelines of up to depth 4 instead of opening all the way. Even after the depth-of-2 test succeeds we can still not be certain the topology supports pipelines - what appears as a short pipeline at the client may not appear that way at the server due to race conditions inherent in network transfer, but the extension to a depth of 4 for every connection still represents significant additional capacity over a single connection being allowed a depth of 2. During this phase concurrent connections are of course also used, so nothing is slowed down over the historical pipelining disabled setting.

Once a transaction sent at a depth of at least 3 is successfully received, the depth limits are removed from all connections to that host. The new maximum depth is only governed by a configuration preference, with a default of 32. Should one of the negative events mentioned earlier occur, that class of transaction is placed in the red state (i.e. no pipelining is allowed to that server for a period of time), generally without interfering with the other transaction classes.
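A sketch of that ramp-up logic, reduced to its essentials. The red/yellow names come from the post itself; the intermediate "probe4" stage name and everything else here is a simplification, not the actual patch.

```python
# Illustrative sketch of per-host, per-class pipeline feedback:
# RED (no pipelining) -> YELLOW (one probing pipeline of depth 2)
# -> PROBE4 (depth 4 on every connection) -> GREEN (limited only by the pref).
RED, YELLOW, PROBE4, GREEN = "red", "yellow", "probe4", "green"
MAX_DEPTH_PREF = 32           # the configuration preference default

class PipelineFeedback:
    def __init__(self):
        self.state = RED      # no pipelining until we've seen HTTP/1.1
        self.seen_http11 = False

    def allowed_depth(self):
        if not self.seen_http11 or self.state == RED:
            return 1
        if self.state == YELLOW:
            return 2          # one tentative pipeline of depth 2
        if self.state == PROBE4:
            return 4          # every connection may now pipeline to depth 4
        return MAX_DEPTH_PREF # GREEN: only the pref limits depth

    def on_http11_response(self):
        self.seen_http11 = True
        if self.state == RED:
            self.state = YELLOW

    def on_pipeline_ok(self, depth):
        if self.state == YELLOW and depth >= 2:
            self.state = PROBE4
        elif self.state == PROBE4 and depth >= 3:
            self.state = GREEN    # a depth >= 3 success removes the limits

    def on_negative_event(self):
        self.state = RED          # think time, corruption, cancels, etc.
```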

As mentioned, an unexpected delay of a few hundred milliseconds is considered to be think time and feedback is applied to keep things running smoothly with only a barely perceptible bump in the road. But a longer delay not only applies feedback for future use but it also cancels any transactions pipelined after the currently delayed one and moves them to new connections. This helps mitigate the head of line blocking problem if it is really severe and as a side effect no more pipelines will happen in the near future with that server. In that sense firefox is self correcting in a hostile environment.

All of this negative information expires over time, but while it is still valid Firefox will keep it persistently between restarts - its value was hard earned.

XMLHttpRequests are a special source of pain in traditional pipeline implementations. They often implement long polling patterns where a request is sent to the server and the server intentionally hangs until some external event occurs - it then shares that information with the client by forming a response. It is essentially server push from a data point of view implemented with a long polling request. For that reason, XHR is by default classified as solo, but meta data can be supplied in the form of an HTTP request header to allow the application to indicate the request is expected to be fulfilled quickly.

As they say on late night TV, "But Wait! There's more!" Sometimes the pipelining infrastructure problem isn't on the server side, sometimes it is a proxy that is part of the client's topology. To test for that there is a new startup test which initiates a deep pipeline to a known pipeline friendly resource on the Internet. This server intentionally defers its responses until a deep pipeline has been received there and only then leaks a small response. Both sides continue a pipeline on the same connection until it has been confirmed that an entire window of data can be supported by the HTTP devices on that path. The results of this test are cached for a very long time. This transfer also takes the opportunity to share with the client a list of host names that should be blacklisted with respect to pipelining because of known operability problems. (You can still use them - just not pipeline with them). The location (and enablement) of the test and hostname blacklist server are configurable.

All of this feedback and history tracking sounds intimidating, but the truth is that most of the web works just fine. There are enough problems that a feedback driven self correcting implementation helps smooth out the bumps, but most of the web opens right up without any problems.

Mail your comments to me or provide a talk back link.

The Value of HTTP Pipelines

For the past few months I've been on a personal quest to implement a safe and effective HTTP pipelining strategy for Mozilla Firefox. The pipeline concept is part of HTTP/1.1 but for various reasons has not been widely deployed on the web.

I'm going to make a series of three posts. This one will be basic background information - what pipelines are in the abstract and why they are more relevant to today's web than they have been in years. The second post will be about the details of my Firefox implementation and its mechanisms for dealing with the realities of the occasional pipeline-hostile piece of infrastructure. The last post will share some performance metrics of those patches in common use cases.

HTTP Pipelines - 101

A client forms a pipeline simply by sending a second HTTP request down an HTTP connection before it has received the first response. A pipeline may be more than two transactions deep. HTTP responses are still required to be returned in the order their requests arrived.
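On the wire the whole idea is nothing more than writing the second request before the first response has arrived. A bare-bones sketch follows; the host is only a stand-in and is assumed to support persistent connections, and the response reading is deliberately crude.

```python
# Minimal illustration of a depth-2 pipeline: the second GET is sent before
# the first response arrives. Responses come back in request order.
import socket

HOST = "example.com"          # stand-in host, assumed to allow keep-alive
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)

with socket.create_connection((HOST, 80), timeout=10) as s:
    s.sendall((request + request).encode("ascii"))   # two requests, back to back
    data = b""
    while data.count(b"HTTP/1.1 ") < 2:              # crude: wait for both status lines
        chunk = s.recv(4096)
        if not chunk:
            break
        data += chunk
    print(data.decode("latin-1", "replace")[:200])
```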

This is a simple concept, but the benefits are profound.

Chiefly, the pipelined transactions benefit by eliminating the latency involved in sending their request. The greater the latency, the greater the benefit.

This crude diagram shows 2 normal HTTP transactions without pipelining. In it each request is shown with a rightward arrow and the response is shown with a leftward arrow. The empty space represents the network latency. When the first response is received, the second is sent.

When pipelining is used, the picture becomes different. Even though the request and response sizes are the same, and the round trip time between the hosts is the same, the overall transaction completes sooner. It is obvious that the round trip latency between the client and the server has been mitigated as a problem.

The greater the latency and the deeper the pipeline the more benefit is captured.

While bandwidth continues to improve, latency is not keeping up. The trend is toward mobile networks and those have latencies 5x to 20x worse than broadband WAN connections. This increased latency is an opportunity for pipelining.

HTTP Pipelines - 201

Conceptually, parallel HTTP connections can garner similar benefits. After all, independent HTTP connections do not need to wait for each other to finish before beginning their work.

Up to a point, that is completely true. Parallelism has been the mainstay for many years and served us well. But it has its limits.

The chief limit is actually specified. No more than 6 (or 4, or 2 depending on the point in history) simultaneous connections are supposed to be allowed from the user agent to the server. This is a serious handicap - Firefox may easily have a request queue of over 100 images, stylesheets, and javascript items to fetch upon parsing a Facebook HTML page full of photos, emoticons, ads, and widgets. A single digit number of connections is far too small to effectively parallelize that download.

This brings us to the reason there is a limit on the number of connections. Making a new TCP connection sucks. Before it can be used at all it requires a high latency three way handshake and, if it is over TLS, an expensive RSA operation on the server (and yet another round trip). Making the new connection requires access to shared data structures on the server in a way that using an existing connection does not, and that harms scalability. Servers that can pump multiple gigabits a second of TCP data in their sleep through a few connections can still only initiate on the order of tens of thousands of connections a second.

Maintaining all those connections is expensive for the server too. Each one takes state (i.e. RAM in the kernel, perhaps a thread stack in the application), and the server now has to sort through more control blocks and application states every time a packet comes in, in order to match it with the right one. Because of the increased diversity of TCP control blocks in use, the L2/L3 cache is severely polluted, which is the number one factor in high performance TCP.

Even if pipelines only deliver browser performance equal to parallel connections while using fewer of them, that is still a win for Web scalability.

HTTP Pipelines - 301

The details of TCP start to weigh in heavily in favor of pipelining when parallelism and pipelining are compared.

Most significantly, TCP slow-start performs poorly in the parallelized environment. To recap what you already know: Slow start requires the sender to send a conservative amount of data at first (typically 3 or 4 packets) and then wait a full round trip time to receive positive feedback in the form of an ACK from the recipient. At that time it can grow the window a little bit and send another burst and then wait again. After enough iterations of this the sender is generating enough traffic in each burst to "fill the window" and it no longer has to wait.

Pipelining, of course, does not eliminate slow start. But it does use fewer connections to move the same amount of data and the slow-start penalty scales with the number of connections not the amount of data. Fewer connections mean less data is transferred under the artificially slow conditions of slow start.

There is another wrinkle that applies even to parallel connections that have paid their start-up dues and progressed past slow start. TCP connections that have not sent data within an RTO (think of an RTO as a round trip time plus a grace period) are supposed to go back to slow start! From a TCP point of view this makes some sense - the inactivity means the TCP stack has lost the ack-clock it needs to use the connection aggressively. But even for persistent use of a parallel HTTP connection this is a disaster. Effectively each transaction response must go through slow start, because it takes more than a round trip for the last packet of the previous response to travel to the client, the client to form the next request, the request to travel to the server, and the server to form the response. Pipelined connections do not have that problem - one response follows immediately on the heels of the previous response, which ensures optimal TCP use of the available bandwidth. Huzzah.

Lastly, it is worth talking about Google's push for larger initial windows during slow start. They support a value of 10 instead of the current 3 or 4. Basically I think this is a good thing - the Internet can support larger bursts than we are currently sending. Unfortunately, this has a terrible potential interaction with parallel connections. 6 connections would essentially mean 60 congestion control packet credits available to be sent in a burst at any time. Just as 3 is too few for a single connection environment, 60 is too many for a multiple connection environment. Whereas pipelines reduce the reliance on parallel connections, they bring down the total number of packet credits outstanding at any time while still allowing for more effective slow start capacity probing. That's a win-win in my book.
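The interaction is easy to see numerically. A quick sketch, using a 1460-byte MSS and the round-number window sizes discussed above:

```python
# Aggregate burst credit at connection start: the per-connection initial
# congestion window multiplied by the number of parallel connections.
MSS = 1460

def initial_burst_bytes(connections, init_window_segments):
    return connections * init_window_segments * MSS

print(initial_burst_bytes(1, 3))     # ~4 KB  : one connection, IW=3
print(initial_burst_bytes(1, 10))    # ~15 KB : one connection, IW=10
print(initial_burst_bytes(6, 10))    # ~88 KB : six connections, IW=10 -> 60 packet credits
```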

Wednesday, September 1, 2010

Long Out Of Order Queueing Delays

Last week I posted about a kernel patch that records how long out of order TCP packets are kept hidden from userspace while the kernel tries to fill in the necessary holes to create an in-order stream.

These packets are especially frustrating - they have arrived at the host but the application does not have access to them until the kernel can create an in-order stream. Some applications that are really doing messaging over TCP (which might be sensible if you're looking for congestion control, maybe layered security, maybe multiplexing different semantic streams onto one TCP stream and the loss is localized to one of them, etc..) might be capable of moving on with their lives (and their data) more quickly if they had access to the missing sequence numbers. So the question is, how long are these kinds of applications waiting for in-order data when out-of-order data is already at their host?

I ran that code for about 10 days on my desktop which runs typical American broadband with normal RTTs anywhere from 40 to 150ms. The host is even SACK enabled. Here is what I found:

* 38,881 web flows (port 80 or 443)

* 164 flows that contained reordered packets. That is 4.1 per thousand.

* 915K total packets. 18,169 of them reordered. That is 19.8 per thousand.

* The flows with reordering account for just .4% of the flows, but 40% of the total packets. Obviously, the bigger you are the more likely you are to experience a reordering event.. but furthermore small flows sometimes don't reorder at all because any loss that impacts them is more likely to be repaired with an empty window and a timeout.

* If you select for just the .41% of flows that experience reordering a whopping 4.6% of the packets in that flow are reordered on average. Indeed this average flow is pretty large - 2418 packets and 110 of them are delayed due to being out of order. The average size of a flow that did not experience any reordering was just 24 data packets long. The fact that reordering events are such large clusters is probably good news - it likely means that we were seeing big windows of data on the wire and just a small amount of the early packets in that window were lost.. the rest of the window is counted as out-of-order. We want to see big windows in flight - so I'm good with that.

* The average length of a reordered packet is 1424 bytes. Over 97% of these reordered packets are at least 1300 bytes long. This isn't that interesting, but I wrote it down - so here it is.

When I talk about "N packets long" I mean my host received N packets with data in them.. bare acks and control packets are not counted.

So far, that's not too bad. Big flows have this happen all the time. Reordering is basically a pre-requisite for doing any kind of TCP fast recovery in the face of packet loss so it looks good. If we assume that the reordering is due to small packet losses which can be repaired with fast-retransmit algorithms, which seems to make sense, then the out of order problem should be repaired in a little over 1 RTT, right?

Unfortunately, when I plotted the delays incurred, they look a lot bigger than the distribution of RTTs I see from my desktop. A lot bigger.



There is a really long tail on that graph - and it only captures the best 97% of the data points. The longest I saw any packet wait in the reorder queue (and make it out again) was a full 2.5 minutes.

Even though 2.5 minutes is an aberration, the normal cases are still pretty awful. The median time out of order packets spend queued in the kernel while waiting for an in-order stream is 293 milliseconds! Ouch.

Let's zoom in on that graph - this shows the 90% of the packets that waited the shortest amount of time:



That's pretty ugly, you need to budget a full second wait to get 80% of the reordered packets out of their kernel limbo.

It is much much uglier than I expected.

It's not the reordering that bothers me - big reordering runs are to be perfectly expected in the face of a little packet loss and it is good to use that bandwidth while the loss is repaired.

But why is it taking so long to repair?