Tuesday, January 11, 2011

Firefox Idle Connection Selection via CWND

When choosing between more than one idle persistent connection, FF <= 4.0 goes with a FIFO approach. I was thinking about ways to tune this.

I was going to experiment with making that a LIFO. A LIFO, after all, should have better cache behavior and would allow the connection cache to shrink to a more natural size as the older members remain untouched and time out. The FIFO basically keeps the pool pinned at its maximum size while it cycles through all the connections, which wastes system RAM with both the client and server maintaining extra idle TCP sessions. It is also the worst possible processor cache behavior. The possible argument in favor of FIFO is that connections are expensive to open and we've already opened these, so maybe we want to keep the pool at its maximum size just in case we need it again - it isn't obvious what to do, or whether it matters much.
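For concreteness, here is a minimal sketch of the two policies over the same idle list. The types and names are hypothetical, not the actual Necko data structures:

    #include <deque>

    struct IdleConnection;   // an idle persistent connection to the server

    // Connections are appended to the back as they go idle, so the front
    // holds the oldest entries.

    // FIFO: reuse the oldest idle connection, cycling through the whole pool.
    IdleConnection* TakeFifo(std::deque<IdleConnection*>& idleList) {
        IdleConnection* conn = idleList.front();
        idleList.pop_front();
        return conn;
    }

    // LIFO: reuse the most recently idled connection; the older entries at
    // the front sit untouched until the idle timeout reaps them, letting the
    // pool shrink to a more natural size.
    IdleConnection* TakeLifo(std::deque<IdleConnection*>& idleList) {
        IdleConnection* conn = idleList.back();
        idleList.pop_back();
        return conn;
    }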

Thinking a little further, I realized that the major differentiator between these sockets is not a timestamp at all - it is the CWND associated with them on the server. Many web servers at least partially ignore the TCP suggestion to return to slow start after an RTO's worth of idle time, so it is reasonable to assume that some of the connections have larger CWNDs than others - as long as we aren't in an environment where the previous transfers were actually bottlenecked by bandwidth, and on the web that almost never happens; RTT and document size bottleneck most transfers. Even if available bandwidth is the real bottleneck, that just moots our metric; it doesn't provide any information that steers us the wrong way.

When choosing which connection to use, we want the one with the largest CWND on the server. Unfortunately we cannot see directly into the congestion information on the server, but assuming that previous transfers have been bottlenecked by RTT (approximately a constant for all connections to the same server) and transfer size, we can use the connection with a history of moving the largest amount of data in one burst as a proxy for it, because that burst is what opens the congestion window.

I say one burst instead of one document because we want to include pipelines of multiple HTTP transactions, as long as there wasn't an RTT between them. This is another reason to want pipelines - they will more effectively open up the CWND for us. We also want to use the maximum burst seen on the connection, not necessarily the last burst - the servers we are interested in don't shrink the CWND just because it isn't all being used for each burst.
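To make the "maxburst" bookkeeping concrete, it can be as simple as the following sketch. The struct and function names are hypothetical, just to show the accounting:

    #include <algorithm>
    #include <cstdint>

    // Hypothetical per-connection accounting.
    struct ConnectionStats {
        uint64_t currentBurst = 0;  // bytes moved since the connection last went idle
        uint64_t maxBurst = 0;      // largest burst ever seen on this connection
    };

    // Called as response data arrives. Back-to-back (e.g. pipelined) responses
    // with no idle RTT between them accumulate into the same burst.
    void OnBytesReceived(ConnectionStats& stats, uint64_t bytes) {
        stats.currentBurst += bytes;
        stats.maxBurst = std::max(stats.maxBurst, stats.currentBurst);
    }

    // Called when the connection goes idle long enough that the next transfer
    // counts as a new burst. maxBurst is deliberately never reduced: the
    // servers we care about keep their CWND even when a later burst doesn't
    // fill it.
    void OnBurstEnded(ConnectionStats& stats) {
        stats.currentBurst = 0;
    }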

The implementation is easy - the idle connection list is changed from being sorted as a FIFO to being sorted according to this "maxburst" metric. The code and bugzilla are here.
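A rough sketch of the idea (again with hypothetical names, not the actual patch) looks like this - idle connections are inserted in sorted order by maxburst, and reuse always takes the head of the list:

    #include <cstdint>
    #include <list>

    struct IdleConnection {
        uint64_t maxBurst;   // largest burst ever moved on this connection
        // ... socket state, timestamps, etc.
    };

    // Idle connections to a given host, largest maxBurst first.
    void AddIdleConnection(std::list<IdleConnection*>& list, IdleConnection* conn) {
        auto it = list.begin();
        while (it != list.end() && (*it)->maxBurst >= conn->maxBurst)
            ++it;
        list.insert(it, conn);
    }

    // Reuse takes the front of the list - the connection whose history
    // suggests the largest server-side CWND.
    IdleConnection* TakeIdleConnection(std::list<IdleConnection*>& list) {
        if (list.empty())
            return nullptr;
        IdleConnection* conn = list.front();
        list.pop_front();
        return conn;
    }

In this sketch, connections with equal maxburst values keep their insertion order, so connections that have never moved much data still fall back to the old FIFO behavior among themselves.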

The results, from an experiment designed to show the best case, are better than I expected for such a minor tweak. This was my process:
  • construct a base page with a mix of several small and several large images, plus a link to a 25KB image. There are 6 objects on the base page.
  • load the base page - FF4 will use six parallel connections to do it
  • click on the link to the 25KB image - this will use an idle persistent connection. Measure the performance of this load.
There is 250ms of RTT and several megabits of bandwidth between the client and server in my test.

As expected, vanilla FF 4.0 (beta 9) loads the target image on an idle persistent connection that was used to transfer one of the smallest images on the base page. It is FIFO after all, and the small image load was going to finish first and be put into the persistent connection pool first.



My patched browser, using the sort-by-CWND algorithm, loads the target image on the idle persistent connection that was used to transfer the largest image (2MB+) from the base page.


Their history is the only difference between the two connections - at the time the 25KB image is requested, they are both connected and idle. There is a 250ms RTT between my host and the server.
 

The difference is huge. The default algorithm takes 3 round trips to transfer the document as it works its way through growing the congestion window. That adds up to 793ms. The sort-by-cwnd algorithm inherits a congestion window large enough for the task at hand and moves it all in one stream - just 363ms.
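A back-of-envelope model shows where those three round trips come from. The specific numbers below are illustrative assumptions (an initial window of about 4KB that doubles every round trip, and an inherited window of 32KB), not values taken from the trace:

    #include <cstdint>
    #include <cstdio>

    // Rough model: the server sends one cwnd of data per round trip, and the
    // cwnd doubles each round until the response is fully delivered.
    int RoundTripsToDeliver(uint64_t responseBytes, uint64_t initialCwndBytes) {
        int rounds = 0;
        uint64_t cwnd = initialCwndBytes;
        while (responseBytes > 0) {
            ++rounds;
            responseBytes -= (responseBytes < cwnd) ? responseBytes : cwnd;
            cwnd *= 2;
        }
        return rounds;
    }

    int main() {
        const uint64_t kResponse = 25 * 1024;   // the 25KB target image
        // A connection that only ever carried a tiny image: cwnd still near
        // its (assumed) initial value.
        std::printf("small cwnd: %d round trips\n",
                    RoundTripsToDeliver(kResponse, 4 * 1024));    // -> 3
        // A connection that moved a 2MB image: cwnd already larger than 25KB.
        std::printf("large cwnd: %d round trips\n",
                    RoundTripsToDeliver(kResponse, 32 * 1024));   // -> 1
        return 0;
    }

At roughly 250ms per round trip, three rounds versus one is consistent with the measured 793ms versus 363ms once request and delivery overhead are added in.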

This is a nice tweak, but it has its limitations - by definition you cannot meaningfully multiply the gain across a large number of transactions. If you have a large number of transactions, then you are probably using all of your idle connections in a burst, and there is no point in discriminating between them if you are just going to use them all.

I would argue that if you have that much pressure on the connection pool, you are probably queueing requests and should be using pipelining. If you don't have that much pressure, then this probably helps you.