Saturday, December 8, 2012

Managing Bandwidth Priorities In Userspace via TCP RWIN Manipulation

This post is a mixture of research and speculation. The speculative thinking is about how to manage priorities among a large number of parallel connections. The research is a survey of how different OS versions handle attempts by userspace to actively manage the SO_RCVBUF resource.

Let's consider the case where a receiver needs to manage multiple incoming TCP streams and the aggregate sending windows exceed the link capacity. How do they share the link?

There are lots of corner cases, but it basically falls into two categories: longer-term flows try to balance themselves more or less evenly with each other, while new flows (or flows that have been quiescent for a while) blindly inject data into the link based on their Initial Window configuration (3, 10, or even 40 [yes, 40] packets).

This scenario is a pretty common one for a web browser on a site that does lots of sharding. It is common these days to see 75, 100, or even 200 parallel connections used to download a multi-megabyte page. Those different connections have an internal priority to the web browser, but that information is more or less lost when HTTP and TCP take over - the streams compete with each other blind to what they carry or how large that content is. If you have 150 new streams sending IW=10, that means 1500 packets can be sent more or less simultaneously even if the data is low priority. That kind of burst is going to dominate most links and cause losses on others. Of course, unknown to the browser, those streams might only have one packet worth of data to send (small icons, or 304s) - it would be a shame to dismiss parallelism in those cases out of fear of sending too fast. Capping parallelism at a lower number is a crude approach that hurts many use cases and is very hard to tune in any event.

The receiver does have one knob to influence the relative weights of the incoming streams: the TCP Receive Window (rwin). The sender generally determines the rate a stream can transfer at (via the CWND calculation), but that rate is limited to no more than rwin allows.

I'm sure you see where this is going - if the browser wants to squelch a particular stream (because the others are more important) it can reduce the rwin of that stream to below the Bandwidth Delay Product of the link - effectively slowing just that one stream. Voilà - crude priorities for the HTTP/TCP connections! (RTT is going to matter a lot at this point for how fast they run - but if this is just a problem about sharding then RTTs should generally be similar amongst the flows, so you can at least get their relative priorities straight.)
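
To put rough numbers on that: a stream's rate is bounded by roughly min(cwnd, rwin) / RTT, so an rwin well under the bandwidth-delay product caps that one stream no matter what the sender's cwnd would allow. A back-of-the-envelope sketch with assumed figures:

```cpp
#include <cstdio>

// Back-of-the-envelope illustration (all figures assumed): a TCP stream's
// throughput is bounded by roughly min(cwnd, rwin) / RTT, so advertising an
// rwin far below the link's bandwidth-delay product throttles that one
// stream regardless of what the sender's cwnd would otherwise permit.
int main() {
    const double rtt_s      = 0.100;                 // 100 ms round trip
    const double link_bps   = 10e6;                  // 10 Mbit/s access link
    const double bdp_bytes  = link_bps / 8 * rtt_s;  // ~125 KB to fill the pipe
    const double rwin_bytes = 4 * 1024;              // squelched stream: 4 KB rwin

    const double cap_bps = rwin_bytes * 8 / rtt_s;   // ~328 kbit/s ceiling
    std::printf("BDP %.0f bytes; squelched stream capped at %.0f bit/s (%.1f%% of link)\n",
                bdp_bytes, cap_bps, 100.0 * cap_bps / link_bps);
    return 0;
}
```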

setsockopt(SO_RCVBUF) is normally how userspace manipulates the buffers associated with the receive window. I set out to survey common OS platforms to see how they handle dynamic tuning of that parameter in order to manipulate the receive window.
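
The survey below boils down to variations on this basic probe; here is a minimal POSIX-flavored sketch (the Windows cases use the same socket option, modulo the ioctl noted below), with the helper name being mine:

```cpp
#include <cstdio>
#include <sys/socket.h>

// Read the current receive buffer, request a new size on a live socket, and
// read it back.  Whether (and how quickly) the change is reflected in the
// advertised rwin is what differs across platforms.  Note that Linux reports
// back roughly double the requested value because the kernel accounts for
// bookkeeping overhead.
static void adjust_rcvbuf(int fd, int requested_bytes) {
    int val = 0;
    socklen_t len = sizeof(val);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len);
    std::printf("SO_RCVBUF before: %d\n", val);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested_bytes, sizeof(requested_bytes)) != 0)
        std::perror("setsockopt(SO_RCVBUF)");

    len = sizeof(val);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len);
    std::printf("SO_RCVBUF after requesting %d: %d\n", requested_bytes, val);
}
```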

WINDOWS 7
Win7 does the best job; it allows SO_RCVBUF to both dynamically increase and decrease rwin (decreases require the WsaReceiveBuffering socket ioctl). Increasing a window takes effect immediately and a window update is generated for the peer right away. However, when decreasing a window (and this is true of all platforms that allow decreasing), the window is not slammed shut - it naturally degrades with data flow and is simply not recharged as the application consumes the data, which results in the smaller window.

For instance, suppose a stream has an rwin of 64KB and decreases it to 2KB. A window update is not sent on the wire even if the connection is quiescent. Only after 62KB of data has been sent does the window shrink to 2KB, even if the application has consumed that data from the kernel - future reads will recharge rwin back to a maximum of 2KB. This process is not accelerated in any way by the decrease of SO_RCVBUF, no matter how long it takes. Of course, there would always be 1 RTT between the decrease of SO_RCVBUF and the time it takes effect (during which the sender could send larger amounts of data), but by not "slamming" the value down with a window update (which I'm not even certain would be TCP compliant) that period of time is extended from 1 RTT to indefinite. Indeed, the application doesn't need SO_RCVBUF at all to achieve this limited form of decreasing the window - it can simply stop consuming data from the kernel (or pretend to do so by using MSG_PEEK) and that would be no more or less effective.
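
For completeness, the non-SO_RCVBUF emulation mentioned above is just a matter of not draining the socket; a tiny sketch:

```cpp
#include <sys/types.h>
#include <sys/socket.h>

// "Pretend to stop consuming": data inspected with MSG_PEEK stays queued in
// the kernel, so the advertised window runs down as the peer sends and is
// never recharged.  A later plain recv() drains the bytes and lets the
// window grow back toward whatever SO_RCVBUF allows.
static ssize_t peek_without_draining(int fd, char *buf, size_t len) {
    return recv(fd, buf, len, MSG_PEEK);
}
```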

If we're trying to manage a lot of parallel busy streams this strategy for decreasing the window will work ok - but if we're trying to protect against a new/quiescent stream suddenly injecting a large amount of data it isn't very helpful. And that's really the use case I have in mind.

The other thing to note about Windows 7 is that the window scale option (which effectively controls how large the window can get and is not adjustable after the handshake) is set to the smallest possible value for the SO_RCVBUF set before the connect. If we agree that starting with a large window on a squelched stream is problematic because decreases don't take effect quickly enough, that implies the squelched stream needs to start with a small window. Small windows do not need window scaling. This isn't a problem for the initial squelched state - but if we want to free the stream up to run at a higher speed (perhaps because it now has the highest relative priority of active streams after some of the old high-priority ones completed), the maximum rwin is going to be 64KB - going higher than that requires window scaling. A 64KB window can support significant data transfer (e.g. 5 megabit/sec at 100ms RTT) but certainly doesn't cover all the use cases of today's Internet.
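
To put some numbers on the scale/ceiling relationship: the advertised 16-bit window is shifted left by the scale negotiated at the handshake, so the roughly-64KB ceiling falls out of a scale of 0. A small illustrative calculation (helper name and figures are mine):

```cpp
#include <cstdint>
#include <cstdio>

// The advertised 16-bit window is shifted left by the scale negotiated at
// connection setup, so the largest window the connection can ever use is
// 65535 << scale.  A connection that starts with a small buffer negotiates
// scale 0 and is stuck with a ~64KB ceiling for its lifetime.
static int wscale_needed(uint32_t target_window_bytes) {
    int scale = 0;
    while (scale < 14 && (65535u << scale) < target_window_bytes)
        ++scale;
    return scale;
}

int main() {
    // Target window is roughly the bandwidth-delay product (100 ms RTT assumed).
    std::printf("5 Mbit/s   (~62 KB)   -> scale %d\n", wscale_needed(62500));    // 0
    std::printf("25 Mbit/s  (~312 KB)  -> scale %d\n", wscale_needed(312500));   // 3
    std::printf("100 Mbit/s (~1.25 MB) -> scale %d\n", wscale_needed(1250000));  // 5
    return 0;
}
```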

WINDOWS VISTA
When experimenting with Vista I found behavior similar to Windows 7. The only difference I noted was that it always used a window scale in the handshake, even if an initial rwin < 64KB was going to be used. This allows connections with small initial windows to be raised to greater sizes than is possible on Windows 7 - which, for the purposes of this scheme, is a point in Vista's favor.

I speculate that the change was made in Windows 7 to increase interoperability - window scale is sometimes an interop problem with old NATs and firewalls, and the OS clearly doesn't anticipate the rwin range being actively manipulated while the connection is established. Therefore, if window scale isn't needed for the initial window size, you might as well omit it out of an abundance of caution.

WINDOWS XP
By default XP does not do window scaling at all, limiting us to 64KB windows; multiple connections are therefore required to use all the bandwidth found in many homes these days. It doesn't allow shrinking rwin at all (but as we saw above, the Vista/Win7 approach to shrinking the window isn't any more useful than what can be emulated through non-SO_RCVBUF approaches).

XP does allow raising the rwin, so a squelched connection could be set up with a low initial window and then raised to 64KB when its relative ranking improved. The sticky wicket here is that attempts to set SO_RCVBUF below 16KB appear not to work. A 16KB window covers a full IW=10 burst (10 full-sized segments of 1460 bytes fit within 16KB) - having a large set of new squelched connections, each with the capacity to send 10 packets, probably doesn't meet the threshold of squelched.

OS X (10.5)
The latest version of OS X I have handy is dated. Nothing in a Google search leads me to believe this has changed since 10.5, but I'm happy to take updates.

OS X, like XP, does not allow decreasing a window through SO_RCVBUF.

It does allow increasing one if the window size was set before the connection - otherwise "auto tuning" is used and cannot be manipulated while the connection is established.

Like Vista, the initial window determines the scaling factor; assuming a small window on a squelched stream, that means window scaling is disabled and the window can only be increased to 64KB for the life of the connection.

LINUX/ANDROID
Linux can decrease rwin, but as with the other platforms it requires data transfer to do so rather than slamming the window shut within 1 RTT. Linux does not allow increasing the window past the size it had when the connection was established. So you can start a window large, slowly decrease it, and then increase it back to where it started - but you can't start a window small and then increase it, as you might want to do with a squelched stream.
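
Given that behavior, the only ordering that appears workable on Linux is a sketch like the following: fix a generous SO_RCVBUF before connect() (which, I believe, also opts the socket out of the kernel's receive autotuning), then shrink it to squelch and restore it when the stream's priority rises. Sizes and helper names are just illustrative:

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Set the ceiling before connect(), then move SO_RCVBUF up and down between
// that ceiling and a small squelch value as relative priorities change.
static int connect_with_managed_rwin(const sockaddr_in &peer) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    int big = 256 * 1024;                                        // ceiling for this stream
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &big, sizeof(big));    // before connect
    if (connect(fd, reinterpret_cast<const sockaddr *>(&peer), sizeof(peer)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

static void squelch(int fd) {
    int small = 4 * 1024;                              // window drains down to this
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &small, sizeof(small));
}

static void unsquelch(int fd) {
    int big = 256 * 1024;                              // back up to the original size
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &big, sizeof(big));
}
```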

It pains me to say this, and it is so rarely true, but this makes Linux the least attractive development platform for this approach.

CONCLUSIONS
From this data it seems viable to attempt a strategy for {Windows >= Vista and OS X} that mixes 3 types of connections (a small sketch follows the list):
  1. Fully autotuned connections. These are unsquelched, can never be slowed down, generally use window scaling, and are capable of running at high speeds.
  2. Connections that begin with small windows and are currently squelched to limit the impact of new low-priority streams in highly parallel environments.
  3. Connections that were previously squelched but have now been upgraded to 64KB windows - "almost full" versions of (1), if you will.
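
A minimal sketch of the bookkeeping such a mixed strategy implies might look like the following; the enum, struct, and helper are hypothetical names, not code from any browser:

```cpp
#include <cstdint>

// The three connection classes from the conclusion, as a connection manager
// might track them.
enum class ConnClass {
    kAutotuned,        // 1: never squelched, window scaling on, full speed
    kSquelchedSmall,   // 2: opened with a tiny rwin to blunt new-stream bursts
    kUpgraded64K       // 3: formerly squelched, raised to the 64KB ceiling
};

struct ManagedConn {
    int       fd = -1;
    ConnClass cls = ConnClass::kSquelchedSmall;
    uint32_t  priority = 0;   // browser-assigned; higher is more important
};

// When higher-priority work completes, promote the most important squelched
// connection; on Windows >= Vista and OS X that is just a larger SO_RCVBUF,
// capped at 64KB because the window scale was fixed at 0 in the handshake.
static void maybe_promote(ManagedConn &c) {
    if (c.cls == ConnClass::kSquelchedSmall) {
        // setsockopt(c.fd, SOL_SOCKET, SO_RCVBUF, ...) up to 64KB goes here.
        c.cls = ConnClass::kUpgraded64K;
    }
}
```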

Wednesday, December 5, 2012

Smarter Network Page Load for Firefox

I just landed some interesting code for bug 792438. It should be in the December 6th nightly, and if it sticks it will be part of Firefox 20.

The new feature keeps the network clear of media transfers while a page's basic html, css, and js are loaded. This results in a faster first paint, as the elements that block initial rendering are loaded without competition. Total page load is potentially slightly regressed, but the improved responsiveness more than makes up for it. Similar algorithms report first paint improvements of 30% with page load regressions around 5%. There are some cases where the effect is a lot more obvious than 30% - try pinterest.com, 163.com, www.skyrock.com, or www.tmz.com - all of these sites have large amounts of highly parallel media on the page that, in previous versions, competed with the required css.

Any claim of benefit is going to depend on bandwidth, latency, and the contents of your cache (e.g. if you've already got the js and css cached, this is moot; likewise if you have a localhost connection like talos tp5 uses, because bandwidth and latency are essentially ideal there).

Before I get to the nitty-gritty, I think it's worth a paragraph of Inside-Mozilla-Baseball to mention what a fun project this was. I say that in particular because it involved a lot of cross-team participation on many different aspects (questions, advice, data, code in multiple modules, and reviews). I think we can all agree that when many different people are involved in the same effort, efficiency is typically the first casualty. Perhaps this project is the exception needed to prove the rule - it went from a poorly understood use case to committed code very quickly. Ehsan, Boris, Dave Mandelin, Josh Aas, Taras, Henri, Honza Bambas, Joe Drew, and Chromium's Will Chan helped one and all - it's not too often you get the rush of everyone rowing in sync, but it happened here and it was awesome to behold and good for the web.

In some ways this feature is not intuitive. Web performance is generally improved by adding parallelization, because of the large amounts of unused bandwidth left over by the single-session HTTP/1 model. In this case we are actually reducing parallelization, which you wouldn't think would be a good thing. Indeed, that is why it can regress total page load time for some use cases. However, parallelization is a double-edged sword.

When parallelization works well it is because it helps cover up moments of idle bandwidth in a single HTTP transaction (e.g. latency in the handshake, latency during the header phase, or even latency pauses involved in growing the TCP congestion window) and this is critically important to overall performance. Servers engage in hostname sharding just to opt into dozens of parallel connections for performance reasons.

On the other hand, when it works poorly, parallelization kills you in a couple of different ways. The more spectacular failure mode I'll be talking about in a different post (bug 813715), but briefly: oversubscription of the link creates induced queues and TCP losses that interact badly with TCP congestion control and recovery. The issue at hand here is more pedestrian - if you split the link across 50 connections and all 50 sessions are busy, then each one gets just 2% of the bandwidth. They are inherently fair with each other and without priority, even though that doesn't reflect their importance to your page.

If the required js and css is only getting 10% of the bandwidth while the images are getting 90%, then the first meaningful paint is woefully delayed. The reason you do the parallelism at all is that many of those connections will be going through one of the aforementioned idle moments and aren't all simultaneously busy - so it's a reasonable strategy as long as maximizing total bandwidth utilization is your goal. But in the case of an HTML page load some resources are more important than others, and it isn't worth sacrificing that ordering to perfectly optimize total page load. So this patch essentially breaks page load into two phases to sort out that problem.
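
To make the two-phase idea concrete, here is a minimal sketch of a scheduler that parks media requests while render-blocking css/js loads are outstanding - an illustration of the concept, not the actual code that landed for bug 792438, and all class and member names are hypothetical:

```cpp
#include <queue>
#include <string>

// Park media requests while any render-blocking css/js load is outstanding;
// release them once the blockers drain.
class TwoPhaseLoadScheduler {
public:
    void StartBlocker(const std::string &url) { ++mOutstandingBlockers; Dispatch(url); }
    void FinishBlocker() {
        if (--mOutstandingBlockers == 0)
            ReleaseParkedMedia();
    }
    void RequestMedia(const std::string &url) {
        if (mOutstandingBlockers > 0)
            mParkedMedia.push(url);   // hold images back for now
        else
            Dispatch(url);
    }

private:
    void ReleaseParkedMedia() {
        while (!mParkedMedia.empty()) {
            Dispatch(mParkedMedia.front());
            mParkedMedia.pop();
        }
    }
    void Dispatch(const std::string &url) { (void)url; /* hand off to the HTTP stack */ }

    int mOutstandingBlockers = 0;
    std::queue<std::string> mParkedMedia;
};
```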

The basic approach here is the same as that used by Webkit's ResourceLoadScheduler, so Webkit browsers already do the basic idea (and have validated it). I decided we wanted to do this at the network layer instead of in content or layout to enable a couple of extra bonuses that Webkit can't provide because it operates at a higher layer:
  1. If we know a priori that a piece of work is highly sensitive to latency but very low in bandwidth, then we can avoid holding it back even if that work is just part of an HTTP transaction. As part of this patchset I have included the ability to preconnect a small quota of 6 media TCP sessions at a time while css/js is loading. More than 6 can complete; it's just limited to 6 outstanding at one instant to bound the bandwidth consumption. This results in a nice little hot pool of connections all ready for use by the image transfers when they are cleared for takeoff (a small sketch of this quota idea follows the list). You could imagine this being slightly expanded in the future to a small number of parallel HTTP media transfers that were bandwidth throttled.
  2. The decision on holding back can be based on whether or not SPDY is in use, if you wait until you have a connection to make the decision - SPDY is a prioritized, muxed-on-one-TCP-session protocol that doesn't need this kind of workaround to do the right thing. In its case we should just send the requests as soon as possible with appropriate priorities attached and let the server do the right thing. The best of both worlds!
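
Here is the quota sketch referred to in item 1 - a toy model under my own naming, not the patch itself:

```cpp
#include <cstddef>
#include <queue>
#include <string>

// Bounded preconnect: while css/js is loading, at most `limit` media TCP
// handshakes may be outstanding at any instant, though any number may
// complete over time.  Finished connections sit idle ("hot") until the
// image transfers are released.
class PreconnectQuota {
public:
    explicit PreconnectQuota(size_t limit = 6) : mLimit(limit) {}

    void Request(const std::string &host) {
        if (mOutstanding < mLimit) {
            ++mOutstanding;
            BeginHandshake(host);     // latency sensitive, nearly free of bandwidth
        } else {
            mPending.push(host);      // wait for a slot to open
        }
    }

    void HandshakeComplete() {        // connection is now parked and ready
        --mOutstanding;
        if (!mPending.empty()) {
            const std::string next = mPending.front();
            mPending.pop();
            ++mOutstanding;
            BeginHandshake(next);
        }
    }

private:
    void BeginHandshake(const std::string &host) { (void)host; /* TCP connect only */ }

    size_t mLimit;
    size_t mOutstanding = 0;
    std::queue<std::string> mPending;
};
```
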
This issue illustrates again how dependent HTTP/1 is on TCP behavior and how that is a shortcoming of the protocol. Reasonable performance demands parallelism, but how much depends on the semantics of the content, the conditions of the network, and the size of the content. These things are essentially unknowable, and even characterizations of the network and typical content change very quickly. It's essential for the health of the Internet that we migrate away from this model onto HTTP/2.

Comments over here please: https://plus.google.com/100166083286297802191