Saturday, December 8, 2012

Managing Bandwidth Priorities In Userspace via TCP RWIN Manipulation

This post is a mixture of research and speculation. The speculative thinking is about how to manage priorities among a large number of parallel connections. The research is a survey of how different OS versions handle attempts by userspace to actively manage the SO_RCVBUF resource.

Let's consider the case where a receiver needs to manage multiple incoming TCP streams and the aggregate sending windows exceed the link capacity. How do they share the link?

There are lots of corner cases but it basically falls into 2 categories. Longer term flows try and balance themselves more or less evenly with each other, and new flows (or flows that have been quiescent for a while) blindly inject data into the link based on their Initial Window configuration (3, 10, or even 40 [yes, 40] packets).

This scenario is a pretty common one for a web browser on a site that does lots of sharding.. It is common these days to see 75, 100, or even 200 parallel connections used to download a multi megabyte page. Those different connections have an internal priority to the web browser, but that information is more or less lost when HTTP and TCP take over - the streams compete with each other blind to the facts of what they carry or how large that content is. If you have 150 new streams sending IW=10 that means 1500 packets can be sent more or less simultaneously even if it is of a low priority. That kind of data is going to dominate most links and cause losses on others. Of course, unknown to the browser, those streams might only have 1 packet worth of data to send (small icons, or 304's) - it would be a shame to dismiss parallelism in those cases out of fear of sending too fast. Capping parallelism at a lower number is a crude approach that will hurt many use cases and is very hard to tune in any event.

The receiver does have one knob to influence the relative weights of the incoming streams: the TCP Receive Window (rwin). The sender generally determines the rate a stream can transfer at (via the CWND calculation), but that rate is limited to no more than rwin allows.

I'm sure you see where this is going - if the browser wants to squelch a particular stream (because the others are more important) it can reduce the rwin of the stream to below the Bandwidth Delay Product of the link - effectively slowing just that one stream. Viola - crude priorities for the HTTP/TCP connections! (RTT is going to matter a lot at this point for how fast they run - but if this is just a problem about sharding then RTTs should generally be similar amongst the flows so you can at least get their relative priorities straight).

setsockopt(SO_RCVBUF) is normally how userspace manipulates the buffers associated with the receive window. I set out to survey common OS platforms to see how they handled dynamic tuning of that parameter in order to manipulate the receiver window.

Win7 does the best job; it allows SO_RCVBUF to both dynamically increase and decrease rwin (Decreases require the WsaReceiveBuffering socket ioctl ). Increasing a window is instantaneous and window update is generated for the peer right away. However when decreasing a window (and this is true of all platforms that allow decreasing) the window is not slammed shut - it naturally degrades with data flow and is simply not recharged as the application consumes the data which results in the smaller window.

For instance a stream has an rwin of 64KB and decreases it to 2KB. A window update is not sent on the wire even if the connection is quiescent.. Only after the 62KB of data has been sent the window shrinks naturally to 2KB even if the application has consumed the data from the kernel - future reads will recharge rwin back to a max of 2KB. But this process is not accelerated in any way by the decrease of SO_RCVBUF no matter how long it takes. Of course there would always be 1RTT of time between the decrease of SO_RCVBUF and the time it takes effect (where the sender could send larger amounts of data) but by not "slamming" the value down with a window update (which I'm not even certain would be tcp compliant) that period of time is extended from 1RTT to indefinite. Indeed, the application doesn't need SO_RCVBUF at all to achieve this limited definition of decreasing the window - it can simply stop consuming data from the kernel (or it can pretend to do so by using MSG_PEEK) and that would be no more or less effective.

If we're trying to manage a lot of parallel busy streams this strategy for decreasing the window will work ok - but if we're trying to protect against a new/quiescent stream suddenly injecting a large amount of data it isn't very helpful. And that's really the use case I have in mind.

The other thing to note about windows 7 is that the window scale option (which effectively controls how large the window can get and is not adjustable after the handshake) is set to the smallest possible value for the SO_RCVBUF set before the connect. If we agree that starting with a large window on a squelched stream is problematic because decreases don't take effect quickly enough, that implies the squelched stream needs to start with a small window. Small windows will not need window scaling. This isn't a problem for the initial squelched state - but if we want to free the stream up to run at a higher speed (perhaps because it now has the highest relative priority of active streams after some of the old high priority ones completed) the maximum rwin is going to be 64KB - going higher than that requires window scaling. a 64KB window can support significant data transfer (e.g. 5 megabit/sec at 100ms rtt) but certainly doesn't cover all the use cases of today's Internet.

When experimenting with Vista I found behavior similar to Windows 7. The only difference I noted was that it always used a window scale in the handshake even if initially a rwin < 64KB was going to be used. This allows connections with small initial windows to be raised to greater sizes than is possible on windows 7 - which for the purposes of this scheme is a point in vista's favor.

I speculate that the change was made in windows 7 to increase interoperability - window scale is sometimes an interop problem with old nats and firewalls and the OS clearly doesn't anticipate the rwin range to be actively manipulated while the connection is established.. therefore if window scale isn't needed for the initial window size you might as well omit it out of an abundance of caution.

By default XP does not do window scaling at all - limiting us to 64KB windows and therefore multiple connections are required to use all the bandwidth found in many homes these days. It doesn't allow shrinking rwin at all (but as we saw above the vista/win-7 approach to shrinking the window isn't more useful than one that can be emulated through non-SO_RCVBUF approaches).

XP does allow raising the rwin. So a squelched connection could be setup with a low initial window and then raised up to 64KB when its relative ranking improved. The sticky wicket here is that it appears that attempts to set SO_RCVFBUF below 16KB don't work. 16KB maps to a new connection with IW=10 - having a large set of new squlched connections all with the capacity to send 10 packets probably doesn't meet the threshold of squelched.

OS X (10.5)
The latest version of OS X I have handy is dated. Nothing in a google search leads me to believe this has changed since 10.5, but I'm happy to take updates.

OS X, like XP, does not allow decreasing a window through SO_RCVBUF.

It does allow increasing one if the window size was set before the connection - otherwise "auto tuning" is used and cannot be manipulated while the connection is established.

Like vista, the initial window determines the scaling factor, and assuming a small window on a squelched stream that means window scaling is disabled and the window can only be increased to 64KB for the life of the connection.

Linux can decrease rwin, but as with other platforms requires data transfer to do it instead of a 1-RTT slamming shut of the window. Linux does not allow increasing the window past the size it has when the connection is established. So you can start a window large, slowly decrease it, and then increase it back to where it started.. but you can't start a window small and then increase it as you might want to do with a squelched stream.

It pains me to say this, and it is so rarely true, but this makes Linux the least attractive development platform for this approach.

From this data it seems viable to attempt a strategy for {windows >= vista and OS X} that mixes 3 types of connections:
  1. Full Autotuned connections. These are unsquelched, can never be slowed down, generally use window scaling and are capable of running at high speeds
  2.  Connections that begin with small windows and are currently squelched to limit the impact of new low priority streams in highly parallel environments
  3.  Connections that were previously squelched but have now been upgraded to 64KB windows.. "almost full" (1) if you will.