Wednesday, April 25, 2012

Making Firefox Search Snappier

The Firefox 15 development window just opened and I checked into inbound a cool feature that had been sitting in my queue for a little while. Let's see if it sticks!

The feature basically lets non-network code hint to the networking layer that it will probably send an HTTP transaction to a particular site soon, but it isn't ready to do so yet. The network code can take the hint and begin to set up a TCP and (if necessary) SSL handshake right away because these are high latency operations.

The primary initial user of this is the search box in Firefox. When you focus on that box we will probably make a connection to the search provider right away. While that happens you type your search terms - when your search is ready (or partly ready if you are using search suggestions) it can be submitted to the search service without any waiting for network setup.

This can be a significant win. The average Internet round trip time is about 100ms (there is a lot of variation in this). It takes 1 RTT to set up TCP and (likely) 2 more for SSL. 300ms is a notable pause, but overlapped in the background it essentially becomes free, resulting in snappier searches that still use a secured transport. If you're using a SPDY enabled search provider such as Google or Twitter, this is done over SPDY, so the one TCP session now established will be able to carry all of your search results - no more setup overhead to worry about with additional parallel connections etc.
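The arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope model: the function names are made up for illustration, and the 1500ms of typing time is an assumed figure, not a measurement.

```python
def setup_cost_ms(rtt_ms, tcp_rtts=1, ssl_rtts=2):
    """Connection setup time before the first request can be sent."""
    return rtt_ms * (tcp_rtts + ssl_rtts)


def wait_after_typing_ms(rtt_ms, typing_ms, speculative):
    """Setup latency the user still sees once the search terms are ready."""
    cost = setup_cost_ms(rtt_ms)
    if not speculative:
        # The handshakes only start when the search is submitted.
        return cost
    # The handshakes started on focus and ran while the user typed.
    return max(0, cost - typing_ms)


print(setup_cost_ms(100))                                            # 300
print(wait_after_typing_ms(100, typing_ms=1500, speculative=False))  # 300
print(wait_after_typing_ms(100, typing_ms=1500, speculative=True))   # 0
```

Anything longer than about 300ms of typing hides the whole setup cost at a 100ms RTT.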

The other user of this feature that got checked in as part of this merge is actually internal to the networking code, just before the cache I/O is done. The amount of time it takes to check the disk cache is extremely variable - afaict it's generally pretty fast but the tail can be really awful depending on hardware, other system activity, OS, etc. So we overlap the network handshakes with the cache activity that figures out the values of the If-Modified-Since (etc.) request headers.

There is an IDL for providing the hint (nsISpeculativeConnect) - so if you can think of other areas ripe for this kind of optimization let's get to it!
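To make the idea concrete outside of Gecko, here is a minimal Python sketch of the same pattern. It is not the nsISpeculativeConnect interface: the class and method names are hypothetical, it only covers the TCP handshake, and the 5 second timeout is an arbitrary assumption.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


class SpeculativeConnector:
    """Hypothetical helper: start the TCP handshake on a hint, hand over the socket later."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._pending = {}  # (host, port) -> Future resolving to a connected socket
        self._lock = threading.Lock()

    def hint(self, host, port):
        """Caller expects to talk to host:port soon; connect in the background."""
        key = (host, port)
        with self._lock:
            if key not in self._pending:
                self._pending[key] = self._pool.submit(
                    socket.create_connection, key, 5.0)

    def take(self, host, port):
        """Return a connected socket, reusing the speculative one if there is one."""
        key = (host, port)
        with self._lock:
            future = self._pending.pop(key, None)
        if future is None:
            # No hint was given, so pay the handshake cost now.
            return socket.create_connection(key, timeout=5.0)
        # Usually already finished; otherwise wait for the remaining part only.
        return future.result()
```

A caller would invoke hint() when the search box gets focus and take() when the query is actually submitted.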

[The best place for comments is probably here: https://plus.google.com/100166083286297802191/posts ]


Monday, April 23, 2012

Welcome Jiten - Building a Networking Dashboard

There are many awesome things about the Firefox networking layer. However, realtime visibility into what it's doing is not on that list.

Thanks to Google Summer of Code, Jiten Thakkar is going to work on that problem this summer. Jiten is already a card carrying Mozillian with code commits in several parts of Gecko and I'm excited to have him focused on necko and an add-on for the next few months.

We hope to learn what connections are being made, how fast they are running, what the DNS cache looks like, how often SPDY is used, and all manner of other information that will aid debugging and inform optimization choices. Maybe even some icing on the cake such as in-browser diagnostic tools (e.g. Can I TCP handshake with an arbitrary hostname? How about HTTP? How about SSL?) to round things out.

Good Stuff!

Thursday, March 8, 2012

Twitter, SPDY, and Firefox

Well - look at what this morning brings: some Twitter.com enabled SPDY goodness in my Firefox nightly! (below)

SPDY is currently enabled in Firefox nightly and you can turn it on in FF 11 and FF 12 through network.http.spdy.enabled in about:config.

There is not yet any SPDY support for the images on the Akamai CDN that Twitter uses, and that's obviously a big part of performance. But still, real deployed users of this now include Twitter, Google Web, Firefox, Chrome, Silk, node, etc. This really has momentum because it solves the right problems.

Big pieces still left are a popular CDN, open standardization, an http<>spdy gateway like nginx, a stable big standalone server like Apache, and support in a load balancing appliance like F5 or Citrix. And the wind is blowing the right way on all of those things. This is happening very fast.

https://twitter.com/account/available_features

GET /account/available_features HTTP/1.1
Host: twitter.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120308 Firefox/13.0a1
[..]
X-Requested-With: XMLHttpRequest
Referer: https://twitter.com/
[..]

HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
Content-Length: 3929
Content-Type: text/javascript; charset=utf-8
Date: Thu, 08 Mar 2012 15:20:56 GMT
Etag: "8f2ef94f3149553a2c68e98a8df04425"
Expires: Tue, 31 Mar 1981 05:00:00 GMT
Last-Modified: Thu, 08 Mar 2012 15:20:56 GMT
Pragma: no-cache
Server: tfe
[..]
x-revision: DEV
x-runtime: 0.01763
[..]
x-xss-protection: 1; mode=block
X-Firefox-Spdy: 1

Monday, January 23, 2012

HTTP-WG Proposal to tackle HTTP/2.0

Huzzah to Mark Nottingham, chair of the IETF HTTP Working Group. He proposes rechartering the group to "specify (sic) HTTP/2.0 an improved binding of HTTP's semantics to the underlying transport."

That's welcome news - the scalability, efficiency, and robustness issues with HTTP/1 are severe problems that deserve attention in an open standards body forum. The HTTP-WG is the right place.

SPDY will certainly be offered as an input to that process and in my opinion it touches on the right answers. But whatever the outcome it is great to see momentum around open standardization of solutions to the transport problems HTTP/1 suffers from.

Saturday, January 7, 2012

Using SPDY for more responsive interfaces

RST_STREAM turns out to be a feature of SPDY that I underappreciated for a long time. The concept is simple enough - either end of the connection can cancel an individual stream without impacting the other streams that are multiplexed on the same connection.

This fills a conceptual gap left by HTTP/1.x - in HTTP, when you want to cancel a transaction, about all you can do is close the connection.

Consider the case of quickly clicking through a webmail mailbox - just scanning the contents and rapidly clicking 'next'. Typically the pages will be only partly loaded before you move on to the next one. Assuming you have used up all your parallelism in HTTP/1, the new click will either have to wait for the old transactions to complete (wasting time and bandwidth) or cancel the old ones by closing those connections and then open fresh connections for the new requests. New connections add 2 or 3 round trip times to reopen the SSL connection (you are reading email over SSL, right?) before they can be used to send the new requests. Either way - that is not a good experience.

An interactive map application has similar attributes - as you scan along the map, zooming in and out, lots of tiles are loaded and are often irrelevant before they are drawn. I'm sure you can think of other scenarios that have cancellations.

SPDY solves this simply - with its inherently much greater parallelism the new requests can be sent immediately and at the same time cancel notifications can go out for the old ones. That saves the bandwidth and gets the new requests going as fast as possible without interfering with either the established connection or any other transactions also in progress.
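For a sense of how cheap a cancel is on the wire, here is a sketch that packs a RST_STREAM control frame following my reading of the SPDY draft framing (control bit, version, type 3, 8 byte payload of stream ID plus status code, with 5 meaning CANCEL). The function name and the default version of 2 are choices made for this example; check the draft you are targeting before relying on the constants.

```python
import struct

TYPE_RST_STREAM = 3  # control frame type, per the SPDY drafts
STATUS_CANCEL = 5    # "stream no longer needed" status code


def rst_stream_frame(stream_id, status=STATUS_CANCEL, version=2):
    """Build the 16 bytes that cancel one stream and leave the others alone."""
    # Word 0: control bit (1) | 15 bit version | 16 bit frame type.
    word0 = 0x80000000 | (version << 16) | TYPE_RST_STREAM
    # Word 1: 8 bits of flags (none) | 24 bit payload length (always 8 here).
    word1 = 8
    # Payload: reserved bit + 31 bit stream ID, then the 32 bit status code.
    return struct.pack("!IIII", word0, word1, stream_id & 0x7FFFFFFF, status)


frame = rst_stream_frame(7)
print(len(frame), frame.hex())
```

Sixteen bytes per abandoned request is a lot less than tearing down and rebuilding an SSL connection.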

A page load time metric won't really show this to you but the increased responsiveness is very obvious when working with these kinds of use cases - especially under high latency conditions that make connection establishment slower.


Sunday, January 1, 2012

A use case for SPDY header compression

A use case for SPDY header compression: http://pix04.revsci.net/F09828/a4/0/0/0.js

380 bytes of gzipped javascript (550 uncompressed), sent with 8.8KB of request cookies and 5.5KB of response cookies. That overhead is bad enough to mess with TCP CWND defaults - which means you are taking multiple round trips on the network just to move half a KB of js. For HTTP, that's a performance killer! Those cookies are repeated almost identically on every transaction with that host.

SPDY's dedicated header contexts and the repetitive nature of cookies mean those cookies can be reduced ~98% for all but the first transaction of the session. Essentially the cookies remain stateless for app developers, along with the nice properties of that, but the transport leverages the connection state to move them much more efficiently.
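The effect is easy to reproduce with plain zlib. The sketch below is an approximation under stated assumptions: it uses HTTP/1 style header text rather than SPDY's name/value block format, no preset dictionary, a made-up host name, and a synthetic 8.8KB cookie. What it does share with SPDY is the important part: one compression context kept alive across transactions and flushed after each header block.

```python
import hashlib
import zlib


def make_cookie(n_bytes):
    """Deterministic stand-in for an opaque tracking cookie (hypothetical data)."""
    chunks = (hashlib.sha256(str(i).encode()).hexdigest()
              for i in range(n_bytes // 64 + 1))
    return "".join(chunks)[:n_bytes]


def header_block(path, cookie):
    """Simplified request headers; example.com is a placeholder host."""
    text = "GET %s HTTP/1.1\r\nHost: example.com\r\nCookie: id=%s\r\n\r\n"
    return (text % (path, cookie)).encode()


def compress_session(blocks):
    """Compress each header block with ONE shared zlib context, as a session would."""
    ctx = zlib.compressobj()
    sizes = []
    for block in blocks:
        # Z_SYNC_FLUSH emits the block now but keeps the history window.
        out = ctx.compress(block) + ctx.flush(zlib.Z_SYNC_FLUSH)
        sizes.append(len(out))
    return sizes


cookie = make_cookie(8800)
blocks = [header_block("/a.js", cookie), header_block("/b.js", cookie)]
first, second = compress_session(blocks)
print("plain:", len(blocks[0]), "first:", first, "second:", second)
```

The first transaction pays close to full price for the cookie; the second one is mostly back-references into what was already sent.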

Thursday, December 8, 2011

SPDY, Bufferbloat, HTTP, and Real-Time Networking

Long router queue sizes on the web continue to be a hot networking topic - Jim Gettys has a long interview in ACM Queue. Large unmanaged queues destroy low latency applications - just ask Randell Jessup.

A paper like this does a good job of showing just how bad the situation can be - experimentally driving router buffering delay from 10ms to ~1000ms on many common broadband cable and DSL modems. I wish the paper had been able to show me the range and frequency of that queue delay under normal conditions.

I'm concerned that decreasing the router buffer size, thereby increasing the drop rate, is detrimental to the current HTTP/1.x web. A classic HTTP/1.x flow is pretty short - giving it a signal to back off doesn't save you much - it has sent much of what it needs to send already anyhow. Unless you drop almost all of that flow from your buffers you haven't achieved much. Further, a loss event has a high chance of damaging the flow more seriously than you intended - a dropped SYN or a dropped last packet of the data train can only be recovered by very slow retry timers, and short flows are made up of a high percentage of exactly these kinds of packets. Non-drop based loss notifications like ConEx/ECN do less damage but are again ineffective, because the short flow is more or less complete when the notification arrives so it cannot adapt the sending rate.

The problem is all of those other parallel HTTP sessions going on that didn't get the message. It's the aggregate that causes the buffer build up. Many sites commonly use 60-90 separate uncoordinated TCP flows just to load one page.

Making web transport more adaptable on this front is a big goal of my SPDY work. When SPDY consolidates resources onto the same TCP flow it means the remaining larger flows will be much more TCP friendly. Loss indicators will have a fighting chance of hitting the flow that can still back off, and we won't have windows growing independently of each other. (Do you like the sound of IW=10 times 90? That's what 90 uncorrelated flows mean. IW=10 of a small number of flows otoh is excellent.) That ought to keep router queue sizes down and give things like rtcweb a fighting chance.
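To put a number on "IW=10 times 90", here is a tiny worked calculation. The 1460 byte MSS is an assumption (typical for Ethernet paths), and the function name is just for this example.

```python
MSS_BYTES = 1460  # assumed segment size; real paths vary


def initial_burst(flows, initial_window):
    """Packets and bytes that can hit the network before any feedback arrives."""
    packets = flows * initial_window
    return packets, packets * MSS_BYTES


print(initial_burst(90, 10))  # (900, 1314000)
print(initial_burst(2, 10))   # (20, 29200)
```

Ninety uncoordinated flows can dump roughly 1.3MB into the path blind, while a couple of consolidated flows start with under 30KB and grow from there with real feedback.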

It also opens up the possibility of the browser identifying queue growth through delay based analysis and possibly helping the situation out by managing inside the browser our bulk TCP download rate (and definitely the upload rate) by munging the rwin or something like that. If it goes right it really shouldn't hurt throughput while giving better latency to other applications. It's all very pie in the sky and down the road, but it's kind of hard to imagine in the current HTTP/1.x world.

Friday, November 11, 2011

Video of SPDY Talk at Codebits.eu

Yesterday, I was fortunate enough to be able to address the codebits.eu conference and share my thoughts on why SPDY is an important change for the web. They have made the video of my talk available on-line. (I guess that saves me the air-mozilla brownbag - just skip the 3 minute community-involvement video near the beginning assuming you've seen it already)

codebits is full of vitality. Portugal is lucky to have it.




Friday, September 23, 2011

SPDY: What I Like About You.

I've been working on implementing SPDY as an experiment in Firefox lately. We'll have to see how it plays out, but so far I really like it.

Development and benchmarking are still a work in progress, though interop seems to be complete. There are several significant to-do items left that have the potential to improve things even further. The couple of anecdotal benchmarks I have collected are broadly similar to the page load time based reports Google has shared at the IETF and Velocity conf over the last few months.

tl;dr; Faster is all well and good (and I mean that!) but I'm going to make a long argument that SPDY is good for the Internet beyond faster page load times. Compared to HTTP, it is more scalable, plays nicer with other Internet traffic and brings web security forward.

#1: Infinite Parallelism with Shared Congestion Control.

You probably know that SPDY allows multiplexing of multiple HTTP resources inside one TCP stream. Unlike the related HTTP mechanism of pipelining concurrent requests on one TCP stream, the SPDY resources can be returned in any order and even mixed together in small chunks so that head of line blocking is never an issue and you never need more than one connection to each real server. This is great for high latency environments because a resource never needs to be queued on either the client or the server for any reason other than network congestion limits.

Normal HTTP achieves transaction parallelism through parallel TCP connections. Browsers limit you to 6 parallel connections per host. Servers achieve greater parallelism by sharding their resources across a variety of host names. Often these host names are aliases for the same host, implemented explicitly to bypass the 6 connection limitation. For example, lh3.googleusercontent.com and lh4.googleusercontent.com are actually DNS CNAMEs for the same server. It is not uncommon to see performance oriented sites, like the Google properties, shard things over as many as 6 host names in order to allow 36 parallel HTTP sessions.

Parallelism is a must-have for performance. I'm looking at a trace right now that uses the aforementioned 36 parallel HTTP sessions and its page load completes in 16.5 seconds. If I restrict it to just 1 connection per host (i.e. 6 overall), the same page takes 27.7 seconds to load. If I restrict that even further to just 1 connection in total it takes a mind numbing 94 seconds to load. And this is on 40ms RTT broadband - high latency environments such as mobile would suffer much much worse! Keep this in mind when I start saying bad things about parallel connections below: they really do great things, and the web we have with them enables much more impressive applications than a single connection HTTP web ever could.

Of course using multiple parallel HTTP connections is not perfect - if they were perfect we wouldn't try to limit them to 6 at a time. There are two main problems. The first is that each connection requires a TCP handshake which incurs an extra RTT (or maybe 3 if you are using SSL) before the connection can be used. The TCP handshake is also relatively computationally hard compared to moving data (servers easily move millions of packets per second, while connection termination is generally measured in the tens of thousands), the SSL handshake even harder. Reducing the number of connections reduces this burden. But in all honesty this is becoming less of a problem over time - the cost of maintaining persistent connections is going down (which amortizes the handshake cost) and servers are getting pretty good at executing the handshakes (both SSL and vanilla) sometimes by employing the help of multi-tiered architectures for busy deployments.

The architectural problem lies in HTTP's interaction with TCP congestion control. HTTP flows are generally pretty short (a few packets per transaction), tend to stop and start a lot, and more or less play poorly with the congestion control model. The model works really well for long flows like an FTP download - that TCP stream will automatically adapt to the available bandwidth of the network and transfer at a fairly steady rate for its duration after a little bit of acclimation time. HTTP flows are generally too short to ever acclimate properly.
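A rough way to see why short flows never acclimate is to count slow start rounds. This is an idealized sketch: it assumes the window doubles every round trip, no loss, no delayed ACKs, and an initial window of 4 packets, and the function name is made up for the example.

```python
def slow_start_rounds(total_packets, initial_window=4):
    """Round trips needed to deliver total_packets when the window doubles each RTT."""
    window, sent, rounds = initial_window, 0, 0
    while sent < total_packets:
        sent += window  # one window's worth goes out per round trip
        window *= 2     # idealized slow start growth
        rounds += 1
    return rounds


for size in (3, 10, 1000):
    print(size, "packets ->", slow_start_rounds(size), "round trips")
```

A 10 packet response is finished after 2 round trips, long before the window has grown to anything resembling the real path capacity, while a 1000 packet download gets 8 rounds of growth to find it.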

A SPDY flow, being the aggregation of all the parallel HTTP connections, looks to be a lot longer, busier, and more consistent than any of the individual parallel HTTP flows would be. Simply put - that makes it work better because all of that TCP congestion logic is applied to one flow instead of being repeated independently across all the parallel HTTP mini flows.

Less simply, when an idle HTTP session begins to send a response it has to guess at how much data should be put onto the wire. It does this without awareness of all the other flows. Let's say it guesses "4 packets" but there are no other active flows. In this case 4 packets is way too few and the network is underutilized and the page loads poorly. But what if 35 other flows are activated at the same time - this means 140 packets get injected into the network at the same time which is way too many. Under that scenario one of two things happens - both of them are bad:
  1. Packet Loss. TCP reacts poorly to packet loss, especially on short flows. While 140 packets in aggregate is a pretty decent flow, remember that total transmission is made up of 35 different congestion control blocks - each one covering a packet flow of only 4 packets. A loss is devastating to performance because most of the TCP recovery strategies don't work well in that environment.
  2. Over Buffering. This is what Jim Gettys calls bufferbloat. The giant fast moving 140 packet burst arrives at your cable modem head where the bandwidth is stepped down and most of those packets get in a long buffer to wait for their turn on your LAN. That works OK, certainly better than packet loss recovery does in practice, but the deep queue creates a giant problem for any interactive traffic that is sharing that link. Packets for those other applications (such as VOIP, gaming, video chat, etc..) now have to sit in this long queue resulting in interactive lag. Lag sucks for the Internet. The HTTP streams themselves also become non-responsive to cancel events because the only way to clear those queues is to wait them out - so clicking on a link to a new page is significantly delayed while the old page that you have already abandoned continues to consume your bandwidth.

This describes a real dilemma - if you guess more aggressive send windows then you will have a better chance of filling the network but you will also have a better chance of packet loss or over buffering. If you guess more conservative windows then loss and buffering happen less often but nothing ever runs very quickly. In the face of all those flows with independent congestion control blocks, there just isn't enough information available. (This of course is related to the famous Initial Window 10 proposal, which I support, but that's another non SPDY story.)
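The over buffering case can be made concrete with a quick drain time calculation. The 140 packet burst is the figure from above; the 1500 byte packet size and the 5 Mbit/s downstream link are assumptions picked for illustration.

```python
def queue_delay_ms(packets, link_mbps, packet_bytes=1500):
    """Time for a burst to drain through a bottleneck link, in milliseconds."""
    bits = packets * packet_bytes * 8
    return bits * 1000 / (link_mbps * 1_000_000)


print(queue_delay_ms(140, 5))  # 336.0
print(queue_delay_ms(4, 5))    # 9.6
```

A VOIP packet that arrives right behind that burst sits in the queue for about a third of a second, which is far more than interactive traffic can tolerate.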

I'm sure you can see where this is going now. SPDY's parallelism, by virtue of being on a single TCP stream, leverages one busy shared congestion control block instead of dealing with 36 independent tiny ones. Because the stream is much busier it rarely has to guess at how much to send (you only need to guess when you're idle, SPDY is more likely to be getting active feedback), if it should drop a packet it reacts to that loss much better via the various fast recovery mechanisms of TCP, and when it is competing for bandwidth at a choke point it is much more responsive to the signals of other streams - reducing the over buffering problem.

It is for these reasons that SPDY is really exciting to me. Parallel connections work great - they work so great that it is hard to have SPDY significantly improve on the page load time of a highly sharded site unless there is a very high amount of latency present. But the structural advantages of SPDY enable important efforts like RTCWeb as well as provide better network utilization and help servers scale when compared to HTTP. Even if page load times only stay at par, those other good for the Internet attributes make it worth deploying.


#2: SPDY is over SSL every time.

I greatly lament that I am late to the school of SSL-all-the-time. I spent many years trying to eke out the greatest number of server responses per watt that was possible. I looked at SSL and saw impediments. That stayed with me.

I was right about the impediments, and I've learned a lot about dealing with them, but what I didn't get was that it is simply worth the cost. As we have all seen lately, SSL isn't perfect - but having a layer of protection against an entire class of eavesdropping attacks is a property that should be able to be relied upon in every protocol as generic as HTTP. HTTP does not provide that guarantee - but SPDY does.

huzzah.

As an incentive to make the transition to SSL all the time, this makes it worth deploying by itself.

#3: Header compression.

SPDY compresses all the HTTP-equivalent headers using a specialized dictionary and a compression context that is reserved only for the headers so it does not get diluted with non-header references. This specialized scheme performs very well.

At first I really didn't think that would matter very much - but it is really a significant savings. HTTP's statelessness had its upsides, but the resulting on the wire redundancy was really over the top.

As a sample, I am looking right now at a trace of 1900 resources (that's about 40 pages). 760KB of total downstream plain text header bytes were received as 88KB compressed bytes, and upstream 949KB of plain text headers were compressed as just 65KB. I'll take 1.56MB (91% of the 1.71MB uncompressed total) in overhead savings! I even have a todo item that will make this slightly better.

Tuesday, February 15, 2011

Note to Self: The Web is Slow

I've made a living dealing with fast networks and servers that run at really impressive transaction rates using all manner of nifty interconnects and parallelism. Sometimes I forget that the day to day web isn't all that fast in comparison.

My local copy of Firefox is annotated to dump a bunch of network stats when shutting down. One of them is a CDF of HTTP handshake times. This is from my desktop, which is connected to a premium cable broadband consumer Internet service. It's not as awesome as FIOS, but it's still at the top portion of what a home consumer will have in the US, which in turn has certain geographic advantages when connecting to many hosting companies. It's fair to say my performance is going to be at least a bit better than the average Internet user. And it is still slow. (We of course need to work to be able to better characterize what the real spectrum of experience is.)

This isn't scientific. It is where I happen to browse, and it's just one datapoint although I can tell you my gut says it is pretty typical output - gathered over 15,000 connections.


Only about half of my handshakes are where I want them to be: < 100ms. Most of the rest fall in the next 300ms. To be fair there is a little skew in here because the code doesn't separate https from http, and SSL has an extra RTT in there. But SSL is a small fraction of the overall sample.
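If you want to produce the same kind of summary from your own measurements, an empirical CDF is only a few lines. The sample values below are invented for illustration (they are not my trace), and the function names are just for this sketch.

```python
from bisect import bisect_left


def fraction_below(samples_ms, threshold_ms):
    """Share of handshakes that completed in strictly less than threshold_ms."""
    ordered = sorted(samples_ms)
    return bisect_left(ordered, threshold_ms) / len(ordered)


def cdf_points(samples_ms, thresholds_ms):
    """(threshold, cumulative fraction) pairs, ready to print or plot."""
    return [(t, fraction_below(samples_ms, t)) for t in thresholds_ms]


# Hypothetical handshake times in milliseconds.
samples = [40, 55, 70, 85, 95, 120, 180, 250, 390, 900]
for threshold, share in cdf_points(samples, [100, 400, 1000]):
    print("< %4d ms: %3.0f%%" % (threshold, share * 100))
```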

And this is the desktop. Think mobile and wireless.

Latency matters.

Monday, February 14, 2011

The Apex of Pipelines

Every once in a while I'm still surprised at the potential upside of pipelines.

I stumbled across a great example recently: Women In Technology International. That home page is set up in a pretty typical newsletter format. It has 159 resources, 145 of which are images along with about a half dozen pieces of js and css. Most of the images are small, with over 2/3 of them loading in less than 20ms of transfer time (time to first byte removed).

What is striking about this page is how large of an advantage pipelining can give even on a well connected broadband desktop with a 100ms RTT to the witi hosting facility. The average latency to receive the first byte of a resource dropped from 1697ms to 626ms, and the average elapsed time per transaction overall dropped from 1719ms to 652ms. Aggregate that over 159 different resources and you have some serious gains!

But why stop there? The pipeline sweet spot is in high latency situations such as mobile, or transcontinental data transfer. This is what happens when we add 200 ms of latency to the connection:

That's right - 3300ms of improvement on each transaction! That seems absurdly good if we only added 200ms of latency, but what you're seeing is the aggregate queueing effect - Firefox wants 150 resources more or less simultaneously and can only parallelize it on 6 connections. If you are 25 positions deep on that queue you will have to wait at least 7500ms just for the back and forth of each transaction in front of you to complete. Obviously not everyone is queued that deeply so the average effect is somewhat less, but still overwhelming.
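The queueing arithmetic is worth spelling out. This sketch only models the round trip cost of the transactions ahead of you on one non-pipelined connection (no transfer time, no server think time), and the function names are made up for the example.

```python
def queue_depth(resources, connections):
    """How many transactions end up lined up on each connection."""
    return resources // connections


def min_wait_ms(position, rtt_ms):
    """Lower bound on the wait: one full round trip per transaction ahead of you."""
    return position * rtt_ms


depth = queue_depth(150, 6)
print(depth)                          # 25
print(min_wait_ms(depth, 100 + 200))  # 7500
```

With pipelining those requests go out back to back instead of each waiting a full round trip, which is where the per-transaction improvement comes from.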

Wednesday, February 2, 2011

HTTP Parallel Connections (Firefox edition!)

Parallelism helps when
    • It hides network idleness during TCP handshakes, though persistent connections help with this too.
    • It hides network idleness during the first byte phase of transactions, though pipelining can address this too.
    • It hides network idleness during TCP slow start wait-for-ack periods. This is a big one.
    • It provides a mechanism to prioritize and avoid head of line blocking problems. 
    • It steals bandwidth from competing "tcp friendly" flows by simply increasing the number of flows in one application. That's an arms race that most people think should be avoided.

Parallelism hurts when
    • It increases the number of TCP handshakes, which are both slow and CPU intensive (at least compared to regular data packets) to execute - this assumes persistent connections are an alternative.
    • It increases the overhead of normal data processing because more flows have to be considered, typically via longer hash chains.
    • It increases the impact of memory overhead and processor cache pollution by increasing the number of simultaneous TCP control blocks that have to be managed on both the client and the server.
    • The resulting reduced amount of data per flow makes it harder to fully open sender congestion windows.
    • Packet loss is increased due to the non correlated fluctuations of data to be sent between the parallel connections. Two competing flows that are both sending from infinite data sources will quickly adapt to share the bandwidth, but two flows that have a fluctuating demand (e.g. parallel persistent HTTP connections that periodically go idle and alive) will inherently have patterns of underutilizing and overutilizing the path. Overutilization results in either packet loss or excess buffering in the network, which leads to poor interactive response times.
When should we open a new parallel connection?
  1. when I don't have an idle connection and I need the answer with minimum latency
  2. when I expect existing connections are experiencing idleness and therefore not using all of the available bandwidth
The approach HTTP implementations, including Firefox, take to solving this quandary? They crudely enforce a constant number of connections per host and open them until they hit that limit. Variously across time that limit has commonly been 2, 4, and 6. Server technology evolved many years ago to the point where the impact of idleness was a bigger deal than the CPU overhead on the server, and so we saw servers actually publish their resources under several virtual host names, even though it was all the same server, for the exclusive purpose of circumventing that per-host limit in the client.

I wonder if we can't do better in Firefox. First, let's deal with the case of a low latency request. Right now all we do with them is to put them at the top of the waiting queue if the request cannot be dispatched immediately (because the limit of 6 has already been reached). But there are really two cases to consider:
  1. What to do when the network is not already saturated
  2. What to do when the network is saturated
In both cases the first step for a truly low latency request is the same - open a new connection assuming there isn't an idle one available. However, note that establishing that connection is going to take at least 1 RTT for normal HTTP and 2 RTT for HTTPS - so we should actually watch for any existing HTTP transactions to complete on a different reusable persistent connection in between the time we start opening the new connection and the time the handshake is complete. If that happens the persistent connection should be used instead - that will require a change in the current logic where nsHttpConnection opens the sockets after it has been assigned a transaction. Instead nsHttpConnectionMgr should be opening the sockets as well as receiving the returned persistent connections and then should dispatch to them as they become available.
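The "dispatch to whichever becomes available first" idea can be sketched as a race between two events. This is a toy simulation in Python, not Necko code: the coroutine names are hypothetical and the handshake and the in-flight transaction are both faked with sleeps.

```python
import asyncio


async def finish_handshake(handshake_s):
    """Stand-in for a new socket completing its TCP (and SSL) handshake."""
    await asyncio.sleep(handshake_s)
    return "new"


async def persistent_returned(busy_s):
    """Stand-in for an in-flight transaction finishing on a reusable connection."""
    await asyncio.sleep(busy_s)
    return "reused"


async def dispatch(handshake_s, busy_s):
    """Send the waiting request on whichever connection is usable first."""
    tasks = [
        asyncio.ensure_future(finish_handshake(handshake_s)),
        asyncio.ensure_future(persistent_returned(busy_s)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # simulation only; a real manager would keep the new socket
    ready = {task.result() for task in done}
    # If both are ready together, prefer the already warmed up connection.
    return "reused" if "reused" in ready else "new"


def choose_connection(handshake_s, busy_s):
    return asyncio.run(dispatch(handshake_s, busy_s))


print(choose_connection(handshake_s=0.2, busy_s=0.01))  # reused
print(choose_connection(handshake_s=0.01, busy_s=0.2))  # new
```

The point of the design is that the socket opening and the request assignment are decoupled, so the request is never stuck waiting on the slower of the two.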

In the case of a saturated network some of the existing parallel connections should be stalled while the low latency request is satisfied in order to provide the most bandwidth for that important transaction. We can do this by temporarily slamming their recv windows to something close to 1 packet of data which will slow them down to a trickle. This can be done commensurately with the transmission of the prioritized request as it should take 1/2 RTT for the window change to reach the sender.
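Why a 1 packet window amounts to a trickle follows from the usual window/RTT bound on TCP throughput. The numbers here are assumptions for illustration: a 1460 byte packet, a 64KB window for comparison, and a 100ms RTT.

```python
def window_limited_rate(window_bytes, rtt_ms):
    """Upper bound on throughput, in bytes per second, for a given receive window."""
    return window_bytes * 1000 / rtt_ms


print(window_limited_rate(1460, 100))   # 14600.0
print(window_limited_rate(65535, 100))  # 655350.0
```

A stalled connection is capped at roughly 14KB/s, which leaves nearly all of the link to the prioritized request.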

But what about the more common case where all transactions are of equal priority - how do we make the decision then about opening a new connection vs queueing a new transaction? Assuming we aren't concerned about head of line blocking issues (which we should be able to wrap up in a definition of priority somehow), then we want to do this only when there is network idleness that can be covered up by parallelism. This approach is radically different than "open up to N" connections.

It isn't obvious exactly how to determine that in Necko. But then again, you are looking for data bursts followed by idleness - and it's pretty obvious when you see it graphed out. This is the transfer pattern of a single http response I looked at a couple of weeks ago - it could happily overlap with another flow in order to more effectively utilize the whole pipe. (Of course, if the server used a larger initial CWND, the problem would be massively reduced.)