Tuesday, July 21, 2009

Caution - (S)Low Bridge Ahead

This post will not be satisfying. Someone has posted some great datapoints about virtualized packet forwarding, which is great. But they don't make a lot of sense. Which is not great. Nor is it satisfying.

Oh well, I'm sure there will be a followup sometime in the future.

In this thread, Or Gerlitz posts a new networking type for qemu (and by extension) kvm which are of course popular linux host virtualization packages. The networking type is "raw" and the driver couldn't be more simple - a (v)lan interface on the host is opened with a AF_PACKET socket and all of the packets that appear there are shoved through to the guest interface, and vice versa.

This is a pretty direct way of doing things, but it has the unfortunate side effect that all of the guests and the host itself are aggregated onto one upstream switch port without any kind of bridge, switch, or router in between. This means that unless the upstream switch can do a u-turn when forwarding (and most of them will not), all of the guests and the host are isolated from each other. The normal way of doing things is to attach the guests and host together with a tun/tap socket and run a bridge on host. This bridge will do all the necessary forwarding so that everybody has full connectivity, and it lets you run iptables and ebtables on the host to boot.

That's all well and good, but the really interesting part was the motivation for running around tun/tap/bridge anyhow: the poster runs a test with short udp transmissions over gige.. running it between two real (non-vm) hosts he sees 450K packets per second. The post doesn't mention what hardware is involved, so we'll just take it as a black box baseline. Switching the sender to be a qemu guest with traditional tap/bridge networking it plummets to just 195K. The "raw" interface gets that back up to 240K - which is still a far cry from 450, eh?

Tap mode has 3 times the context switches than the raw version. I don't think I saw a number for the non-vm test. Other than that nothing, including the profiles, really jumps out.

The whole thread is worth reading - but the main data points are here and here

Monday, July 20, 2009

What is eating those Google SYN-ACKs?

In this post, I mentioned google was seeing huge packet loss on syn-acks from their servers. At the time it looked like 2%. That sounded nuts.

It still sounds nuts.

Someone else on the mailing list posted about that, and Jerry Chu of Google confirmed it:

Our overall pkt retransmission rate often goes over 1%. I was
wondering if SYN/SYN-ACK pkts are less likely to be dropped
by some routers due to their smaller size so we collected traces
and computed SYN-ACK retransmissions rate on some servers.
We confirmed it to be consistent with the overall pkt drop rate,
i.e., > 1% often.

You could imagine why the overall retransmission rate might be higher than the real drop rate due to jitter and various fast retransmit algorithms that might retransmit things that just hadn't been acknowledged quite yet. Even SYNs might be dropped at the host (instead of the network) due to queue overflows and such.. but we're talking about SYN-ACKs from busy servers towards what one would expect would be pretty idle google-searching clients. And these SYN-ACKs have giant timeouts (3 seconds - which is why Jerry was writing in the first place) so it certainly isn't a matter of over-aggressive retransmit. The only explanation seems to be packet loss. At greater than 1%


This probably has more to do with the global nature of google's audience than anything else. But still, TCP can really suck at loss rates that high. It must be very different than the desktop Internet I know (which is a fair-to-middlin cable service, not a fancy Fiber-To-The-Home setup which is becoming more common.)

I wonder exactly where those losses happen.

Tuesday, July 14, 2009

Google Thinks TCP should be more Aggressive by Default

Really interesting post from Jerry Chu of Google. He says Google has data which shows that we ought to lower the initial RTO, increase the initial CWND, drop the min RTO, and reduced the delayed ack time out in TCP.

Based on my own anecdotal data, I've done stuff like that in products I've worked on. Let's face it - 3 seconds is a freaking eternity. Processors, networks, and busses have all scaled but these constants remain the same. Jerry says Google has data that shows this is important. As the google data set is no doubt much more extensive than any I worked with, that's a really welcome post.

Probably the most important data point Jerry shares is that "up to a few percentage points" in his data set exhibit a SYN-ACK retransmission from the google servers. Wow. (at least) 1 in 50 syn-acks needs to be retransmitted? That's not my experience at all, and if true on google scale it is absolutely fascinating. Are they generally seeing 2% packet loss on google tx? There's no way that they are seeing that.. google would appear to suck! So what's going on... ? Why is syn-ack rexmitted more than anything else? (and I'm assuming they are indeed lost, because otherwise lowering the timeout wouldn't be the right remedy..)