Wednesday, June 20, 2007

More Characterization - DNS Latency

A little while back I posted about the incoming TCP handshake latencies on my boutique broadband mail server. In short, they were awful - 184ms median and 77% of all handshakes took more than 100ms. A few hypotheses were drawn:
  • A lot of my mail is spam. Much of the spam comes from botnet owned hosts on consumer internet connections. Because of this I am not so much measuring latency from my edge to core Internet services, but instead to other edges. If this is true, it actually has some interesting peer to peer insights.
  • Because we are likely dealing with "owned" botnets sending the spam, those hosts are distributed more uniformly across the world than the services I actually choose to use day to day which exhibit greater locality to my part of the world. Therefore I am getting real data, but maybe not data that is especially insightful to my day to day network usage.
  • My results may not be reflective of general edge connectivity; I might just have lousy service.
  • The handshake latency should be dominated by the network, but the application at the other end plays a role too. Owned spam generators may not be running up to the commercial grade mail client standards generally expected of Internet infrastructure.
I posed the open question of whether the results would be the same for other protocols. DNS and HTTP were of particular interest. This post is about DNS client performance.

I have now built a packet trace that covers 6 days and about 130,000 DNS request/response pairs. I was astonished there were so many in less than a week from a home LAN. The trace was taken upstream from my home LAN caching recursive resolver - so the redundancy was removed from the data where TTL based caching could do so. The 130,000 transactions were spread across 8854 different servers - this was also an astonishing amount of diversity. It comes down to just 14 transactions per server on average; I would have expected a lot more server reuse.

The results are indeed better than the SMTP handshakes. We are now dealing with infrastructure class servers and it shows. But frankly, the latency numbers are still surprisingly high - 41% of all lookups still take more than a very noticeable 100ms. Remember, also, that starting many webpages requires at least two uncached lookups (one from the root name servers and one from the zone's name server itself) - that can be a really long lag.

Here are the numbers, ranging from a best of 24ms, to a worst of 17 minutes. That latter number is certainly an outlier. 99+% of all transactions were complete in 401ms or less.

percentile   latency (ms)
best         24
10           41
20           43
30           55
40           71
50           77
60           101
70           114
80           128
90           180
99           401
worst        1,039,356 (17 minutes!)
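A table like this can be produced from a plain list of per-transaction latencies. The following is only a minimal sketch, not the awk that was actually used: it assumes the nearest-rank definition of a percentile (the post does not say which definition was applied), the names `percentile` and `summarize` are made up for illustration, and the sample values are invented rather than taken from the trace.

```python
def percentile(sorted_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_ms:
        raise ValueError("no samples")
    # Integer ceiling of p * n / 100, avoiding floating point rounding.
    rank = max(1, (p * len(sorted_ms) + 99) // 100)
    return sorted_ms[rank - 1]


def summarize(latencies_ms):
    """Return (label, latency) rows in the same layout as the table."""
    data = sorted(latencies_ms)
    rows = [("best", data[0])]
    for p in (10, 20, 30, 40, 50, 60, 70, 80, 90, 99):
        rows.append((str(p), percentile(data, p)))
    rows.append(("worst", data[-1]))
    return rows


# Invented sample latencies in milliseconds, for illustration only.
sample = [30, 45, 50, 60, 75, 80, 105, 120, 130, 200]
for label, value in summarize(sample):
    print(f"{label:<12} {value}")
```

With only a handful of samples the high percentiles collapse onto the worst value; with something like 130,000 samples the 99th percentile and the worst case separate clearly, as they do in the table.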

Any lookup that did not complete was not included in the dataset. That 17 minute one is fascinating - it is a genuine reply to the original lookup request (probably generated by the spam filtering software, which had received a legitimate message from someone and was checking whether the name resolved as part of its spam scoring) - it was not a reply to a client generated retransmission or anything like that. That request must have been buried in quite a queue somewhere! It is hard to imagine that the DNS client hadn't timed out the transaction by the time the response arrived, but the packet trace does not give any insight into that. These very long transactions are exceptionally rare - only 46 of the 130,000 transactions took more than 3 seconds.

  • As you see, the median is 77ms
  • The mean is 112ms (104ms with the 17 minute outlier removed from the data)
  • 41 percent of all transactions took over 100ms
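The effect of that single outlier on the mean is easy to check with back-of-the-envelope arithmetic. This sketch assumes the transaction count is exactly 130,000 (the post only says "about") and uses the rounded 112ms mean, so it is an approximation rather than a recomputation from the trace.

```python
# Rough check of how much one 17 minute reply moves the mean.
n = 130_000              # assumed exact; the post says "about 130,000"
mean_all_ms = 112        # reported mean, including the outlier
outlier_ms = 1_039_356   # the worst-case latency from the table

total_ms = mean_all_ms * n
mean_without_ms = (total_ms - outlier_ms) / (n - 1)
outlier_share_ms = outlier_ms / n   # what the outlier alone adds to the mean

print(round(mean_without_ms))       # 104
print(round(outlier_share_ms, 1))   # 8.0
```

One reply out of roughly 130,000 accounts for about 8ms of the mean, which is why the median is the more trustworthy summary here.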
For the curious, I generated the latency numbers using this little ad-hoc piece of C, linked off this page of wonder. The stats were just done with command line awk scripts.
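The C program itself is not reproduced here, but the general idea of turning a packet trace into latencies can be sketched. Everything below is an assumption for illustration: the `DnsPacket` record, its fields, and the `pair_latencies` helper are hypothetical, the packets are presumed to be already parsed out of the capture, and matching on server, client port, transaction ID and query name is one reasonable choice rather than necessarily what the original code did. Latency is measured from the first transmission of a request, and requests that never get a reply are simply dropped, in line with the dataset described above.

```python
from typing import NamedTuple


class DnsPacket(NamedTuple):
    ts: float          # capture timestamp, in seconds
    is_response: bool  # False for a query, True for a reply
    server: str        # address of the remote DNS server
    client_port: int   # UDP source port used by the resolver
    txid: int          # 16-bit DNS transaction ID
    qname: str         # name being looked up


def pair_latencies(packets):
    """Match each reply to the first transmission of its request and
    return the latencies in milliseconds. Unanswered requests are
    left out of the result."""
    pending = {}
    latencies_ms = []
    for pkt in sorted(packets, key=lambda p: p.ts):
        key = (pkt.server, pkt.client_port, pkt.txid, pkt.qname.lower())
        if not pkt.is_response:
            # Keep the timestamp of the original request; a later
            # retransmission with the same key does not reset it.
            pending.setdefault(key, pkt.ts)
        elif key in pending:
            latencies_ms.append((pkt.ts - pending.pop(key)) * 1000.0)
    return latencies_ms


# Invented packets: one answered lookup with a retransmission,
# one quick lookup, and one lookup that never completes.
trace = [
    DnsPacket(0.0, False, "192.0.2.1", 4000, 1, "example.com"),
    DnsPacket(1.0, False, "192.0.2.1", 4000, 1, "example.com"),
    DnsPacket(1.5, True, "192.0.2.1", 4000, 1, "example.com"),
    DnsPacket(10.0, False, "192.0.2.2", 4001, 2, "example.org"),
    DnsPacket(10.25, True, "192.0.2.2", 4001, 2, "example.org"),
    DnsPacket(20.0, False, "192.0.2.3", 4002, 3, "example.net"),
]
print(pair_latencies(trace))
```

The resulting list of millisecond values is the kind of input that the percentile and mean calculations work from.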

What does it Mean?

I can draw some weak conclusions:
  • DNS latency is much better than TCP handshake latency on my mail server - indeed almost twice as good. It seems likely that is because DNS is dealing with infrastructure class servers (both in terms of location and function), whereas much of the email traffic was probably botnet generated spam out at the edges of the network - just like my host. So much for net neutrality eh, there are already multiple tiers of service in full effect!
  • Latency still sucks. 100ms round trips are deadly and common.
It will be interesting to see how the HTTP client handshake latency numbers compare. The DNS numbers are skewed somewhat away from the common usage patterns of edge users because they include lookups of email domains from spam, mailing list contributors, etc. The HTTP numbers ought to be more pure in that respect.
