by Roan Kattouw and Max Meyers
In their paper “TCP Fast Open” (CoNEXT 2011; see also this LWN article about Linux’s implementation of it), Radhakrishnan et al. present a modification to TCP that allows the party opening the connection to send data in the SYN packet. This is a useful performance enhancement for request-response protocols: the client can put the first segment of the request in the SYN, rather than sending it after the three-way handshake is complete, saving one RTT.
The results in the paper focus specifically on HTTP requests on the web. These especially stand to gain from TCP Fast Open (TFO). HTTP requests are short and typically fit entirely in the SYN packet. This means the server can start sending the response immediately after sending the SYN-ACK, without having to wait for the client to ACK the SYN-ACK. Note that at this point, the three-way handshake isn’t complete yet, but the first segment of the response is already in flight! Furthermore, the process of loading a web page typically involves many short flows where some flows block on others (for instance, requests for resources can only be initiated after the browser has received the HTML telling it which resources to fetch), so saving one RTT per flow measurably speeds up this process.
Radhakrishnan et al. analyzed TFO performance for loading web pages by replaying page loads for four popular web sites: Amazon, New York Times, Washington Post and Wikipedia. They measured the page load time for each of these sites, with and without TFO, at various simulated RTTs. They found that TFO decreases page load time, and that the speedup TFO achieves (as a percentage of the non-TFO time) increases with the RTT. They also measured the server-side overhead of TFO in terms of CPU utilization, and found that it is negligible and sometimes even negative (because there are fewer packets to process), and that the maximum request rate is actually higher when using TFO.
Subset goal and motivation
We set out to reproduce the results in Table 1 in the paper: TFO speeds up web page loads, and the speedup increases with the RTT. To reproduce these results, we wanted to capture and replay the HTTP requests for at least one web page, and measure the load time with and without TFO at different RTTs. We thought this was the most significant and interesting result in the paper. The other results are mostly about feasibility (can you do TFO without melting the server, and what security precautions are needed to keep TFO from opening new attack vectors), but this result is about the benefits: the reason people are researching TFO in the first place.
As a secondary goal, we were interested to see whether TFO would actually work in the wild, i.e. on hosts connected to each other through the real internet rather than a local network. In principle this should just work, but we wanted to check in case middleboxes choked on the unusual TCP traffic. Using hosts in different locations would also let us use real RTTs rather than simulated ones.
We first wrote a simple client program and a simple server program. These are in our git repository as tfoserver.c. For our first experiment, we ran them on the loopback interface and analyzed the network traffic in Wireshark to verify that TFO worked. What we saw was consistent with the implementation details in the paper: the first connection does not have data in the SYN because the client doesn't have a cookie yet, but in all subsequent connections we saw the client put as much data as would fit in the SYN packet. When we repeated this experiment with the client and server on different hosts, we saw the same behavior, which showed that TFO does indeed work on the real internet. In the second experiment we also saw the server respond before the client had completed the three-way handshake. We didn't see this in the first experiment, but we believe that to be due to the very low RTT on the loopback interface.
To conduct measurements with web page loads, we wrote a client program that takes a specification of a request flow, as well as a server program that takes a specification of the matching responses. The formats of these specifications are documented in the README. We loaded a Wikipedia page (specifically, the article about Stanford) and the Amazon home page in Google Chrome, captured the traffic in Wireshark, and converted this data to request and response specs for these programs. These data sets are in our git repository as well, in the wikipedia and amazon directories. The Wikipedia dataset consists of 56 HTTP requests across 16 TCP connections (the browser strikes a balance between reusing connections and issuing parallel requests), with an average of 556 bytes per request and 14 KB per response, transferring a total of 830 KB. The Amazon dataset consists of 72 HTTP requests across 21 connections, with an average of 623 bytes per request and 9 KB per response, transferring a total of 705 KB.
We set up Amazon EC2 VMs in four locations: Northern California, Oregon, Northern Virginia and Ireland. We ran a server on the Northern California VM and ran a client in each of the other locations. The client pings the server 10 times to measure the RTT, replays the Wikipedia page load 20 times without TFO, then once with TFO to warm up (let the client get a cookie), then 20 times with TFO. It then repeats this process for the Amazon page load. We measured the time to completion for each replay, and averaged within each group of 20 (the warm-up run isn't counted).
|Page||Client||Server||RTT (ms)||PLT (s)|
Our results resemble those in the paper in that TFO provides a benefit, but unlike the paper we don't see a strong relationship between the RTT and the speedup. We believe this might be due to higher variance in the RTT and other network conditions that comes with using a real network: we noticed higher variance in ping times and completion times than the authors of the paper did. We also don't know the details of the traffic replayed to produce the results in the paper; there might have been more or fewer requests in their playback than in ours.
TCP Fast Open is a fairly new invention, and is only fully supported in Linux 3.7 (3.6 has client-only support). EC2 instances come with Linux 3.2 by default, so we had to get a custom kernel running in EC2. This required a bit of research but ultimately wasn't hard to do, and we got Linux 3.7.9 running on our VMs fairly easily. The only problem we encountered was that the system is configured to put the grub configuration in /boot/grub/grub.cfg (so the kernel is added to that file when dpkg installs the kernel packages), but the version of grub that Amazon's AKIs use internally looks at /boot/grub/menu.lst. This means the newly installed kernel has to be added to /boot/grub/menu.lst manually before it will boot.
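For reference, the entry that has to be added to /boot/grub/menu.lst looks roughly like this; the kernel version matches ours, but the root device and console arguments are illustrative and depend on the AMI:

```
title  Linux 3.7.9 (custom, TFO-capable)
root   (hd0)
kernel /boot/vmlinuz-3.7.9 root=/dev/xvda1 ro console=hvc0
initrd /boot/initrd.img-3.7.9
```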
Unfortunately, this does mean that the results can’t just be reproduced by cloning our git repository and running the script, but only by either using our AMIs (which have to be instantiated with the appropriate AKI to make the custom kernels work) or going through a relatively laborious setup process.
A second EC2 problem is that EC2 does not allow transferring any sort of resources between locations. So in each location, we had to create a new key pair, a new security group, and build a Linux 3.7.9 VM from scratch, because we couldn’t share key pairs or security groups between locations and couldn’t image a VM in one location and clone it to another. There are third-party scripts that do the latter, but they require setting up the EC2 API tools and getting an API key and need to transfer gigabytes of data (because they download an AMI to your machine and reupload it); at that point it was easier to just recreate the VMs from scratch. These limitations made setting up VMs in four locations relatively laborious. For reproducing the results, this shouldn’t cause too much trouble, except that there are four AMIs to instantiate, one per location.
There are a few issues with TFO support even in Linux 3.7. For one, it's disabled by default (!), and has to be enabled as root by writing the value 3 to /proc/sys/net/ipv4/tcp_fastopen (the value is a bitmap: 1 for client support plus 2 for server support). Furthermore, the MSG_FASTOPEN constants aren't actually defined in the header files, so we had to define them ourselves after figuring out the right values.
We chose to use real VMs in different locations rather than a simulation in Mininet because we wanted to build a more realistic test case than the authors did. However, this does make reproduction slightly more complex because of the unhelpful way locations are siloed in EC2, and it causes higher variance in the measurements due to real-world network conditions. Another complicating factor is the use of a custom kernel, but that’s unfortunately required for TFO to work at all.
The data set we built for Wikipedia reflects the load dependencies fairly well. We identified six stages in Chrome’s loading of the page and reflected those in our request specification. The Amazon page had a less clear dependency chain, so for simplicity we specify it as a two-stage process, with the main HTML page in the first stage and all other requests in the second stage, with the second stage blocking until the first 4000 bytes of the HTML have been received.
Instructions for reproduction
The simplest way to reproduce our setup is by launching our AMIs. Launch each of the following AMIs in the specified location with the specified AKI (“Kernel ID” or “Kernel image”):
|us-west-1 (N. California)||ami-464c6103||aki-f77e26b2|
|us-east-1 (N. Virginia)||ami-44ff612d||aki-88aa75e1|
NOTE: You must use the specified AKI, otherwise the custom kernel won’t boot.
For each instance, edit its security group to open up TCP ports 12345 and 12346 and allow all inbound and outbound ICMP traffic.
Once all the instances are up, SSH into them and verify the kernel version by running uname -a. The kernel version should be 3.7.9, not 3.2.0; if it's 3.2.0, you probably didn't set the right AKI.
Choose one instance to run a server on. In our experiment, we ran the server on the us-west-1 instance. Edit run.sh on each instance, and put the DNS name of the server instance in the SERVER variable on line 4. Then run server.sh on the server, and run run.sh on each of the clients. You will find these files in the
For instructions on how to set up VMs without using our AMIs, see the README file in our git repository.