by Sudarshan Srinivasan and Romil Verma
Over the past few decades, TCP has become the gold standard for reliable, in-order delivery of data, and several real-world protocols, including HTTP, FTP, and SMTP, are built on it. However, TCP is not particularly efficient in terms of latency, especially for short-lived connections: its handshake consumes one RTT, and servers cannot start sending responses until the handshake finishes.
Thus, under normal conditions, most HTTP requests require two RTTs to be serviced, with no protocol data being sent in the first RTT. Figure 1 illustrates this graphically. The problem is exacerbated by the fact that most web objects tend to be small enough to be sent within a few RTTs after the handshake. To reduce its impact, HTTP 1.1 introduced persistent connections, which can be reused to fetch multiple objects. However, Google found that modern browsers open many connections despite this: an analysis of Chrome statistics from 2011 showed that 33% of requests are made on a new connection despite support for connection persistence. This is due to several factors. First, browsers open parallel connections to accelerate page loads. Second, NATs terminate idle connections unless power-hungry keepalives are sent periodically. Third, mobile browsers tend to close idle connections early to save power.
This repeated handshaking causes a significant increase in web request latency, especially on connections with high RTTs. Such connections are becoming more common, with RTTs reaching hundreds of milliseconds for 3G/4G connections. According to Google, handshakes account for 8%-28% of the latency for cold requests, and 5%-7% of the latency even amortized across all requests.
This problem cannot be fixed merely by allowing clients to send data in their SYN packets. Such a solution would introduce security vulnerabilities: attackers could send SYNs with spoofed source IP addresses to trick the server into performing malicious actions. This attack is illustrated in the figure below. Regular TCP does not permit it, because the server does not begin processing data until the handshake completes, and an attacker cannot complete a handshake from a spoofed source IP: the server's initial sequence number, sent in the SYN-ACK, never reaches the attacker, so it cannot be echoed back correctly.
TCP Fast Open
In 2011, Radhakrishnan et al. proposed a mechanism called TCP Fast Open (TFO) to mitigate this delay by allowing clients to send data in the initial SYN packet, with a cookie-based authentication mechanism to prevent denial-of-service attacks. They demonstrated significant improvements in web page load times, especially on links with high RTTs.
TCP Fast Open revolves around a “cookie”: the client’s IP address encrypted under a secret key known only to the server. When a TFO-enabled client connects to a TFO-enabled server, it sets the “cookie request” option in its SYN packet. The server replies with a SYN-ACK containing the cookie for that client, and the handshake then proceeds as normal. On future connections, the client sends the cookie as part of the SYN packet together with the first few bytes of data. The presence of the correct cookie indicates that the connection request is genuine, since the cookie cannot be forged without the server-side secret key. The server can then begin processing immediately without waiting for the handshake to complete. This is illustrated in figure 2.
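The client side of this exchange can be exercised from the command line. The sketch below assumes a Linux host and curl 7.49 or newer (which added the `--tcp-fastopen` flag); the target URL is just an example of a TFO-capable server, and the network commands are shown commented out since they require connectivity:

```shell
# Check that the client-enable bit (0x1) of the TFO sysctl is set:
tfo=$(cat /proc/sys/net/ipv4/tcp_fastopen 2>/dev/null || echo 0)
[ $((tfo & 1)) -ne 0 ] && echo "client-side TFO is enabled"

# Ask curl to send its request data in the SYN; it falls back to a normal
# three-way handshake if the server does not support TFO:
# curl --tcp-fastopen -o /dev/null https://www.google.com/

# The kernel caches cookies per destination; cached entries can be viewed with:
# ip tcp_metrics show | grep fo_cookie
```

Running the curl command twice against a TFO server shows the mechanism end to end: the first connection fetches the cookie during a normal handshake, and the second sends data in the SYN.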
Results from the original paper
The authors implemented their proposal in the Linux kernel. The implementation is available to all programs running atop kernel 3.6 or newer; however, a few minor source changes are needed to take advantage of TFO. Modern versions of Google Chrome support TFO both on the desktop and on Android, as depicted in figures 3 and 4.
The authors downloaded pages off the internet and served them using Google Chrome’s page-replay tool. They used dummynet to simulate connections with low bandwidth and varying delays, and examined page load times both with and without TFO. They summarized their results in Table 1 of the paper, which we reproduce in figure 5. The results suggest that TFO improves page load times by roughly 5% to 18%, with specific pages improving by as much as 41%. The size of the improvement depends largely on how complex the page is and how large its individual elements are, with simpler and smaller pages benefiting more, since they spend a larger fraction of their time waiting for handshakes rather than downloading content. They also examined the impact of TFO on server CPU load and found it negligible.
Our main goal for PA3 is to reproduce Table 1 (a copy of which appears in figure 5 above). This is the most important result in the paper: the main goal of TFO is to reduce page load time, and this table captures the reduction achieved by using TFO on a variety of pages with varying RTTs. The other graphs illustrate either the impact of TFO on CPU load (which was found to be negligible) or the impact of RTT on latency in general (which is in some sense captured by the improvements seen in Table 1).
Our results are summarized in the table above. We, too, observed an improvement with TCP Fast Open; however, our numbers differ from the original paper’s. This is to be expected, since the web evolves over time and pages are updated frequently. The increase in page load times relative to the original paper is due to the massive growth in the amount of content served by these pages over the last half decade (the paper is from 2011). In particular, several large images have been added to Amazon’s page since then, as we show below. We observed that the percent improvement due to TCP Fast Open is strongly affected by the number of resources loaded per connection and by their size. This is expected, since TCP Fast Open saves one RTT for each connection established; savings are therefore more significant when more connections are needed to retrieve the content.
Instructions to Reproduce:
Set up a c4.large instance on EC2 using the official Ubuntu 14.04 LTS HVM image (for example, ami-5189a661 in the Oregon region); a screenshot is shown below. Please click on the screenshot to zoom in. This image should appear right on the Launch Instance screen (under Quick Start, since we are not using a custom AMI).
Once the instance is running, ssh to it (default user ubuntu, ssh key auth with no password)
- install git using apt-get:
sudo apt-get install git -y
- clone the repository at https://bitbucket.org/sudarshans/cs244-assign-3.git to your home directory (note that the path is important: the repository must be cloned into a folder named cs244-assign-3 in the home directory). For instance, this can be done via:
cd ~; git clone https://bitbucket.org/sudarshans/cs244-assign-3.git
- cd to the directory and execute run.sh to regenerate the results. Note that this will take about 45 minutes; please plan accordingly. This can be done using:
cd ~/cs244-assign-3; ./run.sh
- Open table_results.csv in the cs244-assign-3 directory to inspect the results and relative improvements.
Note that slight variation in the results is normal: you will not get exactly the same numbers, due to variance in EC2 NGiNX I/O performance, randomization in download order, and randomization in the dummynet emulator (which uses kernel timers to simulate delays). This is in line with what was found in previous years as well: in 2014, Laura Garrity and Suzanne Sthatos stated, “We found that on occasion, TFO showed no improvement (or was a smidgen slower). As a result, we feel that it is best to average across multiple runs.” Curran Kaushik likewise reported extensive noise in another reproduction.
However, your results should look fairly similar to ours, since we take the mean across 12 trials.
You should also be able to follow the same instructions to reproduce the results on any virtual or physical machine of your choice running Ubuntu 14.04 LTS server. However, we can only make guarantees about a c4.large instance on EC2 (though we have tested on other cloud providers and on VMware-backed virtual machines). Note that VMs running on a desktop or laptop tend to be noisier than EC2.
We used dummynet to simulate connections with limited bandwidth and configurable RTT on the loopback link. Dummynet was selected over mininet despite our experience with the latter for two reasons:
- The original authors used dummynet, and we wanted to re-create their experimental setup as closely as possible
- Dummynet trivializes the creation of asymmetric links similar to what the original authors used
Dummynet consists of two parts: a packet classifier and a link emulator that can emulate multiple “pipes” in parallel. The packet classifier captures packets that match one or more configurable conditions and pushes them into pipes. The pipes themselves delay and rate-limit packets to emulate slow links. The figure below shows the setup we used.
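As a concrete illustration, a two-pipe dummynet setup of this shape can be expressed in ipfw syntax. The rule numbers, rates, delays, and queue sizes below are placeholders for illustration, not the values used by run.sh:

```shell
# Classify traffic to and from the local server port into one pipe per direction:
ipfw add 100 pipe 1 tcp from any to any 8000 out   # client -> server
ipfw add 200 pipe 2 tcp from any 8000 to any out   # server -> client

# Emulate an asymmetric link with a 100 ms RTT (50 ms each way) and enough
# queueing to avoid drops at the configured rates:
ipfw pipe 1 config bw 256Kbit/s delay 50ms queue 50
ipfw pipe 2 config bw 4Mbit/s delay 50ms queue 50
```

Because each direction gets its own pipe, the uplink and downlink can be rate-limited independently, which is what makes asymmetric links trivial to create.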
Pages were downloaded off the web using wget (with the user agent set to Google Chrome’s, to ensure that the version of the page served matched what Chrome would see), using the -H and -p arguments to download all components needed to render the page, including images. The page contents were then gzipped to simulate HTTP compression, which modern browsers including Chrome use when connecting to most websites. Finally, the page contents were overwritten with random bytes: since TCP performance depends largely on byte counts, this does not affect the measurements, but it allows us to store a cached copy of the pages used to generate the results in our bitbucket repository.
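The capture step can be sketched as follows; the URL and user-agent string below are placeholders, not the exact values used by our scripts:

```shell
# Pretend to be Chrome so the server sends the same page Chrome would see:
UA="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86 Safari/537.36"

# -p fetches all page requisites (images, CSS, scripts);
# -H spans hosts so resources served from other domains are fetched too.
wget --user-agent="$UA" -p -H "https://example.com/"

# Compress the capture to approximate gzip transfer encoding:
gzip -r example.com/
```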
Pages were served using NGiNX configured to serve static content running on port 8000 on the loopback interface. The mget download manager was used to download these pages using 6 parallel HTTP connections (Chrome uses 6 parallel connections to download web content). Each connection was used to retrieve up to 2 resources (in line with the paper’s observation that every third request used a new connection on average). Dummynet was configured to intercept traffic to and from the NGiNX server and add delays and rate limiting as shown above.
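NGiNX exposes TFO through the `fastopen` parameter of its `listen` directive (available when NGiNX is built with TFO support on a TFO-capable kernel). A minimal server block of the kind described above might look like the following sketch; the root path and queue length are illustrative, not our actual configuration:

```nginx
server {
    # fastopen=N bounds the queue of TFO connections that have not yet
    # completed the three-way handshake.
    listen 127.0.0.1:8000 fastopen=256;
    root /var/www/captured-pages;
    location / {
        try_files $uri $uri/ =404;
    }
}
```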
The Linux sysctl net.ipv4.tcp_fastopen was used to turn TCP Fast Open on and off. When switching TCP Fast Open off, we set the flag to 0; when switching it on, we set it to 519. The significance of this number is explained in the figure below:
As the figure above shows, this value lets us isolate the performance enhancement offered by TCP Fast Open. The original authors state clearly that all results were generated with the cookie in the cached state; using 519 simulates this without requiring us to ensure the cookie is actually cached before generating results.
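The sysctl is a bitmask, and 519 decomposes into four flags. The flag names below paraphrase the kernel's ip-sysctl.txt documentation:

```shell
CLIENT=1              # 0x001: clients may send data in the opening SYN
SERVER=2              # 0x002: servers may accept data in the opening SYN
CLIENT_NO_COOKIE=4    # 0x004: clients send data in the SYN without a cookie
SERVER_NO_COOKIE=512  # 0x200: servers accept data in the SYN without requiring a cookie

echo $((CLIENT | SERVER | CLIENT_NO_COOKIE | SERVER_NO_COOKIE))   # prints 519

# Apply it (requires root):
# sudo sysctl -w net.ipv4.tcp_fastopen=519
```

The two no-cookie bits are what skip the cookie exchange entirely, which is how a single sysctl value stands in for "the cookie is already cached."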
The tables below show how HTTP persistent connections affect the percent improvement achieved by TCP Fast Open. In general, since one RTT is saved per connection, the saving depends on the number of connections established and on how long each connection lasts.
We demonstrate the real-world effect of TCP Fast Open on page load time for Google.com, as measured by Chrome’s page speed tool, below. Google was chosen because it is one of the very few websites that implement TCP Fast Open on the server side. Please note that a high-latency connection to Google was simulated in a virtual machine running Ubuntu to generate the images below.
Critique of the paper
Despite the proliferation of techniques such as HTTP persistent connections and HTTP pipelining, which reduce the number of connections established to download content, TCP Fast Open remains a useful addition to the TCP protocol. As our results indicate, it is quite capable of delivering page load time improvements even on today’s internet. Additionally, TFO is a transport-layer addition and could be used to reduce latency for other protocols built atop TCP, such as SSH. Unfortunately, TFO is currently implemented only in the Linux kernel, limiting the services to which it can be applied.
TFO could also help speed up the adoption of HTTPS. Currently, HTTPS spends one extra RTT compared to HTTP on the ClientHello and ServerHello exchange, so it can effectively send data only after spending two RTTs on connection setup (one for the TCP handshake, one for the SSL handshake). TFO can reduce this back to one RTT (for the SSL handshake).
The main challenge that TFO faces with respect to adoption in the near future is the development of newer protocols such as QUIC that allow 0-RTT establishment of encrypted connections. However, these protocols are experimental and not yet supported by production servers. Additionally, the use of UDP in the proposed replacements could hinder adoption.
We faced a few challenges when setting up our testbed. We describe the challenges we faced together with the steps taken to overcome them below.
First, we found that mget is not available in the Ubuntu repositories. We worked around this limitation by writing a script to download the dependencies, check out the mget repository, and compile it. This script (install-mget.sh) is automatically invoked by run.sh.
Second, we found that the version of NGiNX in the Ubuntu repositories does not support TCP Fast Open. We worked around this using a script similar to the one for mget, which is also invoked by run.sh.
Third, we noticed that Dummynet’s Linux port does not officially support newer kernels. We worked around this using the patch posted on the Arch Linux User Repository at https://aur.archlinux.org/packages/dummynet/ (we have written a script that downloads dummynet, applies that patch, and installs it).
When we implemented TCP Fast Open using the workarounds above, we found that we were not getting consistent results, since the web pages under consideration change frequently. We worked around this by downloading the pages ahead of time and caching them in our repository (with the contents replaced by random bytes, as previously mentioned).
Despite applying this fix, we noticed that we were getting poorer results with TCP Fast Open enabled than with it disabled. After some debugging with Wireshark, we found that this was due to insufficient buffering along the dummynet pipes, which led to congestion, packet loss, and underutilization of the link. We fixed this by increasing the buffering at each of the dummynet queues.
Finally, we noticed significant noise caused by the noisy-neighbor problem on EC2. We worked around this by averaging across 12 runs.
This work was supported by an EC2 educational credit grant for the CS244 course. All images in this report were made using Apple Keynote, VMware Fusion, and Google Chrome (for screenshots). The old Amazon screenshot is from the Internet Archive.