Nathan Tindall (ntindall @ stanford) and Eric Theis (ertheis @ stanford)
Introduction & Motivation
As the Internet has evolved, TCP has emerged as the lingua franca for many networked applications. TCP provides a reliable, in-order delivery service between two networked nodes. To get data flowing between hosts, however, the vanilla TCP model requires a full round-trip time (RTT) of handshaking before any application data is exchanged, so the client's first response arrives only after two RTTs.
The vanilla TCP model, illustrated above, is:
1. Client sends a SYN packet to the server.
2. Server responds with a SYN-ACK.
3. Client sends an ACK along with its payload (the data it wanted to send).
4. Server responds with an ACK along with a response payload (if any).
Steps 3 and 4 repeat until the connection is torn down. This protocol is acceptable for long-lived flows, since the overhead of one extra RTT is small relative to the duration of the connection. In the extreme case where the connection is torn down immediately after the server responds, however, the overhead is significant, especially given the higher RTTs on mobile networks. As the Internet has aged, the number of short-lived flows has increased. HTTP requests, for example, are generally short-lived flows that resolve quickly because Internet objects are relatively small (7.3 KB on average). The number of objects on each page, however, is very high: the average Internet page was 300 KB in 2011 and has doubtless grown since. Web browsers must therefore open a large number of connections to fully resolve a page, with round-trip time dominating page load time. Furthermore, many connections that would otherwise be long-lived are killed by NATs or by mobile devices trying to conserve power. In a world where objects are small and flows are many, handshaking overhead has negative implications for network latency and user experience.
TCP Fast Open
Radhakrishnan et al. propose a new scheme (Fast Open) that allows clients to send data in their first packet (the SYN) to the server. Their goal is a design that lets data flow between server and client within the first round trip without introducing significant vulnerability to denial-of-service attacks. They also measure the performance gains this yields in practice.
In TCP Fast Open (TFO), the server holds a secret key that it uses to issue cookies to clients by encrypting their IP addresses. When a client sends a TFO cookie request to a TFO-enabled server, the server responds with the cookie. The client can then present the cookie in the initial SYN packet of subsequent connections. If the cookie validates, the server may send the response payload in its first message to the client; otherwise, it discards the payload and falls back to the normal three-way handshake.
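On Linux, where our experiments ran, enabling TFO is a one-line sysctl; the sketch below assumes a Linux 3.7+ kernel and an nginx build with TFO support (the queue length of 256 is our choice, not a required value):

```shell
# Bit 0 (1) enables client-side TFO, bit 1 (2) enables server-side TFO;
# 3 enables both, which a single-machine client/server setup needs.
sudo sysctl -w net.ipv4.tcp_fastopen=3

# nginx must also opt in on its listening socket, e.g. in nginx.conf:
#   listen 80 fastopen=256;  # 256 = max pending TFO requests to queue
```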
The cookie is required to prevent denial-of-service attacks on the server. Without any authentication, an attacker could flood the server with data-bearing SYNs and force it to spend time processing spurious requests. The cookie limits this work to clients that have previously requested and received a valid cookie for their own IP address. However, it's important to note that any TFO client can obtain a cookie, so additional mechanisms are needed to prevent resource exhaustion on the server (these are discussed in section 3.4 of the paper).
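The cookie construction can be sketched from the command line. This is only an illustration of the idea (the key and client address below are made up, and the kernel's exact encoding differs): the server encrypts the client's IP under a secret AES-128 key, and the resulting ciphertext is the cookie it later verifies.

```shell
# Hypothetical 128-bit server secret (hex); on Linux the real key lives in
# the net.ipv4.tcp_fastopen_key sysctl.
KEY=00112233445566778899aabbccddeeff

# Cookie = AES-128 encryption of the client's IP address. Pad the address
# to one 16-byte block, encrypt it, and print the ciphertext as hex.
printf '%-16s' '198.51.100.7' |
  openssl enc -aes-128-ecb -K "$KEY" -nosalt -nopad |
  od -An -tx1 | tr -d ' \n'
echo
```

Verification on the server is the same computation: re-encrypt the source IP of the incoming SYN and compare it against the presented cookie, so no per-client state is required.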
Results from Radhakrishnan et al.
The original authors used dummynet and Chrome's page-replay tool to compare download latency for TFO-enabled Chrome against standard Chrome. On average, they found that TFO reduces page load time by 10%, with the improvement scaling inversely with the complexity of the page (see table below).
The primary goal of our project is to reproduce the original paper's table 1 (copied above). The table demonstrates a reduction in load times when using TCP Fast Open with its added security mechanisms, which is the paper's primary contribution. As an extension, we also investigated how much load times improve when combining TCP Fast Open with a larger initial congestion window.
We chose this table because the primary benefit of TCP Fast Open is reduced load time. The other figures in the original paper either demonstrate a negligible difference (CPU load) or are captured by the information in table 1 (more general latency measurements). These results are worth re-confirming, since the number of short-lived flows has likely increased over time.
We are interested in the combination of TCP Fast Open and an increased initial congestion window, as we suspect these two optimizations complement one another: an init_cwnd increase amplifies the impact of the handshake's "fast open" by allowing more data to be sent to the client along with the SYN-ACK. It might even be possible to download an entire page in one RTT. Our hypothesis is that the benefit of TCP Fast Open scales directly with init_cwnd, since more data can be sent during the initial handshake.
The risk of such a maneuver is that it could introduce congestion into the network. One way to address this might be for clients to use data from previous congestion window negotiations with a server as a heuristic for choosing a window that is less likely to congest the network path. Although this doesn't accommodate variability in the network, it would reduce the chance that a client blasts an extremely large number of packets at an endpoint whose previous communications have exhibited high packet loss. This is an area for further research.
We decided to use dummynet, as the authors did, using nginx to emulate a TFO-enabled server and issuing successive page requests from a dummynet client. We downloaded and cached static versions of the test web pages (Amazon, NYTimes, WSJ, and Wikipedia) in order to remove external network latency as an independent variable. We used mget instead of wget in order to emulate the behavior of Chrome, which (unlike wget) opens multiple connections to the same host.
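Dummynet shapes traffic through ipfw pipes; below is a minimal sketch of the kind of configuration we mean, with assumed link parameters (the actual delay and bandwidth vary per experiment):

```shell
# Send all IP traffic through pipe 1, which emulates the bottleneck link.
ipfw add 100 pipe 1 ip from any to any

# 10 ms of one-way delay in each direction yields a 20 ms RTT;
# bandwidth is capped at 10 Mbit/s.
ipfw pipe 1 config delay 10ms bw 10Mbit/s
```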
Averaging across twelve runs for each data point, our reproduction of table 1 is above. Our results are comparable with those of Radhakrishnan et al. and previous years' reproductions, although the improvements we measured are even larger overall. Since TFO saves one RTT per connection, this suggests that web pages now load even more resources per connection than they did previously, confirming a central hypothesis of the paper: that web page sizes will continue to increase.
Discussion & Critique
It is clear that TFO continues to provide a sizable improvement in page load time for applications that use the kernel's TCP stack. However, because it is implemented in the kernel as a transport-layer modification, its applicability is limited. The recent trend of implementing in-order delivery services outside the kernel means that TCP usage is decreasing: new protocols such as QUIC have been developed specifically to address the inefficiencies caused by redundant layering in the TCP/TLS stack while also providing 0-RTT connectivity. Furthermore, application-layer protocols such as SPDY multiplex many requests over long-lived TCP flows, reducing the need for short-lived connections in the first place. Possibly because this ongoing development may soon render the feature obsolete, TFO has not seen much deployment outside of the Linux kernel and Google. Thus, while it may inspire handshake designs in future-generation protocols, it is unlikely to have a significant impact on TCP performance in most systems.
Challenges & Attempted Extensions
Previous years' teams did much of the legwork in overcoming obstacles, and for the most part we were able to avoid the problems encountered in earlier reproduction attempts (such as getting dummynet configured properly for newer Linux kernels). Turning on TFO and adjusting the initial congestion window is straightforward (just a few shell commands), since both are implemented in the kernel. The biggest challenges were installing the dependencies and setting up the test environment in order to run the experiment. Luckily, we were able to leverage some of the previous years' installation scripts.
We planned to modify the initial congestion window to see how various window sizes affect performance with TCP Fast Open. We hypothesized that there would be an asymptotic limit to the improvement a larger congestion window could provide (under the admittedly weak assumption of a congestion-free dummynet topology). This limit would be reached when all data is transferred in about one round-trip time (along with the SYN-ACK). It would have been interesting to see how RTT, page size, and resource fragmentation affect how quickly this asymptote is approached.
We used the same method as previous reproducers of the init_cwnd increase paper (see the chang_wnd function here) to modify the initial congestion window. This included turning off the TCP metric tracking/tuning functionality and raising the initial receive window. Despite this, we were unable to see significant changes in performance between runs with different congestion windows. We suspected this had to do with our use of the loopback interface to issue and serve requests; however, every time we changed the window policies for the loopback interface, our connection to AWS was terminated and we were forced to restart the instance. We then modified the HTTP GET command to use the IP address of eth0 but saw the same errant behavior. We suspect this may be a dummynet-specific issue, and would encourage others to attempt the same test on a mininet topology (although this may cause issues with TFO, as reported by previous emulators).
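For reference, the window change amounts to a route tweak plus one sysctl. A sketch of the approach, assuming an eth0 device and a gateway at 10.0.0.1 (both placeholders), with a window of 10 segments taken from the init_cwnd proposal:

```shell
# Keep the kernel from reusing cached per-destination metrics, which would
# otherwise override the initial window set below.
sudo sysctl -w net.ipv4.tcp_no_metrics_save=1

# Raise the initial congestion window (and initial receive window) on the
# default route to 10 segments.
sudo ip route change default via 10.0.0.1 dev eth0 initcwnd 10 initrwnd 10
```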
- Set up a c4.large AWS EC2 instance with Ubuntu 14.04 as the OS.
- Login to the instance and install git:
sudo apt-get update
sudo apt-get install git
- Clone the git repository into your home directory. It is important that this project runs in a directory named “cs244-assign-3” in your home directory.
cd ~; git clone https://github.com/ntindall/cs244-assign-3.git
- cd into the cs244-assign-3 directory and execute the run.py script with sudo privileges. This will take nearly three hours to run, so please plan accordingly.
cd ~/cs244-assign-3; sudo python run.py
- Examine the results located in table_results.csv; our table is derived from the first five columns.
Note that you will see slight variations in performance due to factors like download randomization, variation in nginx performance on EC2, and randomization in the dummynet emulator. These variations were seen in previous years as well (L. Gerrity & S. Sthatos, S. Srinivasan & R. Verma). The 12 samples per data point (which account for the long runtime) should reduce the likelihood of a large difference between your results and ours.