CS 244 ’15: An Argument for Increasing TCP’s Initial Congestion Window


Daniel Chiu & James Hong

1. Introduction

In their original paper, Dukkipati et al. lay out an argument for increasing TCP’s initial congestion window size [1]. Their motivation is to improve the average latency of HTTP responses: web pages have grown in size, and small initial congestion windows incur avoidable latency in the form of extra RTTs. (In fact, as the authors point out, many modern web browsers open multiple concurrent TCP connections to transfer data more quickly.) With a larger initial TCP congestion window, servers can transfer web page data much more quickly, reducing the need for inefficient hacks that circumvent TCP’s congestion control. Specifically, the authors argue that a larger initial congestion window of 10 segments is large enough to fit 90% of responses from several Google pages and services, as well as most web responses from commonly trafficked websites.

In their experiments, Dukkipati et al. found that by increasing the initial congestion window to at least 10 segments (approximately 15KB), they were able to achieve an average latency improvement of 11.7% (68ms) in their average Google data center (AvgDC, with a median bandwidth of 1.2Mbit/s and median RTT of 70ms) and an 8.7% (72ms) improvement in slow data centers (median bandwidth of 500Kbit/s and median RTT of 70ms). They also performed experiments bucketed by bandwidth, round-trip time, and bandwidth-delay product, producing graphs of the absolute and percentage latency improvement for each bucket (shown below). The authors found that average response latency improved across all buckets for all three properties, with the largest benefits on high-RTT and high-BDP paths and the smallest benefits on low-RTT paths.

[Figure: Original figure 5 published by Dukkipati et al. [1] for their average data center (AvgDC). We replicate the top two plots.]

2. Subset Goal and Motivation

We first replicate the top two graphs from figure 5 (above), which show the absolute and percentage improvements in web search “response latency” from an initial congestion window of 10 packets under a variety of bandwidth and RTT conditions. These two plots capture the crux of the authors’ argument: a larger initial congestion window delivers consistent, meaningful improvements in web server response latency.

We also replicate figure 2 (below), which shows response latency for a range of initial congestion window sizes. This plot shows diminishing returns as the window size increases past 10, and helps justify the paper’s recommendation.

[Figure: Original figure 2 by Dukkipati et al. [1].]

3. Subset Methodology

Our experiment is set up as follows. We define a Mininet topology consisting of two hosts, acting as server and client, connected by a switch. For this experiment, we disable random packet drops and set the buffer size to be sufficiently large. We vary the bandwidth of the bottleneck link and set link delays according to the desired RTT. By default, our topology runs with an RTT of 70ms and a bandwidth of 1.2Mbit/s, the median values for AvgDC in the paper. We use the CUBIC congestion control algorithm at both the client and server, though this is configurable. In our topology, RTT and bandwidth are independent, whereas in the paper they are highly correlated through the traffic mix at AvgDC; we discuss the consequences of this correlation in our results section.
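A minimal sketch of such a topology follows; the class name and specific parameter choices are illustrative assumptions, not our exact code:

    from mininet.net import Mininet
    from mininet.topo import Topo
    from mininet.link import TCLink

    class TwoHostTopo(Topo):
        """Client and server joined by one switch; the server link is the bottleneck."""
        def build(self, bw=1.2, rtt_ms=70):
            client = self.addHost('client')
            server = self.addHost('server')
            switch = self.addSwitch('s1')
            # Link delays are one-way, so the RTT is split across four traversals.
            delay = '%.2fms' % (rtt_ms / 4.0)
            self.addLink(client, switch, cls=TCLink, delay=delay)
            self.addLink(server, switch, cls=TCLink, bw=bw, delay=delay,
                         max_queue_size=1000)  # large buffer, no random drops

    net = Mininet(topo=TwoHostTopo(bw=1.2, rtt_ms=70), link=TCLink)
    net.start()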

On the server node, we run a simple web server based on Python’s BaseHTTPServer. Our code allows for two types of request handlers: one that serves statically sized pages, and another that serves pages whose sizes are drawn from a 7-piece piecewise model of the probability distribution of Google Web Search response sizes shown in figure 1, red line (below); a sketch of the sampling approach follows the figure.

[Figure: Original figure 1 by Dukkipati et al. [1]. Our server can simulate the CDF for Google Web Search responses if asked to do so.]
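One way to implement such a handler is inverse-transform sampling against a piecewise-linear CDF. The breakpoints below are illustrative placeholders, not the 7 pieces we actually fit to figure 1:

    import random
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

    # Illustrative (probability, size-in-bytes) breakpoints of a piecewise-linear
    # CDF; the real handler uses 7 pieces fit to figure 1 of the paper.
    CDF_POINTS = [(0.0, 500), (0.2, 2000), (0.5, 6000), (0.8, 9000), (1.0, 15000)]

    def sample_response_size():
        """Inverse-transform sampling: draw u ~ U(0,1), then interpolate."""
        u = random.random()
        for (u0, s0), (u1, s1) in zip(CDF_POINTS, CDF_POINTS[1:]):
            if u <= u1:
                frac = (u - u0) / (u1 - u0)
                return int(s0 + frac * (s1 - s0))
        return CDF_POINTS[-1][1]

    class SearchHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = 'x' * sample_response_size()
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(('', 80), SearchHandler).serve_forever()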

To set the initial congestion window on the server, we use ip route and append the initcwnd flag to the appropriate routing table entry. We must also set initrwnd (the advertised receive window) on the client using ip route; the default receive window varies across versions of Linux, Windows, and OS X. Since we assume the majority of Google users run Windows, we approximate the Windows default of 65,535 bytes, or roughly 43 packets [2]. For our main simulation, and for the distribution of Google Web Search responses shown in the paper, the congestion window almost never needs to grow this large. A sketch of the configuration step appears below.
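The initcwnd and initrwnd flags are standard iproute2 route options; the helper below shows roughly how we apply them from the experiment script (the exact route being rewritten reflects an assumption about Mininet’s default 10.0.0.0/8 addressing):

    def set_initial_windows(server, client, initcwnd=10, initrwnd=43):
        # Rewrite each host's kernel route to the test subnet in place,
        # appending the window flags.
        server.cmd('ip route change 10.0.0.0/8 dev %s proto kernel scope link '
                   'src %s initcwnd %d'
                   % (server.defaultIntf(), server.IP(), initcwnd))
        client.cmd('ip route change 10.0.0.0/8 dev %s proto kernel scope link '
                   'src %s initrwnd %d'
                   % (client.defaultIntf(), client.IP(), initrwnd))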

4. Subset Results

Overall, our results for figure 5 match the latency patterns reported in the original paper for typical RTT ranges and bottleneck bandwidths (the original paper provides the distribution of requests falling under each bandwidth and RTT category). As shown below, our plots for figure 5 differ from those presented by Dukkipati et al. in two significant ways. For the following plots, we configured the web server to serve 9KB responses.

[Figure: Our replicated response latency improvement vs. RTT plots from figure 5. Bandwidth was fixed at 1.2Mbit/s and 9KB responses were served; initcwnd was set to 3 and 10 packets. Note: at RTT = 20ms, other factors dominated the response latency.]

[Figure: Our replicated response latency improvement vs. bandwidth plots from figure 5. RTT was fixed at 70ms and 9KB responses were served; initcwnd was set to 3 and 10 packets.]

First, for low bandwidths (512Kbit/s and below), we observe small or no absolute and percentage improvements in latency from increasing the congestion window, whereas Dukkipati et al. report large improvements. This is because our topology holds the minimum RTT constant at 70ms while varying the bandwidth. In the paper, bandwidths are estimated by subnet and are strongly (negatively) correlated with RTT: subnets with lower estimated bandwidths also had much higher minimum RTTs, on the order of 4-5 times the 70ms average. Even with the initial congestion window set to 10 packets, it is physically impossible to serialize 9KB of data onto the link before ACKs begin coming back when the bandwidth is below 1Mbit/s and the RTT is held at 70ms; this can be verified with tcpdump. As bandwidth increases, the absolute improvement in response latency approaches one RTT (a 33% improvement), as expected for 9KB responses.
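A back-of-the-envelope check of this serialization argument, assuming a 9KB response and a 70ms RTT:

    # Time to serialize a 9KB response at a given bottleneck bandwidth.
    # If this exceeds the RTT, ACKs begin clocking the sender before the
    # initial window is exhausted, so a larger initcwnd cannot help.
    response_bytes = 9000
    for bw_kbps in (256, 512, 1000, 2000):
        serialize_ms = response_bytes * 8.0 / bw_kbps  # bits / (kbit/s) = ms
        print('%4d Kbit/s -> %6.1f ms to send 9KB (vs. 70 ms RTT)'
              % (bw_kbps, serialize_ms))
    # At 512 Kbit/s the response takes ~141 ms to serialize, so ACKs return
    # mid-transmission; at 2 Mbit/s it takes ~36 ms, and initcwnd matters.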

Second, for very large minimum RTTs (1s or greater), we found no improvement from setting initcwnd to 10 via ip route when using the default Linux TCP implementation. This turned out to be caused by the RTT exceeding the client’s initial retransmission timeout (RTO) for the SYN packet, which defaults to 1s. The server would receive duplicate SYNs and reply with duplicate SYN/ACKs, the client would reply with duplicate ACKs, and TCP at the server would drop the congestion window to 1 when it sent data; we confirmed this with tcpdump and tcpprobe (see the sketch after the figure). When we recompiled the kernel with a larger minimum RTO, we were able to reproduce the latency vs. RTT plot from figure 5 exactly (shown below). Ultimately, this is not a major problem, as fewer than 3% of requests in the Google AvgDC experiment had RTTs greater than 1s. To put this in perspective, the RTT for radio communication between the earth and the moon is roughly 2.6 seconds [4]. We suspect that Google clients issuing requests with RTTs above 1s may have been configured with larger initial RTOs to begin with. Barring this issue, the improvement we observe in latency is one RTT, approaching a 33% decrease as before.

[Figure: Our replicated response latency improvement vs. RTT plot from figure 5 with the patched kernel. As before, bandwidth was fixed at 1.2Mbit/s and 9KB responses were served; however, the initial RTT estimate and minimum RTO were raised so as not to interfere with the experiment at extraordinarily large RTTs.]
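For reference, the congestion-window collapse described above can be watched directly with the tcp_probe kernel module (a sketch; the log column holding snd_cwnd may vary by kernel version):

    # Log cwnd on the server for port-80 flows while a fetch runs.
    server.cmd('modprobe tcp_probe port=80 full=1')
    server.cmd('cat /proc/net/tcpprobe > /tmp/cwnd.log &')
    # ... run the fetch, then inspect /tmp/cwnd.log (snd_cwnd is the
    # seventh column in the kernels we are familiar with).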

To reproduce figure 2 (below), we vary the initial congestion window while holding RTT and bandwidth constant at 70ms and 1.2Mbit/s, respectively. Response sizes are drawn from our piecewise simulation of Google Web Search responses. As shown below, our plot closely resembles figure 2 of Dukkipati et al. and confirms that, with a distribution of response sizes concentrated heavily around 9KB, there are diminishing benefits to an initial congestion window larger than 10 packets. A sketch of the measurement loop follows the figure.

[Figure: Our replication of figure 2. RTT and bandwidth were fixed at 70ms and 1.2Mbit/s; response sizes were drawn from the simulated figure 1 distribution, and 100 fetches were performed per window size. The blue bars represent one standard deviation above and below the mean response latency.]
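Latencies were measured with curl on the client; a sketch of the loop (the -w timing variable is standard curl, while the helper name and host handles are our assumptions):

    def measure_latencies(client, server_ip, n=100):
        """Fetch the page n times and return total fetch times in ms."""
        latencies = []
        for _ in range(n):
            out = client.cmd("curl -s -o /dev/null -w '%%{time_total}' "
                             "http://%s/" % server_ip)
            latencies.append(float(out) * 1000.0)  # seconds -> ms
        return latencies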

[Figure: Our custom plot showing that an initial cwnd of 10 yields a one-RTT improvement in response latency whenever the response cannot fit within the default 3-packet cwnd. The percentage decrease in latency falls as the response size grows.]

Finally, we include a plot of latency improvement vs. response size for initial window sizes of 3 and 10 (above). As expected, there is no difference when the response fits within 3 packets. When the response is too large for 3 packets, the response from the server with an initial congestion window of 10 completes one RTT faster. This follows directly from a simple slow-start model, sketched below.
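In that model the window starts at initcwnd and doubles each RTT, so the number of round trips needed to deliver S segments is ⌈log2(S/initcwnd + 1)⌉, ignoring delayed ACKs and serialization time:

    import math

    def slow_start_rounds(segments, initcwnd):
        """Round trips to deliver `segments` full-sized segments when the
        window starts at `initcwnd` and doubles each RTT (no losses)."""
        return int(math.ceil(math.log(float(segments) / initcwnd + 1, 2)))

    # A 9KB response is 6 segments of 1500 bytes:
    print(slow_start_rounds(6, 3))   # 2 round trips with initcwnd=3
    print(slow_start_rounds(6, 10))  # 1 round trip with initcwnd=10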

5. Challenges

We had the most difficulty reproducing the paper’s results for low bandwidths and extraordinarily large RTTs, for the reasons explained in the Subset Results section. We also lacked the information and resources to reproduce the exact request distribution of the original paper. Instead, we varied one variable at a time while keeping the others fixed; this led to consistent and explainable results, but frustrated our efforts to reproduce the graphs exactly. We did not attempt to reproduce the third plot of figure 5, latency vs. BDP, because BDP depends on RTT and bandwidth, over which we did not have a joint distribution.

When implementing and testing our code, we encountered several challenges also noted by groups attempting to replicate figure 5 in previous years. For example, Jung and Quinonez report that sporadic ARP broadcasts caused irregular delays [3]. This was the case in our simulation as well, and it becomes especially problematic when link delays are long. We eventually traced the problem to overzealous default ARP timeouts of 60s, which we raised to 3,600s. In a similar vein, when verifying connectivity with ping for the first time, the first ping times out for very large RTTs because the ARP cache must first be populated; subsequent pings succeed.
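For reference, Mininet can also sidestep ARP broadcasts entirely by installing all-pairs static ARP entries up front (a standard Mininet option, though our scripts took the timeout-tuning approach instead):

    # Pre-populate every host's ARP cache so no broadcasts occur mid-run.
    net = Mininet(topo=TwoHostTopo(), link=TCLink, autoStaticArp=True)
    net.start()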

6. Critique

Overall, we found that Dukkipati et al.’s argument for a larger initial congestion window of 10 holds up well in our simulated Mininet topology. With an initial congestion window of 10 packets, average-sized responses complete one RTT sooner. The differences between our simulated results and theirs can largely be attributed to the correlation between bandwidth and RTT among real web clients (the authors mention that most low-bandwidth traffic originates from dial-up modems and mobile networks) and to variation in client TCP configurations (operating system, version differences, etc.). For average-sized web responses, setting an initial congestion window of 10 packets appears to be a simple, low-risk way to achieve faster response latency.

7. Readme

  1. Launch a new EC2 instance. Under Community AMIs, search for cs244-dwchiu-jamesh93-pa3 and create an instance with this image. A c3.large instance should be more than sufficient.
    1. If the Community AMI cannot be found (it may be waiting to be indexed by Amazon), it should suffice to create an instance from the cs244-Spr15-Mininet AMI, also available under Community AMIs.
    2. You will then need to clone the repository for our code, available here: https://chiubaka@bitbucket.org/chiubaka/cs-244-pa-3.git
  2. Once your EC2 instance is up and running, SSH into it as the user: ubuntu.
  3. In the home folder for the ubuntu user, there is a folder called cs-244-pa-3. Navigate into this folder to find the experiment files.
  4. Run git pull to make sure you have the most up-to-date version of the experiment files.
  5. Run sudo ./run.sh to run the experiments. Results will be output to a timestamped results folder in the current directory.
    1. Please note that there is an exception raised when the script cleans up the Mininet topology. This exception does not affect functionality of the code and seems to be a bug related to Mininet’s interaction with the VM.
  6. Compare the resulting images to the images available in the example-results directory. Results for RTTs at or above 1000ms may not match, because the kernel’s initial RTO is set to 1000ms by default and interferes with the experiment. (The AMI runs the unmodified kernel, since it is not realistic for an admin tuning initcwnd at a server to also configure client settings.)

References

[1] Dukkipati, Nandita, Tiziana Refice, Yuchung Cheng, Jerry Chu, Tom Herbert, Amit Agarwal, Arvind Jain, and Natalia Sutin. “An argument for increasing TCP’s initial congestion window.” Computer Communication Review 40, no. 3 (2010): 26-33.

[2] Sajal. “Tuning Initcwnd for Optimum Performance.” CDN Planet. October 11, 2011. Accessed May 29, 2015. http://www.cdnplanet.com/blog/tune-tcp-initcwnd-for-optimum-performance/.

[3] Jung, Raejoon, and Stephen Quinonez. “CS 244 ’14: An Argument for Increasing TCP’s Initial Congestion Window.” Reproducing Network Research. June 3, 2014.

[4] “Communication Delay.” Accessed May 29, 2015. http://www.spaceacademy.net.au/spacelink/commdly.htm.


2 responses to “CS 244 ’15: An Argument for Increasing TCP’s Initial Congestion Window”

  1. Running the code was straightforward: directions were clear, and setup was easy. The results produced by the code matched well with the blog post; the plot generated using a patched kernel was not produced by the code, as mentioned by the blog post authors in their Readme section. Per the rubric, we judge that “Code ran to completion and results matched very well with the blog post”, and thus assign a score of 5.

    In the paper by Dukkipati et al., measurements were made in an actual Google data center. The blog post authors used a Mininet setup with a single sender sending to a single receiver across a single connecting link with one intermediate switch. The validity of this emulation setup was, we felt, inadequately discussed by the authors.

    The blog post authors do a good job of exploring the differences between their emulation results and the original paper results, particularly in highlighting the negative correlation between flow bandwidth and RTT in an actual data center, whereas in their setup the minimum RTT was held constant as bandwidth varied. The authors’ approach to modeling the web search load, using a piecewise approximation of the specified CDF of HTTP response sizes, also seemed intelligent.

  2. Thank you for the detailed feedback!
    We definitely considered the appropriateness of a single switch setup when designing the experiment.
    1. The largest difference is by far the precise distribution of request sizes, bandwidths, RTTs, and client configurations (advertised rwnd, initial RTO, which ‘should’ be set to 1 second per RFC 6298, etc.).
    2. Some additional care also needs to be taken for setting up the switch to represent the path; for example, disabling ARP broadcasts (since that would be equivalent to all the routers along the path suddenly performing ARP).
    3. There are also some subtle differences in measurement. For example, Google front-end machines complete the handshake, receive the request, send the request to backend servers, and then forward/repackage the backend response to the client. The paper makes it appear that only the latency of the final step, sending the response to the client, is reported, and that the measurement itself is taken at the server. By contrast, in our setup the measurement is made by curl on the client. This means that the numeric percentage improvements in our graphs and theirs are not directly comparable, though the trends are.

    Overall, we believe there are advantages and disadvantages to both our simulation and the data center measurements. In both the paper and our simulation, it is clear that the main mechanism for latency improvement is a saving of one RTT when the response is larger than the original initcwnd. The simulation demonstrates that (in theory) bandwidth and RTT matter mostly insofar as they can interfere with this one-RTT saving (through timeouts, or bandwidth that is not sufficiently large). By contrast, the Google results are more realistic, and noisier, given that not all requests, responses, and clients meet the constraints for an RTT improvement, the size of which is itself variable.
