CS244 ’17: An Argument For Increasing TCP’S Initial Congestion Window


Samuel Colbran (samuco@stanford.edu)

Introduction

To prevent the Internet from collapsing during times of high load, many TCP congestion control algorithms use the concept of a congestion window. One such algorithm is Cubic, which is the default on many Linux distributions. The idea is fairly simple: rather than considering only the receiver window size, the sender also considers whether the network is congested and adjusts its sending rate accordingly. The receiver window size indicates the number of bytes still free within the receiver’s buffer. The congestion window, on the other hand, limits sending in order to avoid overrunning routing devices along the path. The congestion window typically starts at a size of three to four data segments. Cubic initially uses a ‘slow start’ algorithm, which increases the congestion window each time a data segment is acknowledged. Once congestion is detected (packet loss, a timeout, etc.), Cubic switches to a ‘congestion avoidance’ phase.

The paper “An Argument for Increasing TCP’s Initial Congestion Window” [1] argues that the initial TCP congestion window of up to 4 segments, standardised in 2002, is out of date and should be increased to at least 10 segments. When a browser opens a connection (or, usually, several connections to mitigate this and other problems) to a server and requests a webpage, TCP starts slowly with the small initial congestion window and then grows it exponentially as acknowledgements come back. However, webpages are generally very small, so the transfer and the corresponding connection usually finish before the congestion window has opened to a reasonable level. These short-lived flows are often also competing against longer-lived flows that have already opened their congestion windows to achieve flow-rate fairness. The authors propose that if the initial congestion window were larger, roughly 90% of web objects could be sent in one RTT, resulting in significantly lower latency. On the other hand, if the initial window is too large and the network becomes congested, more packets will be dropped and more retransmissions will be required.
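
To get a feel for the argument, consider how many round trips a small object needs under slow start. The following is a rough back-of-the-envelope sketch (not taken from the paper or my experiment code): it assumes an MSS of 1460 bytes, pure window doubling every RTT, and no losses or header overhead.

# Rough estimate of round trips needed to deliver an object under slow start.
# Assumptions: MSS = 1460 bytes, the window doubles every RTT, no losses.
MSS = 1460  # bytes of payload per segment

def rtts_to_transfer(object_size_bytes, init_cwnd_segments):
    cwnd, sent, rtts = init_cwnd_segments, 0, 0
    while sent < object_size_bytes:
        sent += cwnd * MSS  # the whole window is transmitted each round trip
        cwnd *= 2           # slow start roughly doubles the window per RTT
        rtts += 1
    return rtts

for size_kb in (10, 15, 50):
    size = size_kb * 1024
    print("%2d KB object: initcwnd=3 -> %d RTTs, initcwnd=10 -> %d RTT(s)"
          % (size_kb, rtts_to_transfer(size, 3), rtts_to_transfer(size, 10)))

Under these assumptions, a response of roughly 13–14 KB takes three round trips with an initial window of 3 segments but only one with an initial window of 10, which is the crux of the paper’s argument.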

The authors found that, by increasing the initial congestion window to at least 10 segments, users connecting to an average Google data centre (AvgDC), with a median bandwidth of 1.2 Mbps and RTT of 70 ms, saw an average latency improvement in Web Search of 11.7% (68 ms). An improvement of 8.7% (72 ms) was also seen for their slow data centre (SlowDC), where the majority of users have poor Internet connectivity and the median bandwidth is a measly 500 Kbps. The experiments conducted by the authors utilise log files that recorded each access to the data centre along with the minimum TCP RTT seen during the connection. These results were bucketed by RTT, bandwidth and bandwidth-delay product to produce figure 5, which is shown in the next section. The authors conclude from these results that a larger initial congestion window of 10 segments is more beneficial than the standard 3.

Subset Goal and Motivation

I aimed to replicate graphs like those in figures 1, 5, and 7 of the paper using emulated end hosts. These are included below.

Figure 1 shows the HTTP response sizes for various websites, and I was curious to see how these have changed between 2010 and 2017. If the majority of response sizes have grown beyond the proposed initial congestion window of 10 segments, further research may be warranted.

Figure 1: CDF of HTTP response sizes for top 100 sites, top 500 sites, all the Web, and for a few popular Google services. Vertical lines highlight response sizes of 3 and 10 segments. 

Figure 5 is the main figure of the paper (in reality it comprises three subfigures), demonstrating the improvement in response time for various combinations of bandwidth and round-trip-time parameters as the initial congestion window is changed from 3 to 10. A consistent improvement across the board speaks for itself and presents a good argument for increasing the initial congestion window size.

Figure 5: Average response latency for Web search bucketed by RTT, BW, and BDP at AvgDC. The left y-axis shows the absolute latency improvement (in ms), and the right y-axis shows the percentage improvement. The buckets are represented in the lower x-axis scale, while the numbers at the top of the plot represent the percentage responses in each bucket. The buckets are named by the upper end of RTT, BW, BDP.

Figure 7 demonstrates the improvement in response time for various HTTP web object sizes.

Figure 7: Average response latency at AvgDC for Web search bucketed by number of segments. The left y-axis shows the absolute improvement (in ms), and the right y-axis shows the percentage improvement. The buckets are represented in the lower x-axis scale, while the numbers at the top of the plot represent the percentage responses in each bucket.

Subset Results

The paper is difficult to replicate in a real environment without the vast resources that Google has. I do not have access to a CDN with 5.5 billion request/response data points from a wide variety of users, each with different network connectivity. Instead, the experiment has been implemented using mininet to automate network setup and experimentation. This approach allows for fast, reproducible results (albeit less representative of reality) that can realistically be obtained given my budget.

Topology

The network topology is very simple and is set up as shown below. It consists of two hosts connected by a switch, with configurable link speed and delay. The AvgDC network is simulated by setting each link to have a delay of RTT / 4 = 70 / 4 = 17.5 ms and a bandwidth of 1.2 Mbps. The link and host parameters are changed dynamically during the various experiments using the techniques described in [2].

topology.png
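
For reference, here is a minimal sketch of this kind of two-host Mininet topology. It is illustrative only: the class and constant names (TwoHostTopo, BW_MBPS, DELAY_MS) are my own rather than taken from the actual scripts, and it assumes TCLink-shaped links and an OVSBridge switch so that no external controller is needed.

# Minimal sketch of the two-host, one-switch topology (illustrative only).
from mininet.net import Mininet
from mininet.topo import Topo
from mininet.link import TCLink
from mininet.node import OVSBridge

BW_MBPS = 1.2     # AvgDC median bandwidth
DELAY_MS = 17.5   # per-link delay; four delayed traversals per round trip give ~70 ms RTT

class TwoHostTopo(Topo):
    def build(self):
        client = self.addHost('h1')
        server = self.addHost('h2')
        switch = self.addSwitch('s1')
        # Shape both links so the end-to-end bandwidth and RTT match the bucket under test.
        self.addLink(client, switch, bw=BW_MBPS, delay='%gms' % DELAY_MS)
        self.addLink(switch, server, bw=BW_MBPS, delay='%gms' % DELAY_MS)

if __name__ == '__main__':
    net = Mininet(topo=TwoHostTopo(), link=TCLink, switch=OVSBridge, controller=None)
    net.start()
    net.pingAll()   # sanity check: ping RTT should be roughly 70 ms
    net.stop()

Each link's delay is applied in both directions, so a round trip crosses four delayed interfaces; this is why the per-link delay is RTT / 4.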

A huge downside to this approach is that all correlation between client RTT and the corresponding bandwidth is lost. A client loading content from the CDN over dial-up is likely to have a higher RTT than one on Google Fiber; in the simulated environment these variables are completely independent. For this reason, I decided to focus on the AvgDC network of the paper rather than SlowDC. The only difference between these networks is that SlowDC serves subnets with a larger proportion of low connection bandwidths. AvgDC has a median RTT of 70 ms and an average bandwidth of 1.2 Mbps. The difference between figure 5 (second subfigure) and figure 6 is not measurable in the experiment, as the RTT in both cases is fixed to 70 ms. The results in figure 6 of the paper are therefore not reproducible in a meaningful way within the simulation, as they rely on the real-world correlation between these variables, and I did not see any way around this problem. For each experiment, I ran each parameter configuration multiple times over a minute and averaged the results.

Response Size (Figure 1)

As the paper was published in 2010, I was curious to explore how HTTP response sizes have changed in the past 7 years. To conduct this experiment I created a Node.js script that downloaded the HTML for the top 500 websites according to moz.com. For each website, the script parsed the HTML, determined the resource paths for images, stylesheets and scripts, and then downloaded those as well. The response sizes for each of the downloaded objects were then plotted using Python.
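
To give an idea of the plotting step, here is a minimal sketch of how such a CDF could be produced. The file name response_sizes.txt (one response size in bytes per line) and the ~1430-byte payload-per-segment figure are assumptions of mine, not taken from the original scripts.

# Minimal CDF-plotting sketch (assumes response sizes, one per line in bytes,
# live in 'response_sizes.txt'; ~1430 bytes of payload per 1500-byte segment).
import numpy as np
import matplotlib.pyplot as plt

SEGMENT_BYTES = 1430  # rough payload carried by one 1500-byte MTU segment

sizes = np.sort(np.loadtxt('response_sizes.txt'))
cdf = np.arange(1, len(sizes) + 1) / float(len(sizes))

plt.semilogx(sizes, cdf, label='Top 500 sites (2017)')
for segments in (3, 10):
    plt.axvline(segments * SEGMENT_BYTES, linestyle='--',
                label='%d segments' % segments)
plt.xlabel('HTTP response size (bytes)')
plt.ylabel('CDF')
plt.legend(loc='lower right')
plt.savefig('top500.png')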

The results paint an interesting picture of how the size of HTTP Web objects has changed over time. In 2010 the paper proposed increasing TCP’s initial congestion window to at least ten segments on the basis that this covered approximately 90% of HTTP Web objects, but my results suggest that it should perhaps be increased further given the shift in response sizes. This result does not surprise me: as global bandwidth increases, there is a tendency to pack more and more into websites.

top500.png

CDF of HTTP response sizes for top 500 sites. Vertical lines highlight response sizes of 3 and 10 segments.

Web Search (Figure 5)

The various Web search experiment replications work as follows. I simulate ‘Web Search’ using real web objects that I downloaded from a “hello world” Google search in Safari on macOS. The client asynchronously downloads each of the web objects from the simulated server. This will not produce exactly the same results as the paper (beyond the differences introduced by the simulation and the lack of real users), simply because the size of ‘Web Search’ is likely to have changed between 2010 and 2017. I could have tried to find an archived version of Google, but I prefer using the modern page as it provides new insight into whether the results in the paper still hold up in 2017. The modern search contains many web objects that are much larger than 15 KB, but this is workable in reality as many of them can be cached.
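
Each configuration is run once with the server’s initial congestion window at the default of 3 segments and once at 10. Reference [2] describes changing the initial window per route with iproute2; below is a sketch of how this could be applied to a mininet host. The helper name set_initcwnd and the loop over all routes are my assumptions, not the actual experiment code.

def set_initcwnd(host, segments):
    """Re-apply each of the host's routes with the given initial congestion window."""
    for route in host.cmd('ip route show').splitlines():
        route = route.strip()
        if route:
            # e.g. 'ip route change 10.0.0.0/8 dev h2-eth0 ... initcwnd 10'
            host.cmd('ip route change %s initcwnd %d' % (route, segments))

# e.g. set_initcwnd(server, 3) for the baseline run, then set_initcwnd(server, 10)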

RTT – BW is fixed to 1.2 Mbps and RTT is configured to 20, 50, 100, 200, 500 and 1000 ms. I did not include values larger than 1000 ms as I had implementation issues (see Challenges).

Figure5_RTT.png

The raw numbers for the RTT test are different (likely a by-product of using the modern Google ‘Web Search’ objects), but the trend is consistent with the paper. It is interesting, though not surprising, to see that the results of the original paper remain relevant today.
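
Rather than rebuilding the network for every point in a sweep, the shaped interfaces can be reconfigured in place. A possible sketch using Mininet’s TCIntf.config is shown below; the helper name set_path and spreading the RTT evenly over the four delayed traversals are my assumptions rather than the exact mechanism used in my scripts.

def set_path(net, bw_mbps, rtt_ms):
    """Reconfigure every shaped interface to a new bandwidth and RTT."""
    per_hop_delay = '%gms' % (rtt_ms / 4.0)  # two links, each delayed in both directions
    for link in net.links:
        for intf in (link.intf1, link.intf2):
            intf.config(bw=bw_mbps, delay=per_hop_delay)

# e.g. set_path(net, 1.2, 500) for the 1.2 Mbps / 500 ms RTT configuration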

BW – RTT is fixed to 70 ms. Bandwidth is configured to 256, 512, 1000, 2000, 3000, 5000, 10000, 20000, 50000, 100000 and 200000 Kbps.

Figure5_BW.png

The trend in the bandwidth graph does not match the paper, but this is most likely due to real-world correlations between bandwidth and RTT that are not captured in the simulation. In reality, a client loading content from the CDN over dial-up is likely to have a higher RTT than one on Google Fiber, so the samples in the bins to the right of the original paper’s graph likely have better RTTs in practice than those on the left. As I have shown in the RTT figure, a higher RTT results in more improvement. It should be noted, however, that the improvement converges to a constant at high bandwidth, as it does in the paper; the convergence values are larger here due to the larger modern web objects.

I did not include the 56 Kbps point that appears in the original paper, as it took a very long time to run with the larger modern search web objects: ‘Web Search’ totals 1.2 MB and would take around 3 minutes to download. This also caused a wide variation in results.

BDP – The BDP experiment is not very meaningful given that no correlation exists between the variables in the simulation (it is essentially the same as the BW experiment), but I included it for completeness. The RTT is fixed at 70 ms and the BW is configured to give BDPs of 1000, 5000, 10000, 50000, 100000 and 200000 bytes.
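
Since the RTT is pinned at 70 ms, each BDP bucket simply maps to a bandwidth via BW = BDP / RTT. A small sketch of that conversion (assuming, as in the paper’s buckets, that BDP is expressed in bytes):

# Map a target BDP (assumed to be in bytes) to a link bandwidth at a fixed 70 ms RTT.
RTT_S = 0.070  # 70 ms

for bdp_bytes in (1000, 5000, 10000, 50000, 100000, 200000):
    bw_mbps = bdp_bytes * 8 / RTT_S / 1e6
    print('BDP %6d bytes -> link bandwidth %.2f Mbps' % (bdp_bytes, bw_mbps))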

Figure5_BDP.png

The trend in the BDP graph is essentially the same as in the bandwidth graph, given the lack of correlation between RTT and BW in the simulation.

Segment Size (Figure 7)

BW is fixed to 1.2 Mbps and RTT is fixed to 70 ms. The client downloads pages from the server with sizes of exactly 3, 4, 7, 10, 15, 30 and 50 segments (MTU = 1500).
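
The fixed-size pages can be generated ahead of time. A sketch is below; it assumes 1460 bytes of payload per 1500-byte MTU segment (IPv4 and TCP headers, no options), whereas the real script may account for header overhead slightly differently.

# Generate response bodies that occupy an exact number of segments (assumed
# payload of 1460 bytes per 1500-byte MTU segment, ignoring TCP options).
PAYLOAD_PER_SEGMENT = 1500 - 40  # MTU minus IPv4 and TCP headers

for segments in (3, 4, 7, 10, 15, 30, 50):
    with open('page_%d_segments.html' % segments, 'wb') as f:
        f.write(b'x' * (segments * PAYLOAD_PER_SEGMENT))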

The results show a roughly constant improvement when the response is larger than 3 segments. Beyond this size the response no longer fits within the initial congestion window of 3 and, as expected, requires a few more round trips (compared to the larger initial congestion window of 10) before the window has opened enough to transmit all the data. This is consistent with the results in the paper once noise is accounted for.

Figure5_SEG.png

Challenges

I had a lot of trouble creating the top-500 website scraper. I initially wrote a version in Python, but it suffered from many strange internal segmentation faults and double frees. I abandoned this approach, as it was practically impossible to debug, and started working on a Node.js version instead. This was more successful, but I am a little wary of running the scraper too many times from the Google Cloud infrastructure, given that it opens connections to a large number of websites in a short amount of time and my account might be banned for behaving like a bot.

The main challenge was creating a simulation that accurately resembled real life. I do not have access to the same resources as Google, and the mininet simulation with constant RTT and bandwidth parameters is a poor substitute. I was also unable to achieve decent results for RTTs above 1000 ms (whereas the paper includes 3000 ms and >3000 ms buckets) due to Linux kernel retransmission timeouts. I was able to make it somewhat work by adjusting the kernel parameters shown below, but the experiments took an impractically long time to run and the results varied significantly, which is why I decided in the end to exclude them from my figure.

 os.system("sysctl -w net.ipv4.tcp_retries1=100 > /dev/null")
 os.system("sysctl -w net.ipv4.tcp_retries2=100 > /dev/null")
 os.system("sysctl -w net.ipv4.tcp_frto=100 > /dev/null")
 os.system("sysctl -w net.ipv4.tcp_frto_response=100 > /dev/null")

Critique

In my simulated mininet topology I found that, even when using modern Google Web objects, an initial congestion window of 10 segments produced substantial improvements compared with one of 3. My results agree with Dukkipati et al.’s, and the differences in graph scale and trend can be attributed to the different web object sizes, the total lack of simulated correlation between bandwidth and RTT, and the fact that I am using a simulation rather than real data. My figure 1 result suggests that 10 segments may already be a little outdated in 2017, but further research using real-world numbers, as in the original paper, would be required before making a concrete decision. In any case, an initial congestion window of 10 is a step in the right direction, as it reduces latency across the board with minimal added congestion and with the minimal deployment cost of a software patch.

Extensions

My original plan was to replicate figures 5, 6 and 7, and figure 2 if I had time, but I realised that replicating figure 6 was pointless for the reasons discussed above. The result shown in figure 2 has already been validated by other students [3], so I wanted to try something new. I decided instead to scrape the web to see how HTTP response sizes have grown over the years, and I was glad to see some meaningful results. I would be interested to see figure 2 recreated using samples from this distribution, or with the new web objects I obtained from a recent Google search, but alas I am out of time and am only a team of one.

Platform

I chose mininet as my platform to ensure that the entire setup was highly reproducible, given that it runs in a simulated and (mostly) deterministic environment. I chose Ubuntu as it is a solid Linux distribution that I am familiar with, and I selected the smallest Google Cloud instance as it was cheap and I did not need additional compute power. Google Cloud was very easy to set up and use.

The greatest source of variability in the results is mininet and operating system scheduling, but averaging multiple runs should filter this out. A higher number of averaged experiment runs will increase reproducibility but will also take longer; in the end, I decided a minute per parameter choice was a suitable tradeoff. I have run my script on multiple clean instances and achieved the same results each time.

Replicate My Results

Note: the following section assumes that you are already familiar with Google Cloud and know how to set up, connect to, and manage an instance. If you do not know how to do this, please see https://cloud.google.com/compute/docs/quickstart-linux. If you are still stuck, feel free to contact me.

Google Instance

Set up an instance on Google Cloud with the following configuration:

n1-standard-1 (1 vCPU, 3.75 GB memory)
Ubuntu 17.04 (GNU/Linux 4.10.0-19-generic x86_64)
Allow HTTP/HTTPS traffic

Running the experiment

SSH into the instance and run the following command:

git clone https://github.com/ShadovvMoon/CS244.git && cd CS244 && chmod +x ./run.sh && sudo ./run.sh

The resulting figures will be written to the results directory. Using the standard instance, the run should take approximately an hour. If you encounter any issues, please contact me.

To copy the results off the instance you can use the following command (populate the various fields with your own instance settings); it should take care of public keys and so on.

gcloud compute copy-files --project "cs244-167318" --zone "us-central1-c" <your user>@cs244:~/CS244/results ~/Desktop/results

Sources

[1] Nandita Dukkipati, Tiziana Refice, Yuchung Cheng, Jerry Chu, Tom Herbert, Amit Agarwal, Arvind Jain, and Natalia Sutin. “An argument for increasing TCP’s initial congestion window.” Computer Communication Review 40, no. 3 (2010): 26-33.

[2]  Sajal. “Tuning Initcwnd for Optimum Performance.” CDN Planet. October 11, 2011. Accessed May 29, 2015. http://www.cdnplanet.com/blog/tune-tcp-initcwnd-for-optimum-performance/.

[3] Daniel Chiu and James Hong. “CS 244 ’15: AN ARGUMENT FOR INCREASING TCP’S INITIAL CONGESTION WINDOW.” REPRODUCING NETWORK RESEARCH. May 29, 2015.

One response to “CS244 ’17: An Argument For Increasing TCP’S Initial Congestion Window”

  1. Reproducibility Score 5/5 – Code ran to completion and results matched very well with the blog post.

    Most figures we got closely matched what was published in the blog post.

    There was, however, a small difference in one of the plots: in the figure of average response latency for Web search, we see no improvement at 256 Kbps bandwidth when repeating the experiment.

    Our experiment reproduced figure: https://github.com/jervisfm/tcp-init-cwnd/blob/70d5990ec0d1f3f1fa43c135c87755b9d5a2b35d/results/Figure5_BW.png

    Blog post figure: https://reproducingnetworkresearch.files.wordpress.com/2017/06/figure5_bw2.png

    We asked the author what might be contributing factors here, and they mentioned potential issues with cloud CPU scheduling, interference from a noisy neighbor running next to us, or insufficient experiment duration for the low-bandwidth case.
