Increasing TCP’s Initial Congestion Window


Team: Joseph Marrama and Jack Dubie.

Key Result: Increasing TCP’s initial congestion window can significantly improve the completion times of typical TCP flows on the Web.

Sources:

[1] Nandita Dukkipati, Tiziana Refice, Yuchung Cheng, Jerry Chu, Natalia Sutin, Amit Agarwal, Tom Herbert, and Arvind Jain. An Argument for Increasing TCP's Initial Congestion Window. ACM SIGCOMM Computer Communication Review, 2010.

[2] S. Ramachandran and A. Jain. Web page stats: size and number of resources. 2010.

[3] Thomas Habets. Optimizing Slow Start. 2011.

[4] Nandita Dukkipati and Nick McKeown. Why Flow-Completion Time is the Right Metric for Congestion Control. ACM SIGCOMM Computer Communication Review, Volume 36, Issue 1, January 2006.

Contacts: Joseph Marrama (jmarrama@stanford.edu), Jack Dubie (jdubie@stanford.edu)

Introduction:

TCP flow completion time has recently become an active area of research, as networks have increased significantly in capacity but typical users of the web haven’t seen a corresponding increase in performance. Most users judge performance primarily by TCP flow completion time (i.e., how long it takes for web pages to load), making it a very important metric for judging the quality of service provided by the internet. The majority of traffic that internet users produce consists of small, short-lived TCP flows; the average TCP flow size is around 386 KB as of 2010 [2]. Due to their small size, a significant portion of flows never leave the slow start stage and thus never reach their full-capacity sending rates. This unnecessarily prolongs the lifetime of TCP flows, decreasing the quality of service perceived by typical users browsing the internet.

In our project, we reproduce results from the paper “An Argument for Increasing TCP’s Initial Congestion Window,” which addresses this problem [1]. The main idea of the paper is very simple: increasing the initial rate at which hosts send data should improve the average completion time of TCP flows on the Web. Because most flows are short-lived, increasing the initial congestion window should eliminate a large fraction of the round trips that short-lived flows require to complete. The paper proposes increasing the initial congestion window (initcwnd) from the (previous) Linux default of 3 segments (~4KB) to 10 segments (~15KB), and presents experimental data gathered at frontend Google data centers to support this proposed change.
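To make that intuition concrete, here is a small back-of-envelope sketch (our own, not from the paper) that counts slow-start round trips under the usual simplifying assumptions of no losses, no delayed ACKs, and a window that doubles every RTT:

```python
def slow_start_round_trips(segments, initcwnd):
    """Round trips needed to deliver `segments` full-sized segments,
    assuming the congestion window doubles every RTT (no losses,
    no delayed ACKs)."""
    cwnd, sent, rtts = initcwnd, 0, 0
    while sent < segments:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return rtts

# A ~42 KB response (28 segments, the size we use later in our experiments):
print(slow_start_round_trips(28, 3))   # 4 round trips of data
print(slow_start_round_trips(28, 10))  # 2 round trips of data
```

On a 70ms RTT path, those two saved round trips alone are roughly 140ms of completion time, before connection setup and the request itself are even counted.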

The figure from the paper (henceforth referred to as the initcwnd paper) that we attempt to replicate is shown as Figure 2 below. In this figure, the authors plot the average improvement in flow completion time from increasing the initcwnd from 3 to 10 segments as a function of various aspects of clients’ connections. The top chart presents the average improvement bucketed by clients’ round trip times (RTTs), the middle bucketed by clients’ bandwidth, and the bottom bucketed by clients’ bandwidth-delay product (BDP). Reproducing this figure will help us gain insight into which types of connections benefit most from this change.

Methods:

We used Mininet to construct a very simple network of a client and a server node connected by a single switch. We control the parameters of the network (RTT, bandwidth, etc.) by setting the parameters of the link connecting the client to the switch. Unless otherwise stated, we use the average RTT and bandwidth of clients of Google's data centers as reported in the initcwnd paper (70ms and 1.2Mbps, respectively) [1].
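The topology is small enough to show in full. The sketch below uses Mininet's Python API (Topo and TCLink); it is illustrative rather than a verbatim copy of our script, and the defaults match the numbers above:

```python
from mininet.net import Mininet
from mininet.topo import Topo
from mininet.link import TCLink

class SingleSwitchTopo(Topo):
    """client --- s1 --- server, with the RTT and bandwidth constraints
    placed on the client-side link."""
    def build(self, rtt_ms=70, bw_mbps=1.2):
        client = self.addHost('client')
        server = self.addHost('server')
        switch = self.addSwitch('s1')
        # Half the RTT as one-way delay on the client link gives
        # approximately rtt_ms of round-trip time end to end.
        self.addLink(client, switch, cls=TCLink,
                     bw=bw_mbps, delay='%dms' % (rtt_ms // 2))
        self.addLink(server, switch, cls=TCLink)

net = Mininet(topo=SingleSwitchTopo(), link=TCLink)
net.start()
net.pingAll()   # quick connectivity (and rough RTT) check
```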

On the server node, we emulate a front-end Google server using a modified version of the Python “SimpleHTTPServer” class that serves static content over HTTP. We made a simple modification enabling it to generate and send arbitrary-length files on the fly. We use the ‘ip route’ tool to change the initial congestion window that the server uses, as described in [3]. On the client side, we use the Unix utility ‘wget’ to download files from the server and ‘time’ to measure how long each download takes.
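A minimal sketch of the server side, assuming Python 2 (which provides SimpleHTTPServer): a handler that interprets the request path as a byte count and generates that much content on the fly. Our actual handler differs in its details, but the idea is the same.

```python
from BaseHTTPServer import HTTPServer
from SimpleHTTPServer import SimpleHTTPRequestHandler

class GeneratedContentHandler(SimpleHTTPRequestHandler):
    """GET /<nbytes> returns nbytes of generated payload."""
    def do_GET(self):
        nbytes = int(self.path.strip('/'))
        self.send_response(200)
        self.send_header('Content-Type', 'application/octet-stream')
        self.send_header('Content-Length', str(nbytes))
        self.end_headers()
        self.wfile.write('x' * nbytes)

if __name__ == '__main__':
    HTTPServer(('', 8000), GeneratedContentHandler).serve_forever()
```

The initcwnd itself is changed by rewriting the server’s route as described in [3]: copy the relevant line from `ip route show` and re-issue it with `ip route change ... initcwnd 10`.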

Unfortunately, modifying the server’s initial congestion window via ip route doesn’t work on any recent stock Ubuntu kernel. We believe this functionality broke sometime around Linux v2.6.39. Fortunately, more recent kernels (v3.3 and higher) have fixed the issue. We decided to use Linux v3.3 because it is compatible with the most recent version of Open vSwitch (a dependency of Mininet). To get Open vSwitch installed, we also had to apply a minor patch to Mininet to pull in the right dependencies.

Our methodology when running experiments is as follows. We first set up a network with the desired parameters and verify the RTT and bandwidth (described below). We then start the Python server on the server node and increase the client’s default receive window so that it can accommodate the server’s maximum initial congestion window. Finally, we calculate the average time it takes the client to download a fixed-length file for different initial congestion windows. We use a default file size of 28 segments (~42 KB) for all experiments: Google search result pages are generally around 90 KB and the Google homepage is around 12 KB, so we chose a size between the two.
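A condensed sketch of that measurement loop, assuming the topology and server above. Here `set_initcwnd` and `set_initrwnd` are hypothetical helpers standing in for the ip route manipulations described earlier, and we time the download from Python for brevity rather than with the ‘time’ utility:

```python
import time

SEGMENT_BYTES = 1500     # one full-sized segment
FILE_SEGMENTS = 28       # ~42 KB, between the Google homepage and a results page

def measure_fct(client, server_ip, nbytes):
    """Flow completion time of a single download, in seconds."""
    start = time.time()
    client.cmd('wget -q -O /dev/null http://%s:8000/%d' % (server_ip, nbytes))
    return time.time() - start

def average_fct(client, server, initcwnd, trials=10):
    set_initcwnd(server, initcwnd)   # hypothetical: ip route change ... initcwnd N
    set_initrwnd(client, initcwnd)   # hypothetical: enlarge the client's receive window
    fcts = [measure_fct(client, server.IP(), FILE_SEGMENTS * SEGMENT_BYTES)
            for _ in range(trials)]
    return sum(fcts) / len(fcts)

# improvement = average_fct(client, server, 3) - average_fct(client, server, 10)
```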

Validation:

Upon starting a new Mininet network, we verify the RTT using ping and the bandwidth using ethstats and iperf. Validating the server’s initial congestion window turned out to be a much harder task. We employ a combination of heuristics and tcpdump to verify that ip route is behaving correctly. First, we can easily tell whether ip route is actually working by measuring how long a flow takes to complete in a high-RTT network under different initial congestion windows: if a small flow takes at least one RTT longer to complete with a low initcwnd than with an extremely high one, then the initcwnd must have actually been changed. Following this approach, we replicated figure 2 from [1] as a sanity check to make sure that changing the initcwnd via ip route consistently works.

Flow completion time as a function of the initial congestion window size. We increased the RTT to 140ms to exaggerate differences in FCT.

To further verify that ip route was working, we inspected the actual packets sent in our experiments using tcpdump and Wireshark. This let us see without ambiguity what the server’s initcwnd was and confirm it matched the value we set; for all trials, the initcwnd was set correctly. We would have liked to automate this process, but we couldn’t find a reliable way to do so: the ‘tcp_probe’ module often produced results inconsistent with what we saw in Wireshark, and most of the time it didn’t detect short flows at all. While it was tedious, we were able to verify the initcwnd in our final results with Wireshark.
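What we could automate was the capture itself. A sketch along these lines (the tcpdump flags are standard; the interface name and port assume Mininet’s default naming and our test server above) records the server side of one download so the initial flight can be counted offline in Wireshark:

```python
import time

def capture_one_download(client, server, nbytes, pcap='/tmp/flight.pcap'):
    """Capture the server's packets during a single download for
    offline inspection of the initial flight of data segments."""
    iface = '%s-eth0' % server.name      # Mininet's default interface name
    server.cmd('tcpdump -i %s -s 96 -w %s tcp port 8000 &' % (iface, pcap))
    time.sleep(1)                        # give tcpdump a moment to start
    client.cmd('wget -q -O /dev/null http://%s:8000/%d' % (server.IP(), nbytes))
    server.cmd('pkill -f tcpdump')
    return pcap
```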

Results:

All graphs below show the absolute and percentage improvement in flow completion time gained by increasing the initcwnd from 3 to 10 segments. Note that the absolute improvement is plotted on a logarithmic scale.

This is the figure from [1] which we seek to reproduce.

Results from our RTT experiment

Results from our bandwidth experiment

Results from our BDP experiment

Our experiment varying the RTT of the network closely matches the corresponding result in the initcwnd paper. As the RTT increases, the absolute improvement in FCT grows linearly. Intuitively, this is because increasing the initcwnd from 3 to 10 eliminates a roughly constant number of round trips. There is no improvement in our experiment in the 20msec RTT network because those flows quickly become bandwidth limited. Increasing the bandwidth from the default of 1.2Mbps to 100Mbps and repeating the experiment shows an improvement of about 20% at a 20msec RTT.
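A quick calculation (ours, assuming 1500-byte segments) shows why: at 1.2Mbps the link needs longer than one 20msec RTT just to serialize even the old 3-segment initial window, so the link rather than the window is the bottleneck; at 100Mbps that is no longer true.

```python
SEGMENT_BITS = 1500 * 8

def serialization_ms(segments, bw_mbps):
    """Milliseconds needed to push `segments` full-sized segments onto the link."""
    return segments * SEGMENT_BITS / (bw_mbps * 1000.0)

print(serialization_ms(3, 1.2))     # ~30 ms  > 20 ms RTT: already link-limited
print(serialization_ms(10, 1.2))    # ~100 ms: a larger initcwnd cannot help
print(serialization_ms(10, 100.0))  # ~1.2 ms << 20 ms RTT: the window matters again
```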

This contrast illustrates a key facet of the results in the initcwnd paper. In their graphs plotting FCT improvement as a function of one aspect of clients’ connections, they *are not* controlling for the other aspects. For example, their graph of RTT vs. FCT shows a modest improvement even for 20msec RTT connections because those connections generally have *more bandwidth* as well, and thus aren’t limited by it.

The results from our experiment varying the bandwidth are very different from those of the initcwnd paper, for precisely this reason. In our experiment we kept the RTT constant, whereas in the paper the connections with low bandwidth likely had high RTTs as well. Intuitively, increasing the initcwnd in a very low bandwidth, average-RTT network *shouldn’t* change the FCT, because the flow is immediately limited by bandwidth. The initcwnd paper itself acknowledges that low-bandwidth clients generally have high RTTs, which explains the large improvement in their graph [1]. Unfortunately, the paper doesn’t disclose the average RTT of the connections in each bandwidth bucket (although it does for another experiment on a slower data center). It would be trivial to assign a custom latency to each trial such that our graph matches theirs, but that would be recreating the results of the paper in a disingenuous way, so we kept our original graph instead.

Finally, our experiment varying the BDP matches the paper’s graph reasonably well. Once again, the only trials that don’t match well are the lower buckets of 1000 and 5000 BDP. This is again due to low-BDP flows being bandwidth-limited, so increasing the initcwnd does next to nothing. Curiously, the initcwnd paper’s BDP graph *does* report a speedup on the lower end of the BDP spectrum, despite the fact that these flows also have low RTTs. This is likely because an increased initcwnd promotes faster loss recovery. As the paper points out, a larger initcwnd increases the chance that a lost packet will lead to three duplicate ACKs because more packets are in flight at once, and thus increases the chance that the connection recovers the lost packet via Fast Retransmit rather than waiting for a timeout.

While this is very plausible, it’s very hard to test in Mininet with any sort of statistical significance. When we add in a high random drop rate (such that average flow completion time is still reasonable) and increase the number of trials substantially, we still can’t produce a dependable improvement. The initcwnd paper has the advantage of having over 10 billion timed TCP connections at their disposal, something we can’t reproduce.
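For reference, injecting random loss only requires the `loss` parameter on Mininet's TCLink (a percentage handed to netem); the 2% shown below is illustrative, not the exact rate we used.

```python
from mininet.topo import Topo
from mininet.link import TCLink

class LossySingleSwitchTopo(Topo):
    """Same client -- switch -- server topology, with random loss added
    on the client-side link."""
    def build(self, rtt_ms=70, bw_mbps=1.2, loss_pct=2):
        client = self.addHost('client')
        server = self.addHost('server')
        switch = self.addSwitch('s1')
        # `loss` is netem's random drop probability, in percent.
        self.addLink(client, switch, cls=TCLink, bw=bw_mbps,
                     delay='%dms' % (rtt_ms // 2), loss=loss_pct)
        self.addLink(server, switch, cls=TCLink)
```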

Lessons learned:

Perhaps our largest takeaway from this project is an increased ability to debug problems all the way down to the kernel. We spent the vast majority of our time trying to figure out why setting the initial congestion window via ip route wasn’t working. We tried everything we could before diving into the kernel, and even then we only managed to fix the issue by a stroke of luck. It was extremely fortuitous that setting the initcwnd via ip route was fixed in Linux v3.3; had that not been the case, it is unclear whether we would have had the time to create a kernel patch ourselves. Installing Mininet with the new kernel also turned out to be unexpectedly hard, but it didn’t require nearly as much time or effort to fix.

This project taught us not to underestimate the difficulty of research and of getting experiments to work. At the beginning of the project, we set out to reproduce research from two papers: the initcwnd paper and a paper on ‘RCP’, a new transport protocol similar to TCP [4]. Implementing RCP would have required significant modifications to both the host’s networking stack and the switch’s stack. In retrospect, this seems like a near-impossible goal for a four-week research project. If we hadn’t run into the problem of setting the initcwnd, we would have had three weeks to do it, since we had the majority of our code in place to reproduce results from the initcwnd paper by the end of the first week. Even then, we likely wouldn’t have had the time to implement RCP, and considering the unexpected problems we did run into, RCP surely would have posed more unpredictable difficulties. Generally, a research project without a tested and proven way of setting parameters and conducting experiments *always* takes longer than expected. It’s largely a matter of chance whether things work as planned; unfortunately for us, they did not. Next time, we will make a more realistic estimate of what we can accomplish in a given amount of time.

Instructions to replicate this experiment:

See our GitHub page at https://github.com/jdubie/mininet-tcp-initial-cwnd for better-formatted instructions.

    1. Spin up our EC2 image in the AWS Management Console. AMI: ‘ami-56ba1a3f’
       * Note: the image is only available in the US East (Virginia) region, and you need to use a c1.medium instance.
    2. ssh into the new instance
    3. Clone our repo
      `git clone https://github.com/jdubie/mininet-tcp-initial-cwnd.git`
      `cd mininet-tcp-initial-cwnd`
    4. Install ethstats – `sudo apt-get install ethstats`
    5. Start Open vSwitch – `sudo ./start_openswitch.sh`
    6. Reproduce figure 1 – sanity check – `sudo ./icwnd_vs_fct.py`
    7. Reproduce figures 3, 4, 5 – `sudo ./bw_improvement.py`

* Results are printed to stdout and saved to `results/mininet_yours/latencies.pdf` for the icwnd_vs_fct.py experiment and to `results/mininet_yours/figureX.pdf` for the others, where X is the number of the experiment.

* scp the `results/mininet_yours/{latencies,figureX}.pdf` files to your local computer to view the bar graphs.

2 responses to “Increasing TCP’s Initial Congestion Window”

  1. Our team (Jitendra Nath Pandey and Pattabiraman (Raman) Subramanian) reviewed the report and followed the instructions to reproduce the results. The graphs obtained for BDP and BW match the report, but the RTT graph does not. Some more explanation of the shape of the graphs would have been helpful, for example why the percentage improvement is almost constant for higher RTTs in Figure 1.
    The report is well written and explains the project effectively.
