Sam Crognale (crognale at stanford dot edu)
Judson Wilson (judsonw at stanford dot edu)
This paper begins by making two observations about ISP networks today. First, networks are deliberately over-provisioned to account for potential link failures, and second, the TCP congestion control mechanism that drives these networks is designed to ramp up cautiously. Even when links are lightly utilized, TCP flows must increase their congestion windows slowly. A large fraction of Internet flows last just long enough that slow start occupies a significant portion of the flow's lifetime, spreading packets across many more round-trip times than if the flow could immediately consume all available link capacity. This translates to a perceptible performance difference for the end user, such as delayed output when browsing the web.
The problem is significant enough to have become an active area of research. Other proposed solutions include increasing TCP's initial congestion window [2,3], TCP Quick-Start [5], QoS (Quality of Service), and RCP (Rate Control Protocol). Each of these has unique drawbacks, such as performance limits or required design changes to routing equipment. Thus, it is worthwhile to explore whether a protocol can be designed that improves network throughput and flow completion time while remaining sufficiently "safe" to prevent congestion collapse, all on existing hardware.
This paper proposes an approach called Recursively Cautious Congestion Control (RC3) that uses available network capacity more aggressively during TCP slow start, and thus decreases flow completion times. The approach uses two control loops and priority queueing. At the sending host, packets at the front of the transmission buffer are placed in a normal-priority queue, where they are transmitted by a standard TCP control loop. Packets at the end of the buffer are fed to lower-priority queues, with exponentially increasing numbers of packets placed in queues of descending priority.
Figure 1: Sources of packets from the transmit buffer for each control-loop type and queue priority.
When the link is not saturated by normal-priority packets from the TCP control loop (e.g., during TCP slow start), the lower-priority packets are transmitted by a simple control loop that does not retry on timeout. The two control loops process packets from the beginning and the end of the buffer until they meet. Because packets are not transmitted in order, the TCP selective ACK (SACK) feature must be enabled. Low-priority packets that are dropped are retransmitted by the normal-priority TCP control loop once it advances to the corresponding un-ACKed packet in the buffer.
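The exponential assignment of trailing packets to descending priority levels can be sketched as follows. The base chunk size of 40 packets, the growth factor of 10, and the number of levels are our reading of the paper's design and are illustrative, not a statement of the exact constants in the RC3 implementation:

```python
def rc3_priority_levels(num_packets, base=40, growth=10, levels=4):
    """Assign packets, counted from the END of the transmit buffer, to
    low-priority levels 1..levels. Level i holds base * growth**(i-1)
    packets (40, 400, 4000, ... under these illustrative defaults),
    until the buffer is exhausted."""
    assignment = []  # list of (level, packet_count) pairs
    remaining = num_packets
    for level in range(1, levels + 1):
        chunk = min(remaining, base * growth ** (level - 1))
        if chunk == 0:
            break
        assignment.append((level, chunk))
        remaining -= chunk
    return assignment

# a 1000-packet flow fills level 1, fills level 2, and partially fills level 3
print(rc3_priority_levels(1000))  # [(1, 40), (2, 400), (3, 560)]
```

Because each level is an order of magnitude larger than the last, most of a large flow travels at the lowest priorities, which is what keeps the aggressive transmission "cautious" with respect to competing traffic.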
Packet queue priority is maintained across the network using the IP ToS bits, in what the authors call a "perverse notion of quality-of-service, called WQoS, in which we assume ISPs are willing to offer worse service if certain ToS bits are set in the packet header." The authors argue that congestion collapse is avoided because packets compete fairly at their assigned priority level, and the highest priority level uses TCP congestion control.
The authors present a wide array of results, with a focus on Flow Completion Time (FCT). They begin by presenting a simple performance model for relative improvement in FCT over TCP. The model "ignores issues of queuing, packet drops, or the interaction between flows." It predicts a reduction in FCT for flows larger than the initial congestion window, with the peak percentage reduction for flows that can complete in one round-trip time using the available bandwidth "A".
Figure 2: Performance gains as predicted by a simple model for RC3 and an increased initial congestion window.
Generally, the authors make the following claim: “In common wide-area scenarios, RC3 results in over 40% reduction in average flow completion times, with strongest improvements – more than 70% reduction in flow completion time – seen in medium to large sized (100KB – 3MB) flows.”
First, they demonstrate performance over a wide range of tests performed mainly in an NS-3 simulation of Internet2, configured as 10 routers, each connected to 10 hosts, with 30% average link utilization. They begin with a baseline comparison of RC3 to TCP New Reno, which produced results with "gains of 40-75%." They follow up with many more tests, covering various topologies, link bandwidths, and link loss rates, different numbers of low-priority queues, and operation with and without pacing. They also compare performance to other solutions: increasing the initial congestion window size, traditional quality of service (QoS), and RCP. RC3 generally outperforms the other techniques, or, in the case of RCP, achieves similar performance without requiring modification to routers beyond enabling priority queueing (which is already widely available).
To further show that the results had real-world validity, the authors also performed similar FCT measurements on real hardware, using an RC3 implementation in Linux, and compared them to NS-3 simulations. The RC3 protocol was installed on two servers, each with two 10Gbps NICs connected by a 10Gbps Arista datacenter switch that supports priority queues. The netem program was used to emulate 10ms link latency. Tests were performed at 10Gbps and 1Gbps. They demonstrate that their NS-3 simulations approximate the real-world flow completion times for TCP and RC3, although the Linux TCP performance is lower than expected in some cases due to delayed-ACK behavior. Generally speaking, the reductions in FCT for different flow sizes are similar to the results of the other simulations described above.
Subset and Motivation
We aim to reproduce most of the test results of section 5.2 of the paper, obtained using the real servers and switch, as described above. We will use Mininet instead of real networking hardware. We will not reproduce tests that require hardware or driver features, such as segmentation and receive offload. We aim to verify the results of Figure 15, comparing the flow completion times of several flow sizes using regular TCP (both New Reno and CUBIC) and RC3.
The paper also describes a simple procedure to test the priority queueing functionality at the hosts and switch, with graphical results in Figures 16 and 17. Priority queueing in Linux was tested by connecting two hosts via a direct link and observing the behavior of low- versus high-priority traffic. Priority queueing in the switch was tested similarly, but with two sending hosts sending flows of different priority to the same receiving host through a switch. We will duplicate these test procedures on our Mininet test bench as well.
The protocol proposed in this paper is new and quite interesting, so the paper itself is a prime candidate for investigation, especially given the strength of its claims. Unfortunately, many of the other tests in the paper are either underspecified or too demanding to run on a single host in Mininet without significant compromise. The Internet2 simulations used 100 hosts and several 10Gbps links, which we feel would be challenging, if not impossible, to replicate in Mininet.
The tests of the Linux implementation on real-world hardware appear much more realistic for the Mininet environment. These tests also exhibit many of the same behavioral characteristics as the larger simulations, and should give a good indication of whether the protocol provides the benefits it claims.
First, we were able to show that our implementation of priority queueing matched the expected behavior for both Linux end-hosts and switches:
The above figures are intended to reproduce Figures 16 and 17, respectively. Their correctness ensures that flows marked high priority take precedence over low-priority flows, at both the hosts and the switch, which is critical for the proper functioning of RC3.
As for the main results, we reproduce Figures 15(a) and 15(b) below with the original data, adding two additional columns for our Mininet implementation. It should be noted that the "Simulated", "Real", and "Mininet" flow-completion-times differ in how timestamps are recorded in the message sequence. The first author of the paper told us that the "Simulated" and "Real" methods differ, and that offsets on the order of 1/2 RTT were applied to make the results comparable. In simulation it is acceptable to use timestamps taken at both hosts, but on real machines it is difficult to compare timestamps from different hosts, so single-observer methods are more appropriate. We chose to measure FCT as the time elapsed between a user-space program writing the flow to the socket and the receipt of a 1-byte acknowledgement token from the other host (also sent by a user-space program). We then added 1/2 RTT, computed from the specified link delays, to produce comparable results. We also ran each test 10 times instead of 100, because the results seemed stable and our time scaling would otherwise make the test duration very long.
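A minimal sketch of this single-observer measurement follows, with a local socket pair standing in for the remote host; the SCALE and ONE_WAY_DELAY constants are illustrative, and our actual fcttest program differs in detail:

```python
import socket
import threading
import time

SCALE = 100          # time-dilation factor used in our Mininet tests
ONE_WAY_DELAY = 0.0  # seconds; half the emulated RTT, added for comparability
                     # (0.0 here because this toy run has no emulated link)

def measure_fct(data, sock):
    """Single-observer FCT: time from the first write until a 1-byte
    acknowledgement token returns, plus the one-way delay, rescaled
    back down by the time-dilation factor."""
    start = time.monotonic()
    sock.sendall(data)
    token = sock.recv(1)  # receiver sends one byte after the full flow arrives
    assert token
    return (time.monotonic() - start + ONE_WAY_DELAY) / SCALE

# toy receiver over a local socket pair (stand-in for the remote peer)
a, b = socket.socketpair()
flow = b"x" * 4096

def receiver():
    got = 0
    while got < len(flow):
        got += len(b.recv(65536))
    b.sendall(b"\x01")

t = threading.Thread(target=receiver)
t.start()
fct = measure_fct(flow, a)
t.join()
print(f"{fct:.6f}")
```

The single round trip of the acknowledgement token is why the 1/2 RTT correction makes this comparable to the paper's two-observer simulated timestamps.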
In Figure 15(a) we attempted to simulate the results for 10Gbps with 20ms RTT using 100Mbps links and a 2000ms RTT, effectively scaling time up by a factor of 100. Our flow completion times were then scaled inversely to compensate. The graph shows results with TCP CUBIC, but the results for TCP Reno are practically identical.
Figure 15(a) – “Simulated” and “Real” flow-completion-times for 10Gbps line rate with 20ms RTT (copied from the paper), and our “Mininet” results at 100Mbps and 2000ms RTT.
Our results in Figure 15(a) closely match the simulation results, but not the "real" results. This is to be expected. As stated in the paper, "The baseline TCP implementation in Linux performs worse than in simulation because of delayed ACK behavior in Linux: when more than two segments are ACKed together, it still only generates an increase in congestion window proportional to two packets being ACKed." This coalescing of ACKs should not occur at our enlarged time scales. A more direct comparison is Figure 15(b), below. Again, results for TCP CUBIC are shown, but results for TCP Reno are practically identical.
Figure 15(b) – “Simulated” and “Real” flow-completion-times for 1Gbps line rate with 20ms RTT (copied from the paper), and our “Mininet” results at 100Mbps and 200ms RTT.
In this test, the authors reduced the line rate by a factor of 10, to 1Gbps, using the Linux hierarchical token bucket queueing discipline. We kept our line rate at 100Mbps but decreased the delay by a factor of 10 from our previous test, which is closer to the configuration in the paper. Also, decreasing the line rate while keeping the delay constant caused our "Regular TCP" FCT results at the 140,000-byte flow length to grow severalfold. We are unsure why, but note that 2-second RTTs are very large, and features of the TCP implementation may trigger under these conditions.
Our Mininet setup performed very similarly to the original simulated and real experiments for both TCP and RC3, tracking the simulated results more closely. We believe this is because our enlarged time scales reduce the effects of various latencies in the system. Most importantly, our experiment saw similar reductions in average FCT for large flows as reported in the paper. Thus, at least for this simple two-host experiment, we found that we could successfully reproduce the original results.
Our experiments show, to our satisfaction, that the protocol can decrease flow completion times in networks that are not congested, especially those with large RTTs where slow start can severely limit the number of packets "in the network" for extended periods. The idea itself is simple and convincing in its own right, and the results further boost our confidence. It would be interesting to explore how issues of fairness in the low-priority queues play out in the real world, along with other less obvious properties of the protocol. We believe others should verify performance in the presence of multiple competing flows that are heterogeneous in rates and RTTs.
We faced the following major challenges in implementing our tests:
Kernel Modifications: The authors of the paper provided us with a snapshot of their RC3 code against a 3.2 kernel; the parent Linux distribution is unknown. Our Mininet virtual machine uses Ubuntu 14.04.2 LTS with a 3.13 kernel, so we had to identify the individual changes and port them to the kernel we intended to use. We also had to make our patched kernel run on Amazon EC2. The Amazon custom-kernel documentation we found targets VMs using paravirtual (PV) virtualization, but our existing VM is a hardware virtual machine (HVM); after some reasoning we concluded that we could simply patch the kernel as if it were a normal machine.
Queueing: Our Mininet-based test of RC3 needs rate limiting, added delay, and priority queueing, a combination that does not commonly appear together but whose pieces are all typically implemented with the Linux traffic control program, tc. Conceptually, we knew we needed the properties of the hierarchical token bucket (htb), the prio qdisc, and netem. After finding the Mininet Bufferbloat example [6,7], we determined that a prio qdisc (queueing discipline) sandwiched between an htb root qdisc and netem qdiscs at the leaves of each priority class functions as desired. We also added custom tc filters to match the DSCP field of packets and assign them to the correct priority class, because the default mapping was quite different.
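The exact tc invocations are in our repository; this sketch prints commands of the shape we believe implements the hierarchy (interface name, handles, band count, delay, and DSCP masks are illustrative placeholders, not our exact values):

```python
# htb root (rate limit) -> prio (priority bands) -> netem per band (delay)
IFACE, RATE, DELAY, BANDS = "h1-eth0", "100mbit", "1000ms", 3

cmds = [
    f"tc qdisc add dev {IFACE} root handle 1: htb default 10",
    f"tc class add dev {IFACE} parent 1: classid 1:10 htb rate {RATE}",
    f"tc qdisc add dev {IFACE} parent 1:10 handle 2: prio bands {BANDS}",
]
for band in range(BANDS):
    # one netem leaf per priority band adds the emulated one-way link delay
    cmds.append(
        f"tc qdisc add dev {IFACE} parent 2:{band + 1} "
        f"handle {30 + band}: netem delay {DELAY}"
    )
    # filter: steer packets into bands by the ToS/DSCP field
    # (tos values and mask here are illustrative)
    cmds.append(
        f"tc filter add dev {IFACE} parent 2: protocol ip u32 "
        f"match ip tos {band << 2} 0xfc flowid 2:{band + 1}"
    )
print("\n".join(cmds))
```

Putting htb at the root means the rate limit is shared across all priority bands, so high-priority traffic can starve low-priority traffic, which is exactly the WQoS behavior RC3 depends on.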
Iperf: Iperf (v2.0.5) also proved an obstacle, as it did not always properly set the ToS bits, nor did it report consistent measurements between the client and server sides. As a solution, we instead used iperf3, a separate project that reimplements Iperf's functionality from scratch with a more streamlined codebase, and found that it produced more consistent results. Because iperf3 outputs JSON rather than the traditional CSV format, we also had to parse JSON, which Python's standard library makes easy.
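For example, the receiver-side throughput can be pulled out of iperf3's JSON report with a few lines; the sample document below is abbreviated to just the fields we read:

```python
import json

# minimal iperf3-style output (structure abbreviated from a real report)
sample = json.loads("""
{"end": {"sum_received": {"bytes": 1250000, "seconds": 0.1,
                          "bits_per_second": 100000000.0}}}
""")

def received_mbps(report):
    # iperf3 places the receiver-side average under end.sum_received
    return report["end"]["sum_received"]["bits_per_second"] / 1e6

print(received_mbps(sample))  # 100.0
```

In our scripts the report comes from running iperf3 with the -J flag and reading its stdout, rather than from a literal string as here.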
Measuring Flow Completion Time: The RC3 kernel implements RC3 as a socket option that must be enabled per socket; any test requiring RC3 must enable the option after the socket is created. To measure flow completion times, we wrote a simple program called fcttest, which measures FCT from user space; RC3 mode is enabled via a command-line argument. We applied adjustments to match the results of Figure 15, as discussed in the results section.
Flow Rates: As discussed previously, the tests we wished to emulate ran on 10Gbps hardware. We chose to scale down the rate and scale up the link delay, effectively scaling the time domain. This allowed us to run the tests and capture the major behavioral characteristics, although we could not observe phenomena that do not scale, such as kernel behaviors at the system and protocol levels.
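This scaling works because it preserves the bandwidth-delay product of the original setup, so the congestion-window dynamics per RTT are unchanged; measured FCTs are then divided by the dilation factor:

```python
def bdp_bits(rate_bps, rtt_s):
    # bandwidth-delay product: bits in flight on a fully utilized pipe
    return rate_bps * rtt_s

original = bdp_bits(10e9, 0.020)  # paper's setup: 10 Gbps, 20 ms RTT
scaled = bdp_bits(100e6, 2.0)     # ours: 100 Mbps, 2000 ms RTT (100x dilation)
assert original == scaled == 200e6  # identical 200 Mbit BDP

# a raw FCT measured in the dilated experiment maps back to paper time scale
def undilate(measured_fct_s, factor=100):
    return measured_fct_s / factor

print(undilate(4.2))  # 0.042
```

Phenomena tied to absolute time, such as fixed kernel timers or delayed-ACK timeouts, do not dilate with the experiment, which is one source of the discrepancies noted above.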
We specifically chose an experiment that could be implemented in Mininet, both because it was encouraged for the project and because Mininet is a relatively straightforward platform. The RC3 authors already had existing Linux kernel code, so using a modified Linux was both correct and feasible. We chose Amazon EC2 because it could host the custom Linux kernel in a virtual machine, and reproducing the results would not require installing any software. The course-provided EC2 AMI, CS244-Spr15-Mininet (ami-cba48cfb), was used as a base since it was known to have a working Mininet/Linux stack. Iperf was used, as in the paper, and we chose version 3 because it behaved more reasonably with the Linux priority queueing discipline. Flow completion times were measured with our own custom program, fcttest, described previously.
We were faced with the choice of porting the Linux kernel changes between kernel versions, or finding a working image with the right base kernel version along with a Mininet version that would work with that kernel. We chose the former. It is not clear whether that was the correct choice: the system does get into states where Mininet cannot restart without a reboot, which could be due to issues porting the kernel patch, a bug in the RC3 implementation, or a problem with Mininet or other software in the stack.
Generally, the results are very reproducible for our chosen parameters. As mentioned before, higher line rates or longer delays tend to produce strange behavior; we conjecture these are due to system performance limits and implementation features whose behavior does not scale with our line-rate and delay adjustments. Also of note, using EC2 instances with more cores does not seem to make higher line rates usable; we are unsure whether that is a system limitation or a problem with our methods. "c3.large" instances appear to work as well as or better than any other option. We also occasionally see Mininet hang on test initialization, requiring a reboot, and erroneous TCP behavior. Clearly there are bugs in the software stack, but results tend to be either obviously wrong (or nonexistent, if the system hangs) or very nearly equal to the results presented here, with success occurring more often than failure. These issues do not appear to be a problem with the RC3 protocol itself.
Running the Experiment
- Launch an Amazon EC2 instance, using public AMI CS244-Spr15-RC3-Mininet-Experiement-v3 (ami-1b93ac2b). You will need to configure your security group to open inbound traffic on the standard SSH port, 22, and may wish to also open port 8000 to view results files using a web server interface.
- SSH into the instance using the public DNS URL and the username “ubuntu”.
- Run either the "short" test or the "full" test with "./run_short.sh" or "./run_full.sh". The difference is the number of iterations of each FCT test per protocol and flow size: "short" is recommended and performs 3 flows per test, whereas "full" performs 10 and takes considerably longer. (Terminal messages should print at most 2 minutes apart.)
- Once Mininet shuts down and displays the "Done" message, the results are ready in a subdirectory called "results-short-<timestamp>" or "results-full-<timestamp>".
- You can view the plot files, with names of format “figure-##.png”, in many ways, including:
- Starting the python web server using “python -m SimpleHTTPServer &”. You can now browse the contents of the folder via web address “http://serverip:8000”, where serverip is the ip address of your server. This requires having port 8000 open for inbound traffic on your EC2 Security Group.
- Calling "feh figure-##.png", where "figure-##.png" is the name of the image file. This uses the feh image viewer to open the image. It requires that you are connected with a terminal that supports X11 forwarding and that your SSH connection has trusted X11 forwarding enabled (ssh -Y).
- File transfer methods, such as SCP, to transfer the file to your local machine.
The source code for our experiment is available at:
Mininet test bench: https://github.com/JudsonWilson/CS244_RC3_Mininet
RC3 kernel changes: https://github.com/JudsonWilson/CS244_RC3_Kernel
Original RC3 kernel: https://github.com/NetSys/rc3-linux
[1] R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker. "Recursively Cautious Congestion Control." In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI), pp. 373-385. USENIX Association, 2014.
[2] M. Allman, S. Floyd, and C. Partridge. "Increasing TCP's Initial Window." RFC 3390.
[3] N. Dukkipati, T. Refice, Y. Cheng, J. Chu, T. Herbert, A. Agarwal, A. Jain, and N. Sutin. "An Argument for Increasing TCP's Initial Congestion Window." ACM SIGCOMM Computer Communication Review, 2010.
[4] N. Dukkipati and N. McKeown. "Why Flow Completion Time is the Right Metric for Congestion Control." ACM SIGCOMM Computer Communication Review, 2006.
[5] S. Floyd, M. Allman, A. Jain, and P. Sarolahti. "Quick-Start for TCP and IP." RFC 4782.
[6] T.-Y. Huang. "Bufferbloat." Mininet wiki. Accessed 30 May 2015. https://github.com/mininet/mininet/wiki/Bufferbloat
[7] T.-Y. Huang. cs144_bufferbloat. Bitbucket. Accessed 30 May 2015. https://bitbucket.org/huangty/cs144_bufferbloat