Why Flow-Completion Time is the Right Metric for Congestion Control (Rate Control Protocol)
Philip Zhuang – email@example.com
Victor Kaiser-Pendergrast – firstname.lastname@example.org
Network research in packet switched networks as almost unilaterally focused on optimizing throughput, for a particular definition of fairness, round trip times, and so forth. One metric that is particularly important to end users today is how quickly their flows complete. In today’s internet, many (if not most) flows are only a handful of packets, completing in just a few round trips. Unfortunately, today’s transmission rate control mechanisms are overly conservative, delaying the completion times of what should be quite short-lived flows.
RCP seeks to minimize Flow Completion Time. Today’s traffic on the internet includes many small flows, whose transfers complete before TCP even exits its initial slow start. TCP and XCP were designed to maximize a flow’s throughput in the long run, and likewise, much of network research is dedicated to optimizing network utilization, flow throughput, and fairness enforcement among competing traffic.
RCP, a new protocol, that emulates processor sharing at switches can achieve fair sharing of the network, while minimizing flow completion times, which is especially beneficial to short-lived flows. RCP shows incredible benefit to short-lived flows, whose time to completion is most influential on a typical user’s quality of experience.
TCP and XCP converge to fair-share allocations, but only after many round trip times; TCP and XCP react too slowly to complete short transfers promptly. When compared to the processor sharing scheme that TCP and XCP roughly emulate, one can see that processor sharing’s average flow duration would be significantly (one or two orders of magnitude) shorter.
Furthermore, both TCP and XCP allow some flows to unfairly utilize bottleneck links; this is a product of TCP’s propensity to favor low RTT flows, as well as its slow convergence speed (i.e. only long-lived flows converge to their proper share of the network, whereas short flows never reach their fair allocation). XCP more explicitly prioritizes longer-lived flows, as XCP by design will gradually reduce throughput of older flows when new flows arrive.
Finally, TCP by design tries to fill buffers at bottlenecks in order to maximize throughput, which also maximizes RTT, by increasing queueing delay. Flows that start when buffers are already maximized measure a quite variable RTT, increasing their timeout durations.
From these identified problem sources, one can conclude that simple modifications to existing congestion control schemes will not fix the above problems. Rather, these issues appear to be intrinsic to TCP’s and XCP’s design.
Results in the Paper
By explicitly emulating processor sharing in routers, and having a router explicitly assign an identical rate to its flows, RCP can come close to the ideal processor sharing scheme in terms of average flow completion time. This is one or two orders of magnitude better than TCP or XCP’s resulting average flow completion times. The author’s note that RCP is essentially XCP, but with a simpler rate assignment procedure.
Processor sharing, in its most fundamental form, equates to a worker servicing incoming jobs at the same rate. When applied to networks, one could imagine a network switch that services incoming flows round robin, one bit at a time. RCP seeks to emulate this by assigning rates to flows, which more or less approximates processor sharing. As shown in the paper’s results, RCP’s flow completion times are closely distributed around ideal processor sharing’s completion times.
The authors show, through network emulation of many flows differing in length, that RCP flow completion times average to roughly that of processor sharing, whereas TCP and XCP prolong flow duration to significantly longer than necessary. The original results demonstrate that RCP is quite close to the ideal processor sharing scheme, which approaches arbitrarily close to minimizing average flow completion time.
In particular, our goal is to replicate Figure (6), which plots average flow completion time for Tcp, XCP, and RCP (as well as the ideal processor sharing scheme, as a baseline):
Though not specified in the paper, from the supplied tests, it appears that this network was setup as just two clients connected through a single router. The source code of the implementation supports this claim.
Figure (6) was chosen to reproduce, because it is essentially the only result demonstrating RCP in the paper. Figure (6) uses an unspecified number of flows, each of which consists of a number of packets drawn from a pareto distribution. The example test files provided with the RCP implementation specifies , , and values for the pareto distribution, but it is not clear if these align with the results shown in Figure (6). One discrepancy to note is that the paper claims the pareto distribution used a shape factor of 1.2, whereas the test shows 1.8. There is a similar difference in the mean of the pareto distribution, where the paper specifies an expectation of 25, but the test source set expected value to 30.
We emailed the author, Nandita Dukkipati, asking for clarification regarding what values were used to produce the result in the paper, but did not receive a response. Nonetheless, our tests use the 1.2 shape factor and mean 25 packet average flow length specified in the paper, given that we are trying to reproduce the exact results in the paper, not searching for a pareto distribution of flow sizes that matches best to the original author’s results.
Our testing in ns-2 matches the original results presented in Figure (6). In particular, RCP closely approximates processor sharing, although our results for RCP do not correlate with the ideal Processor Sharing scheme as well. This is likely due to the disparity among the paper’s pareto distribution parameters, and those presented in the source. With some fiddling of the distribution arguments, odds are that the original figure could be reproduced more precisely.
Reproduction with Existing ns-2 Implementation
The given implementation of RCP was built in the network simulator ns-2. The RCP implementation’s source specified ns-2 version 2.32, but we instead chose to use the more modern version 2.35, which includes more suitable TCP emulation. Neither versions of ns-2 compile with GCC 5.3.1, nor GCC 4.4, recommended by ns-2’s README. After fixing all compilations issues with ns-2 version 2.35 (specified in the RCP’s implementation), the RCP implementation was set up in the ns-2 build directory as specified (although the README for RCP had inadequate and out of date instructions). The RCP implementation itself did not compile, so it had to be patched up as well.
After cleaning up compilation issues, we slightly modified the RCP and TCP tests provided by the authors, and wrote a quick script to build charts of the appropriate scales, with the appropriate reference lines for TCP slow start and Processor Sharing.
Note that average flow completion time for a flow of a particular size can sometimes dip below the “ideal” processor sharing scheme in both the original results and our own results. This should not be possible. However, Nandida states (in a different paper) that RCP does employ a fast-open scheme (where data is included in the first packet), which potentially removes an RTT from the flow’s total duration, which is especially impactful on shorter RTT flows. The implementation of TCP in ns-2 surely does not use fast open, so this comparison of TCP and RCP is not ideal. By using a fast link (2.4Gbps) and a comparatively longer RTT (100ms), the results produced in the paper’s Figure (6) look especially good.
Extension: Reproduction in Mahimahi
We have also been porting the ns-2 implementation of the in-network logic as a drop tail queue in Mahimahi. This avoids all issues with ns-2’s emulation of TCP, which probably skews the results in RCP’s favor. The TCP implementation built into ns-2 is suspicious at best (it certainly hasn’t had the decades of tuning that one would find in any modern Linux kernel), and its performance historically hasn’t matched what one would expect of a production implementation of TCP.
We tried to create the simplest RCP topology in Mahimahi, which includes two end-host and a modified drop-tail queue as the router between two end-host. The end-host is responsible for sending packets by the rate in the RCP header it receives from ACKs. The router will monitor the flow rate and adjust the rate according to formula (1).
The end-host is implemented based on the datagrump sender in Program Assignment 2 to simplify our work. The logic of our new code is based on the ns-2 rpc-host.cc code, but because some libraries rpc-host.cc uses are not immediately available in mahimahi and we found that many of the lines in rpc-host.cc are redundant, the detailed implementation is very different. Here are some summaries of the major change in the end-host.
- Message header:
The new message header is based on the contest message header. We added a few new fields:
uint64_t RCP_pkt_type; uint64_t rtt; /* In nanosecond */ int64_t RCP_rate; /* In bytes per second */
Note that we changed all timestamp to use nanosecond instead of millisecond to increase resolution.
- We removed everything related to the REF mode for now. We are not very clear what this mode is doing and don’t think we need it.
- Instead of using a timer to send packets at the required rate, we dispatch a “senderloop” thread to send packets for every interval_. This simplifies the logic. The sendfile/sendpkt/sendlast logic are all mapped to the senderloop. send_datagram() now takes the argument of RCP pkt type and is used to create a packet and send.
- recv() function is mapped to rcp_recv() in the sender and transform_into_ack() in the receiver.
Our end-host is working as expected even without RCP router. We can manually set RCP_DESIRED_RATE and the flow will send at a rate very close to the desired rate, especially with large packet size, like 1500 byte. For small packet size like 100 byte, we think that the I/O rate becomes high and the computation overhead becomes more obvious so the rate is lower than the desired rate.
For the router component, we added a new queue type, “rcp”, which is based on dropping_packet_queue.hh. We added a few parameters to DroppingPacketQueue to support calculation in formula (1). In the enqueue() function, the router will read RTT from the RCP header and calculate the average RTT. Also, the router will call checkStat() function at each time slot to recalculate the flow rate. Finally, the router will replace the RCP flow rate in the header with the updated one. To use “rcp” queue type, an extra argument called “capacity” is needed to support formula (1) calculation. This stands for the link capacity we are sending between two hosts.
Also, we think there is a bug in the original code that uses (link_capacity_ – input_traffic_) in the formula calculation. According to the formula, it should be input traffic rate that is subtracted, not input traffic. These two variables don’t even have the same unit. So, in our new code, we replaced input_traffic_ with last_load_.
However, we had experienced serious problem in getting the router part working. We found that when we modified the flow rate in the header, the receiver will drop the packet. We suspected that there were some high level checksum and what we have done have “corrupted” the packet. Our router needs to recalculate the checksum to prevent this from happening.
As mentioned earlier, the inconsistencies in distribution parameters are worrying, and our results for RCP are not quite as good as those presented in the paper’s Figure (6). Nonetheless, the paper’s assertion that RCP is a better minimizer of flow completion time appears valid. The paper presents a meaningful assertion that flow completion time should be minimized given current internet traffic and users’ expectations.
The results presented have no justification of their usage of a pareto distribution for flow sizes. It would have been more satisfactory if there was some data collection or citation of a statistical sample of flow sizes in the internet, to give some grounding in the decision to model traffic patterns with a pareto distribution.
When we inspected the RCP implementation, we also had some concerns. There are some computational overhead on both end-hosts and routers. The end-hosts and routers need to keep track of some statistics. If we use high bandwidth network like 10Gb/s and all such implementations are in software, it will end up with high CPU utilization, perhaps failing to update the statistics quickly enough. In such case, we think it’s better to have hardware to help with statistics collecting and computing and update the header accordingly. However, this will add cost to the hardware. Also, in order for RCP to work, we need all routers to support the protocol, which is an impediment to rollout.
Our results are from the original ns-2 implementation of RCP, albeit slightly modified. For ease of development and testing, we chose to use Vagrant, which can easily and reliably spin up a virtual machine as a clean environment for testing.
We containerize the entire setup, compilation, and test running procedure in a virtual machine, which is easily managed by Vagrant, which is designed for rapid iteration using the clean environment of a new virtual machine on each run. Vagrant is easily installed on Linux, Mac, and Windows, so our results are reproducible easily on virtually any mainstream computer. Because we are distributing a few files and scripts instead of a whole virtual machine image, downloading necessary files, and getting up and running is a quick affair.
Note that the use of virtual machines bars running our tests on a cloud provider like AWS, because virtual machines running on virtual machines simply breaks down. Because our Vagrant-managed virtual machine targets a specific version of Ubuntu and associated dependencies, we can rest assured that anyone attempting to reproduce results far in the future should have no compatibility issues.
Reproduction is intended for Linux or Mac OS only.
- git clone https://github.com/victorkp/rcp-ns2-result-reproduction
- If you are on Mac OS, or a non-Debian, non-RedHat based Linux, install Vagrant and VirtualBox. If you are using Debian/Ubuntu, or RedHat/Fedora/CentOS, run-all.sh should handle installation of Vagrant and VirtualBox for you.
- Run ./run-all.sh
- This spins up a virtual machine using Vagrant, installs all required packages in the VM, downloads, patches, and compiles ns-2.35 with RCP, runs tests, and generates charts from those results. All outputted traces and charts are seamlessly synced back to the host operating system, thanks to Vagrant.
- Note that this will take roughly 40 minutes, although this depends on you computer’s network throughput and compute performance. A first-run will need to download a Ubuntu VM image, which may add an additional five minutes, depending on one’s network capacity.
Note for those reproducing on computers with under 6GB of memory:
The current Vagrantfile setup tries to use 4GB of RAM for the virtual machine. Consider reducing memory usage in the Vagrantfile. See the “config.vm.customize” attribute.
The porting to mahimahi work is not complete. We hope in future we can finish the router code so that it can change the header without corrupting the packet. Once we get the RCP work with mahimahi, we can evaluates RCP on mahimahi and compared it to RCP. We can also evaluate the performance with multiple routers.
Also, current RCP implementation doesn’t have congestion control. If the rate falls to 0, the end-host falls into a CONGEST state, from which there is no mechanism to recover, stopping all end-hosts indefinitely. There would need to be a mechanism for a router to let an end-host know that its flow can resume transmission.
[RCP] N. Dukkipati, N. McKeown, “Why Flow-Completion Time is the Right Metric for Congestion Control”, In ACM SIGCOMM Computer Communication Review, Volume 36, Number 1, January 2006.
[Processor Sharing] N. Dukkipati, et. al, “Processor Sharing Flows in the Internet”, Proceedings of 13th Int. Workshop, IWQoS 2005, June 2005.