Project Name: Why Flow-Completion Time is the Right Metric for Congestion Control
Key Result(s): For TCP flows, average flow duration levels off quickly for flow sizes of >2000 packets.
Source(s): Original Paper: Why Flow-Completion Time is the Right Metric for Congestion Control. Nandita Dukkipati, Nick McKeown. ACM CCR January 2006.
The Flow-Completion Time paper argues that the vast majority of Internet applications care more about their flows’ completion times than about instantaneous bandwidth, round-trip time, or network utilization. But since bandwidth and network utilization are easier to tackle analytically, most papers use those as metrics, even though they are not quite the right problem to be solving. Most notably, TCP’s slow start and AIMD cause short flows to take far, far longer than they should to complete, while simultaneously stealing bandwidth from the flows that need it most.
We attempted to produce a graph showing the flow-completion time (FCT) for differing numbers and lengths of flows, along with the resulting link utilization. The results should be comparable to the TCP results in Figure 2 of the paper. We did not attempt to evaluate XCP, since an implementation was not readily available.
We started with a basic star topology, with many client hosts creating flows to a single server host.
In converting the original paper’s simulation into a virtual Mininet experiment, we had to produce a sender and receiver that created the same basic traffic patterns as the simulation. The key characteristic of the simulation was its Pareto-distributed flow lengths. We use the Mersenne Twister random number generator (seeded with the high-resolution timer and the process ID) to produce uniformly distributed values in the range [0, 1), then apply the inverse CDF of the Pareto distribution (with a chosen mean flow length and “shape” parameter) to obtain each flow’s length in packets.
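The sampling step can be sketched in a few lines of Python (a stand-in for our C generator; `pareto_flow_lengths` is a hypothetical helper, and Python’s `random.Random` happens to be a Mersenne Twister as well):

```python
import random

def pareto_flow_lengths(mean_pkts, shape, n, seed=None):
    # For a Pareto(shape, x_m) with shape > 1, mean = shape * x_m / (shape - 1),
    # so choose the scale (minimum flow length) to hit the desired mean.
    x_m = mean_pkts * (shape - 1) / shape
    rng = random.Random(seed)  # Mersenne Twister, like our C generator's PRNG
    lengths = []
    for _ in range(n):
        u = rng.random()                       # uniform in [0, 1)
        x = x_m / (1.0 - u) ** (1.0 / shape)   # inverse CDF of the Pareto
        lengths.append(max(1, int(round(x))))  # whole packets, at least one
    return lengths

# With the parameters used later (mean 60 packets, shape 1.4), the minimum
# flow length works out to about 17 packets.
```

Seeding with the timer and process ID, as our generator does, keeps concurrent client processes from producing identical flow sequences.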
Optimizing Traffic Generation
The need for Pareto-distributed traffic, along with precise timing of each flow and control over flow counts, ruled out iperf as the traffic generator, so we rolled our own. At first we tried a many-to-one arrangement, with each sender process responsible for a single concurrent flow and a few receivers sinking as many flows as they could handle, but the per-process overhead of the senders clearly limited the system. We ended up instead with a combined server/client executable, each instance handling up to 1024 flows (depending on the system). In practice, we keep the number of sending flows to around 10 per client process, with one client process per host in the network. Data in each flow travels from client to server, which facilitates more accurate flow completion time and bandwidth measurements and keeps the server side, which handles far more flows per process, simple. In both server and client modes, multiple flows are multiplexed on a single thread per process using asynchronous sockets and non-blocking syscalls. The resulting traffic generator and receiver is just under 400 lines of C.
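The single-threaded multiplexing on the server side looks roughly like this Python sketch using the `selectors` module (our real implementation is C with non-blocking sockets; `drain_many` is a hypothetical stand-in):

```python
import selectors
import socket

def drain_many(listen_sock, expect_flows):
    # Single-threaded sink: multiplex many flows over non-blocking sockets,
    # mirroring the poll()-style event loop in our C server.
    sel = selectors.DefaultSelector()
    listen_sock.setblocking(False)
    sel.register(listen_sock, selectors.EVENT_READ, None)
    totals = {}     # per-connection byte tallies
    done = 0
    while done < expect_flows:
        for key, _ in sel.select(timeout=5):
            if key.data is None:                 # event on the listening socket
                conn, _addr = key.fileobj.accept()
                conn.setblocking(False)
                totals[conn] = 0
                sel.register(conn, selectors.EVENT_READ, "flow")
            else:                                # data (or FIN) on a flow
                data = key.fileobj.recv(65536)
                if data:
                    totals[key.fileobj] += len(data)
                else:                            # peer sent FIN: flow complete
                    sel.unregister(key.fileobj)
                    key.fileobj.close()
                    done += 1
    return done, sum(totals.values())
```

An event loop like this lets one process track hundreds of flows without per-flow threads, which is what made the combined executable so much cheaper than one process per flow.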
Timing of a flow begins when the client attempts the connection and ends when the client and server have exchanged FIN packets. When a flow ends, the client outputs its length in packets and the time it took to complete, then immediately starts a new flow; each client thus tries to maintain the specified number of concurrent flows at all times. The server, meanwhile, waits for new connections, tallying and draining data as it arrives. Whenever a connection arrives or finishes, the server prints a timestamp with the current number of flows, and once per second it prints the amount of data received so we can track goodput.
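In Python terms, the client-side timing boundaries look something like the sketch below (`timed_flow` is hypothetical; the real client is part of our C generator):

```python
import socket
import time

def timed_flow(host, port, length_pkts, pkt_bytes=1000):
    # One flow, timed the way our client does: the clock starts at connect()
    # and stops once both sides have exchanged FINs.
    start = time.monotonic()
    s = socket.create_connection((host, port))
    s.sendall(b"\0" * (length_pkts * pkt_bytes))
    s.shutdown(socket.SHUT_WR)   # our FIN
    while s.recv(4096):          # drain until the server's FIN arrives
        pass
    s.close()
    fct = time.monotonic() - start
    print(length_pkts, fct)      # the same "length, duration" record we log
    return fct
```

A real client would call this in a loop, starting a fresh flow (with a fresh Pareto-sampled length) as soon as the previous one completes, to hold the concurrent-flow count steady.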
Running the Experiment
Running the experiment is a simple matter of creating the star topology in a Mininet-enabled Python script, spawning sufficient instances of our traffic generator with the proper parameters, and killing the generators after a chosen amount of time. The resulting data is easy to parse into graphs. Because we were hitting flow setup and expiration bottlenecks in the Open vSwitch instance that Mininet uses (discussed further below), we also polled once per second for CPU usage and the number of active Open vSwitch flows, and graphed those as well.
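The once-per-second switch polling can be sketched as follows; note that `ovs_flow_count` assumes the datapath summary from `ovs-dpctl show` contains a `flows: N` line, which may vary across Open vSwitch versions:

```python
import re
import subprocess
import time

def ovs_flow_count(text):
    # Pull the "flows: N" figure out of `ovs-dpctl show` output; returns
    # None if the line is absent (e.g. on older Open vSwitch versions).
    m = re.search(r"flows\s*:\s*(\d+)", text)
    return int(m.group(1)) if m else None

def poll_switch(interval_s=1.0, samples=5):
    # Record (timestamp, datapath flow count) once per interval.
    rows = []
    for _ in range(samples):
        out = subprocess.run(["ovs-dpctl", "show"],
                             capture_output=True, text=True).stdout
        rows.append((time.time(), ovs_flow_count(out)))
        time.sleep(interval_s)
    return rows
```

CPU usage can be sampled the same way (e.g. by reading /proc/stat each interval) and plotted on the same time axis as the flow counts.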
While we gathered results for a number of configurations, varying the total number of active flows, number of hosts, and mean flow length, our most representative results were obtained using 500 active flows spread across 50 hosts, with a 100 Mbps link bandwidth.
The round-trip time was 100 ms, and flow sizes were Pareto-distributed with a mean of 60 packets and shape = 1.4, where 1 packet = 1000 bytes. These parameters match the specifications in the original paper except for the mean flow length of 60 packets, which we increased in order to avoid stressing the flow setup and teardown bottleneck in Open vSwitch.
This graph shows the relationship between flow completion time (FCT) and flow size in packets, and is the most important result of our experiment. The graph is similar in shape to the original paper’s, but with two key differences. First, our flow durations are much longer: flow completion times in the original paper rarely exceed 10 seconds, even for large flows. This is to be expected, as scaling issues (described later) forced us to use a link bandwidth less than one tenth of that used in the original paper. Second, we found the relationship between flow duration and flow size to be much closer to linear (note that the graph uses a log scale), whereas flow completion time in the original paper was relatively constant for flow sizes greater than 2000 packets.
There are several possible reasons for this behavior. For instance, the effect of slow start on our small flows may have been less pronounced if our TCP implementation skipped slow start. Another possibility is that congestion at our switch prevented longer flows from reaching the window sizes they would otherwise attain.
Here we graphed the number of active flows over the course of our experiment, as measured at the server. Our graph differs slightly from the one in the original paper, which shows greater variability. This is probably an artifact of whatever method the authors used to generate their Pareto-distributed flow traffic (the details were not given in the paper). Note that we deliberately limited the number of active flows at any given point so as to avoid overloading Open vSwitch.
Our link utilization is slightly lower than the original paper’s, since we graphed goodput rather than the throughput shown in the original paper’s graph. The CPU overhead of setting up and tearing down flows, as well as packet drops under heavy load, also account for the slightly decreased link utilization. The sharp drops in utilization probably correspond to points where the switch struggled to handle all of the simultaneous flows directed through it.
In addition to the above three graphs, which correspond to Figure 2 in the original paper, we graphed the CPU usage of the system over time, as well as the number of active flows as reported by ovs-dpctl show.
Our CPU utilization averaged around 25%. Our system has four cores, so an average near 25% suggests that a single-threaded component, most likely the switch, is saturating one core, and that we are starting to run into CPU limitations at this number of flows. The downward spikes in this graph correlate with the link utilization hiccups described earlier.
We graphed the state of Open vSwitch using ovs-dpctl show to ensure that it was handling all of our flows without trouble. This graph is relatively stable, indicating that the switch was not breaking down under the load we were throwing at it. Note, however, that the number of active flows in Open vSwitch is almost twice that observed at the server, probably due to connections lingering in TCP TIME_WAIT. This implies that the rate at which we create flows affects Open vSwitch’s ability to cache them more than the sheer number of simultaneous flows does. There are also some correlated hiccups in the time-based graphs, suggesting that we are hitting the peak of what Open vSwitch can handle under these conditions.
We ran into several scaling issues as we began our experiments. The first bottleneck was per-process overhead, back when we were still running one sender process per active flow. After reworking our traffic generator to handle multiple flows per process, we found that we were still CPU-limited and nowhere near saturating the bandwidth of our links. In addition, the original paper specifies a link capacity of 2.4 Gbps, which Mininet cannot accurately emulate, so we had to scale down the experiment accordingly to avoid overwhelming the switch in our topology.
Instructions to Replicate This Experiment:
The experiment is easy to run on an Ubuntu 12.04 EC2 c1.xlarge instance with Mininet installed (note: Ubuntu 11.10 will work as well, but its open-vswitch version is older, so flow count monitoring will be disabled). Clone the git repository at git://github.com/dnschneid/flowcomptime.git and then run ./flowcomp-sweep.sh. The experiment takes about 2 hours to complete, after which the results will be available as graphs in a subdirectory.