Why Flow-Completion Time is the Right Metric for Congestion Control

Project Name: Why Flow-Completion Time is the Right Metric for Congestion Control

Team Members: David Schneider (dns@stanford.edu), Timothy Wong (ddrkirby@stanford.edu)

Key Result(s): For TCP flows, average flow duration levels off quickly for flow sizes of >2000 packets.

Source(s): Original Paper: Why Flow-Completion Time is the Right Metric for Congestion Control. Nandita Dukkipati, Nick McKeown. ACM CCR January 2006.

The Flow-Completion time paper argues that a vast majority of internet applications care more about their flows’ completion times than instantaneous bandwidth, round-trip-time, or network utilization. But since bandwidth and network utilization are easier to tackle analytically, most papers use those as metrics, despite the fact that it’s not quite the right problem to be solving. Most notably, TCP’s slow-start and AIMD cause short flows to take far, far longer than they should to complete, while simultaneously stealing bandwidth from the flows that need it the most.

We attempted to produce a graph showing the FCT for differing numbers and lengths of flows, and the resulting link utilization. The results should be similar to the TCP results for Figure 2 in the paper.  We did not attempt to evaluate XCP, since an implementation was not readily available.

We started with a basic star topology, with many client hosts creating flows to a single server host.

Mininet topology

Pareto Traffic
In converting the original paper’s simulation into a virtual mininet experiment, we had to produce a sender and receiver that created the same basic traffic patterns as the simulation.  The key characteristic of the simulation was the Pareto-distributed flow lengths.  We generate these using the Mersenne Twister random number generator (seeded by the high-resolution timer and the process ID) to generate uniformly distributed random values in the range [0, 1).  We then use the inverse CDF of the Pareto distribution (with chosen mean flow length and “shape”) to choose the flow length in packets.

Optimizing Traffic Generation
The need for Pareto-distributed traffic along with the need for precise timing of each flow and flow counts eliminated the possibility of using iperf as the traffic generator.  Instead, we opted to roll our own.  At first we tried a many-to-one sender-receiver pair, with each sender being responsible for one concurrent flow, and a few receivers sinking as many flows as they could handle, but the per-process overhead of the senders clearly limited the system.  We ended up instead with a combined server/client executable, each handling up to 1024 flows (depending on the system).  In practice, we keep the number of sending flows to around 10 flows per one client process per host in the network.  Data in the flow travels from client to server in order to facilitate more accurate flow completion time and bandwidth measurements.  This also enables greater simplicity on the server side, which handles a far greater number of flows per process.  In both the server and client mode, multiple flows are multiplexed on a single thread per process using asynchronous sockets and non-blocking syscalls.  The resulting traffic generator and receiver is just under 400 lines of C.

Flow Logging
Timing of flows begins when the client attempts a connection, and ends when the client and server have exchanged FIN packets.  When a flow ends, the client outputs the length in packets and the time it took for the flow to complete, and then immediately starts a new flow.  Thus each client tries to maintain the specified number of concurrent flows at all times.  The server meanwhile waits for new connections, tallying and draining the data as it appears.  Whenever a connection arrives or finishes, the server prints out a timestamp with the current number of flows.  It also prints out the amount of data received once per second to keep track of goodput.

Running the experiment
Running the experiment is a simple matter of creating the star topology in a Mininet-enabled Python script, and creating sufficient instances of our traffic generator with the proper parameters, killing the generators after a chosen amount of time.  The resulting data is easy to parse and use to generate graphs.  As we were hitting flow registration and expiration bottlenecks in Open vSwitch that Mininet uses (discussed further below), we also polled once a second for CPU usage and the number of active Open vSwitch flows, generating graphs of those as well.

While we gathered results for a number of configurations, varying the total number of active flows, number of hosts, and mean flow length, our most representative results were obtained using 500 active flows spread across 50 hosts, with a 100 Mbps link bandwidth.

The round-trip time was 100ms, and flow sizes are Pareto distributed with a mean of 60 packets and shape = 1.4, where 1 packet = 1000 bytes.  These parameters match the specifications in the original paper except for the mean flow length of 60 packets, which we increased in order to avoid stressing the bottleneck of creating and destroying flows in Open vSwitch.

Flow duration vs Flow size – Mininet simulation

Flow duration vs Flow size – Original result

This graph shows the relationship between flow completion time (FCT) and flow size in packets and is the most important result of our experiment.  The graph is similar in shape to that of the original paper, but with two key differences.  First, our flow durations are much longer–the flow completion times in the original paper are rarely greater than 10 seconds, even for large flows.  This is to be expected, as due to scaling issues (described later) we used a link bandwidth that is less than 1/10th that used in the original paper.  Second, we found the relationship between flow duration and flow size to be much closer to linear (note: the graph is drawn using a log scale), whereas flow completion time in the original paper was relatively constant for flow sizes greater than 2000 packets.
There are multiple possible reasons for this behavior.  For instance, the effect of slow start on our small flows may have been less pronounced due to slow start being skipped by our TCP implementation.  Another possibility is that congestion at our switch may have prevented longer flows from reaching the window sizes they would otherwise be able to.

  Active flows vs Time – Mininet simulation

Active flows vs Time – Original result

Here we graphed the number of active flows over the course of our experiment, measured at the server.  Our graph is slightly different than the one in the original paper, which seems to have greater variability.  This is probably an artifact of whatever method the authors used to generate their Pareto distributed flow traffic (the details were not given in the paper).  Note that we deliberately chose to limit the number of active flows at any given point as to avoid overloading the Open vSwitch.

  Link utilization vs Time – Mininet simulation

Link utilization vs Time – Original Result

Our link utilization is slightly lower than that of the original paper, since we chose to graph goodput, not throughput as in the original paper’s graph.  The CPU overhead of setting up and tearing down flows as well as packet drops under heavy load also account for the slightly decreased link utilization.  The sharp drops in utilization probably correspond to points where the switch struggled to handle all of the simultaneous flows being directed through it.

In addition to the above three graphs, which correspond to Figure 2 in the original paper, we graphed the CPU usage of the system over time, as well as the number of active flows as reported by ovs-dpctl show.

CPU usage vs Time – Mininet simulation

Our CPU utilization  averaged around 25%.  There are 4 cores on our system, so this suggests that we may be limited by single-threaded processing of the switch and thus indicating that we are starting to run into CPU limitations at this number of flows.  The downward spikes in this graph correlate with the link utilization hiccups described earlier.

Open vSwitch active flows vs Time – Mininet simulation

We graphed the state of the Open vSwitch using ovs-dpctl show to ensure that it was handling all of our flows without trouble.  This graph is relatively stable, indicating that the switch was not breaking down under the load of the flows we were throwing at it.  However, note that the number of active flows in Open vSwitch is almost twice that detected at the server, probably due to the TCP TIME_WAIT states.  This implies that the speed at which we create flows impacts the ability of Open vSwitch to cache flows more significantly than just the number of simultaneous flows.  Also, there are some correlated hiccups in the time-based graphs, suggesting that we’re hitting the peak of what Open vSwitch can handle under the given conditions.

Lessons Learned:
There were several scaling issues we ran into as we began to run our experiments.  The first bottleneck we hit was with per-process overhead, when we were still running one sender process for each active flow.  After reworking our traffic generator to handle multiple flows per process, we found that we were still CPU-limited and were not anywhere near saturating the bandwidth of our links.  In addition, the original paper specifies a link capacity of 2.4 Gbps, but Mininet cannot accurately handle a link of this bandwidth.  We thus had to scale down the experiment accordingly to avoid overwhelming the switch in our topology.

Instructions to Replicate This Experiment:
This is easy to run with an Ubuntu 12.04 EC2 c1.xlarge instance with mininet installed (note: Ubuntu 11.10 will work as well, but the open-vswitch version is older, so flow count monitoring will be disabled).  Clone the git repository at git://github.com/dnschneid/flowcomptime.git and then run ./flowcomp-sweep.sh.  The experiment will take about 2 hours to complete, after which the results will be available as graphs in a subdirectory.

2 responses to “Why Flow-Completion Time is the Right Metric for Congestion Control

  1. We were able to complete the experiment smoothly, without additional instructions. The setup didn’t take more than a few minutes and as stated the script took about 2 hours to complete.

    However, we did notice some differences in the results. Our plots for the 500 flows run yielded the results here:
    Our plots in *.png look somewhat similar to the results shown but there are distinct differences that we’re not sure about.

  2. We realized after our first run that we had run the tests on an instance with MPTCP enabled. This significantly skewed the results and was clearly a mistake on our part. After re-running the results on a fresh Ubuntu 12.04 instance with Mininet installed, we found similar results as above. Particularly our flow duration looks very similar: stanford.edu/~joshuav/500f-50h-60b-3600s/duration.png

    The setup, as Maxine said, was very straightforward and yielded consistent results with David and Tim’s data.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s