Life’s not fair, neither is TCP (…under the following conditions)


Team: Harshit Chopra and Benjamin Helsley.

Key Result: In a data center network, if many TCP flows arrive at two input ports on a tail-drop switch and compete for one output port, the group with fewer flows will have vastly degraded throughput.

Sources:

  1. The TCP Outcast Problem: Exposing Unfairness in Data Center Networks.  Pawan Prakash, Advait Dixit, Y. Charlie Hu, and Ramana Kompella.  NSDI 2012
  2. NSDI Presentation
  3. Reference on TBF queueing discipline: http://www.opalsoft.net/qos/DS-24.htm
  4. For Figure 6 below, we use the modified iperf supplied with PA2: http://www.stanford.edu/class/cs244/2012/pa2.html

Contacts: Harshit Chopra (harshitc@stanford.edu), Benjamin Helsley (bhelsley@stanford.edu)

Introduction

Prakash et al. observe that if many TCP flows arrive at two input ports of a commodity off-the-shelf switch and compete for the same output port, the group with fewer flows will have vastly degraded throughput. They call this effect TCP Outcast, to signify the “outcasting” of flows on one input port by flows on another input port. A further condition for TCP Outcast to occur is the use of commodity switches, which have shallow packet buffers and employ a tail-drop (or a variant of tail-drop) queue management policy.

Figure 1: Illustration of the “port blackout” effect (Figure 9 in [1])

The authors use traces and simulation to identify the cause as a phase effect they call “port blackout”: one input port synchronizes with the output port and causes a sequence of consecutive drops on the other input port (Figure 1). Flows in the smaller group each shoulder a larger share of the drops in such a sequence, increasing the chance that the entire tail of a single flow’s TCP window is dropped, causing a timeout. The authors find that increasing the number of flows does not mitigate the problem, and suggest that the number of observed consecutive drops increases with the number of flows. They present a number of ways to attenuate the unfairness, either locally at the switch by moving the queuing discipline away from tail-drop, or globally by enforcing “equal length” routing, which evens out the distribution of flows between ports.

In this project, we aim to reproduce the TCP Outcast behavior on Mininet, both on a simplified topology and on an emulated k=4 fat-tree topology, and to find evidence of “port blackout” as described in sections 3.2.1-3.2.4 of the paper.

Methods

The experimental results in the paper come from a data center testbed configured as a k=4 fat-tree topology, with 16 hosts at the leaves and 20 servers acting as 4-port gigabit switches, each port having a 16KB tail-drop packet buffer.  To reduce the effects of TCP Incast, minRTO on the hosts is configured to 2ms.  All other TCP parameters are left at their defaults.
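
For concreteness, a k=4 fat tree wires the 16 hosts through 20 four-port switches (8 edge, 8 aggregation and 4 core).  The sketch below shows one way such a topology can be expressed with a recent Mininet Topo API; the names, numbering and helper function are illustrative, not necessarily those used in our scripts or in the paper’s testbed.

    from mininet.topo import Topo

    class FatTreeTopo(Topo):
        """Sketch of a k=4 fat tree: 16 hosts and 20 four-port switches
        (8 edge, 8 aggregation, 4 core).  Naming is illustrative."""
        def build(self, k=4, **linkopts):
            half = k // 2
            self._count = 0
            def switch():
                self._count += 1
                return self.addSwitch('s%d' % self._count)  # unique numeric names
            core = [switch() for _ in range(half * half)]
            host_id = 0
            for pod in range(k):
                aggs = [switch() for _ in range(half)]
                edges = [switch() for _ in range(half)]
                for i, agg in enumerate(aggs):
                    for j in range(half):            # k/2 uplinks to core
                        self.addLink(agg, core[i * half + j], **linkopts)
                    for edge in edges:               # k/2 downlinks to edge
                        self.addLink(agg, edge, **linkopts)
                for edge in edges:
                    for _ in range(half):            # k/2 hosts per edge switch
                        host_id += 1
                        self.addLink(edge, self.addHost('h%d' % host_id), **linkopts)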

The authors use a fat-tree topology to illustrate how the conditions for TCP Outcast could arise in practice from a many-to-one traffic pattern, yet their results show that the only requirements are switches that use a tail-drop queuing discipline and an imbalance in the number of flows on each incoming port.  As such, we started with a contrived minimal topology consisting of only two switches (one acting as an aggregator for the larger group of flows), then later emulated the k=4 fat-tree topology used in the paper.  In both cases, we mimic the environment of the original paper using the following configuration:

hosts: minRTO is set to 2ms with the “ip route replace …” command.  For comparison against the results presented in the paper, all traffic consists of TCP BIC flows generated by iperf.
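
As a concrete illustration of the host setup, the following sketch applies both settings through Mininet’s host.cmd() interface; the route prefix, device lookup and sysctl shown here are illustrative rather than verbatim from our scripts.

    # Illustrative per-host configuration (not verbatim from our scripts).
    def configure_host(host, prefix='10.0.0.0/8', rto='2ms'):
        dev = host.defaultIntf().name
        # Lower minRTO by rewriting the route with the rto_min option,
        # e.g. "ip route replace 10.0.0.0/8 dev h2-eth0 rto_min 2ms".
        host.cmd('ip route replace %s dev %s rto_min %s' % (prefix, dev, rto))
        # Select TCP BIC (assumes the bic congestion-control module is loaded).
        host.cmd('sysctl -w net.ipv4.tcp_congestion_control=bic')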

links: The “port blackout” effect illustrated above is sensitive to the queuing discipline of the switch and the burst behavior of the links.  If the burst rates of ports A and B differ, they will not synchronize, preventing us from observing the effect.  Mininet emulates links using the Linux “tc qdisc” machinery to set the queuing discipline and rate limiting.  To emulate the tail-drop queuing necessary for TCP Outcast, we experiment with both the “token bucket filter” (tbf) [3] and “hierarchical token bucket” (htb) queuing disciplines, the latter being Mininet’s default.  Our experiments use a 100Mbps link speed, down from the 1Gbps used in the paper, which we were not able to reliably emulate on Mininet.  Although we experimented with both TBF and HTB, we present results with TBF for the simple topology and with HTB for the fat tree (the reason for this choice is explained in the “Lessons Learned” section).
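
For reference, the two queue configurations correspond roughly to the tc commands below when issued by hand on a switch port (Mininet normally issues equivalent commands itself, e.g. when links are created with TCLink); the rate, burst and limit values are illustrative.

    # Rough tc equivalents of the two emulated port queues (illustrative values).
    def config_tbf(switch, intf, rate_mbit=100, limit_bytes=128 * 1024):
        # Token bucket filter: packets are tail-dropped once "limit" bytes queue up.
        # burst must be at least rate/HZ; 15k is a commonly safe value at 100Mbps.
        switch.cmd('tc qdisc replace dev %s root tbf rate %dmbit burst 15k limit %d'
                   % (intf, rate_mbit, limit_bytes))

    def config_htb(switch, intf, rate_mbit=100, limit_pkts=11):
        # Hierarchical token bucket for rate limiting, with a short tail-drop
        # FIFO underneath (~16KB at 11 full-size packets).
        switch.cmd('tc qdisc replace dev %s root handle 1: htb default 1' % intf)
        switch.cmd('tc class add dev %s parent 1: classid 1:1 htb rate %dmbit'
                   % (intf, rate_mbit))
        switch.cmd('tc qdisc add dev %s parent 1:1 handle 10: pfifo limit %d'
                   % (intf, limit_pkts))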

routing: The paper uses shortest-distance routing on the fat tree, using the last bits of the destination host address to break ties among multiple shortest routes when selecting the higher-level switches. We emulate routing on the fat tree in Mininet using the same algorithm.
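
The tie-breaking rule amounts to indexing into the set of equal-length uplinks with the low-order bits of the destination address, so that traffic to a given destination always leaves through the same higher-level switch. A minimal sketch of one way to implement this (the helper name and inputs are ours):

    # Illustrative tie-breaker among equal-length (shortest) routes: the
    # low-order bits of the destination address select the uplink.
    def pick_uplink(dst_ip, uplinks):
        last_octet = int(dst_ip.split('.')[-1])
        return uplinks[last_octet % len(uplinks)]

    # e.g. pick_uplink('10.0.0.7', ['uplink0', 'uplink1']) always picks 'uplink1'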

To measure per-flow throughput, we collect unsampled tcpdump traces at the bottleneck switch interfaces.  As in the paper, we join the traces from the different ports to reconstruct the switch queue behavior and investigate the “port blackout” phenomenon.
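
The join itself is conceptually simple: a packet captured on an input port that never shows up on the output port is counted as a drop. The sketch below assumes the tcpdump text output has already been parsed into (timestamp, flow, seq) tuples per interface; parsing and retransmission handling are omitted.

    # Illustrative trace join: ingress packets absent from the egress capture
    # are treated as drops (ignores the corner case of retransmitted segments).
    def classify_packets(ingress_pkts, egress_pkts):
        forwarded = set((flow, seq) for _, flow, seq in egress_pkts)
        results = []
        for ts, flow, seq in ingress_pkts:
            verdict = 'FOUND' if (flow, seq) in forwarded else 'DROP'
            results.append((ts, flow, seq, verdict))
        return results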

Results

Figure 2 shows the original paper’s result: instantaneous throughput for 3, 7 and 13 flows.  Flow #1 arrives on input port “A” and the remaining flows ingress on port “B”; all flows contend for the same output port “C”.  The figure illustrates that the lone flow on port “A” does not get a fair share of the bandwidth.  The bottom row shows the average throughput for each group over the first 20%, the first 40% and the full interval.

Figure 2: Results from original paper, Figure 3 in [1]

At large time scales, we are able to reproduce what appears to be the Outcast effect on both our contrived topology and the k=4 fat tree, with results matching those presented in the paper:

Figure 3: Reproduction of results shown above on Mininet.  Here we use 100Mbps links and 128KB TBF switch buffers on a simplified topology.

Figure 4: Reproduction of results with the fat-tree topology with 1-2, 1-6 and 1-12 flows: 100Mbps links, 16KB (11 packet) HTB switch buffers, static shortest-distance routing, 2ms minRTO. The data covers a 20-second run.

Figure 4 shows the corresponding graphs for the fat-tree topology with 1 2-hop and 2 6-hop flows, 1 2-hop and 6 6-hop flows, and 1 2-hop and 12 6-hop flows, respectively. We see consistent reproduction of these graphs for runs of 20 seconds or more. For shorter durations the variance increases, probably because the flows take a while to settle and the initial bursts have a considerable effect on the average throughput. This is evident in the top-left graph, which shows instantaneous throughput against time for the case with 2 6-hop flows. The “outcasted” 2-hop flow initially hogs the bandwidth, since it has a lower RTT (~1ms vs. ~10ms) and is the first to start (an artifact of our experimental setup), so the remaining flows need multiple attempts to get going. Once they succeed in growing their congestion windows, we start seeing the outcast effect.

The paper also shows that the result scales to larger numbers of flows in the same proportion.  By using the modified iperf binary provided in PA2 [4], we are able to show that the result “scales up”: as the paper suggests, the outcast effect remains visible after increasing the number of 2-hop flows (keeping the ratio of 2-hop to 6-hop flows fixed at 1:12).  However, we had some difficulty reproducing this result for the 20-240 and 30-360 flow cases on Mininet and chose not to include them.

Figure 5: Average throughput per host with multiple 2-hop flows as presented in the paper (fixed proportion 1:12)

Figure 6: Average throughput per host for 2-24, 5-60 and 10-120 flows respectively on the fat tree. The data covers a 60-second run.

Finally, our experiments show some evidence to support the “port blackout” hypothesis, in that we can find traces containing chains of consecutive drops on a particular input port.  However, our aggregate drop counts show that most of the time the flows experience drops only in “singles”, so the result is somewhat inconclusive.  The trace displayed below is taken from a run of the contrived topology with 1 “2-hop” and 6 “6-hop” flows.

The following is a sample from one of our trace files, showing an instance where packets arriving on s0-eth2 are dropped 4 times in a row while packets arriving on s0-eth3 get through in between (both ports carry flows competing to egress on s0-eth1):

src              dest            ingress   ingress timestamp (ms)   result   egress    egress timestamp (ms)
10.0.0.2:53035   10.0.0.1:5001   s0-eth2   82.843000                DROP
10.0.0.3:34287   10.0.0.1:5001   s0-eth3   82.871000                FOUND    s0-eth1   93.369000
10.0.0.2:53035   10.0.0.1:5001   s0-eth2   82.942000                DROP
10.0.0.3:34287   10.0.0.1:5001   s0-eth3   82.986000                FOUND    s0-eth1   93.474000
10.0.0.2:53035   10.0.0.1:5001   s0-eth2   83.064000                DROP
10.0.0.3:34287   10.0.0.1:5001   s0-eth3   83.104000                FOUND    s0-eth1   93.591000
10.0.0.2:53035   10.0.0.1:5001   s0-eth2   83.174000                DROP
10.0.0.3:34287   10.0.0.1:5001   s0-eth3   83.297000                FOUND    s0-eth1   93.705000
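
To build the consecutive-drop histogram (results_simple/result.blackout.png), we count run lengths of drops per ingress port from the joined trace. The sketch below uses one reasonable definition of a “chain”: consecutive drops of a port’s packets, ended by that port’s next forwarded packet.

    from collections import Counter

    # rows: time-ordered (ingress_port, verdict) pairs from the joined trace.
    def drop_run_histogram(rows):
        hist = Counter()
        run = {}                              # current drop-run length per port
        for port, verdict in rows:
            if verdict == 'DROP':
                run[port] = run.get(port, 0) + 1
            elif run.get(port):               # a forwarded packet ends the run
                hist[(port, run[port])] += 1
                run[port] = 0
        for port, length in run.items():      # close out any runs still open
            if length:
                hist[(port, length)] += 1
        return hist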

Lessons Learned

Overall, we found that the result could be reproduced using out-of-the-box Mininet components. In this particular case, the most naive approach possible actually worked: by coincidence, the default link emulation and queuing discipline approximate a tail-drop queued switch.  However, we found it extremely difficult to get beyond the naive result; attempting to reproduce results at the 0.5s timescale used in the paper created a lot of strange artifacts that were time-consuming to explain.

While we feel we were ultimately able to get plausible results on Mininet, an enormous amount of time was spent chasing ghosts, tweaking parameters and observing results.  Our final results are especially sensitive to the queuing discipline and buffer sizes.  For example, in the simplified topology we found that TBF queuing with a 128KB buffer gives clean results that closely mirror those in the paper.  Reducing the buffer size to 16KB (as used in the paper) causes the effect to disappear completely, with the “outcast” host getting a majority share of the bandwidth.  Conversely, if we keep the small queue but switch to HTB (using an 11-packet, ~16KB buffer), we get an overly severe outcast effect, with the outcast getting virtually no throughput whatsoever!  Yet HTB with an 11-packet buffer experimentally gave the most consistent results on the fat-tree topology.  Clearly something interesting is going on here, and for some time we dug at this, as it calls into question whether we are actually seeing Outcast “for real” or merely as a consequence of variability introduced by bad parameters.  Unfortunately, we were not able to find a unifying explanation for the inconsistency, and had to be satisfied with seeing the effect at a high level and observing something like port blackout occasionally in our traces.

Finally, we think that realistic emulation of data center topologies in Mininet will be an interesting challenge.  In our case, to use the TBF queue on the small scale we needed larger buffers than used in the paper.  If you use larger buffers and slower links to emulate a fat tree, you’re going to have a bad time: in our case this introduced queuing delays of > 80ms on the 6-hop flows and destroyed the effect.  Because of the tight relationship between queue size, link speed and delay, without the ability to emulate high-speed links the ratios here are going to be unrealistic.  As a consequence, effects that rely on timing or congestion behavior may be difficult to reproduce without confounding artifacts.
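
A quick back-of-envelope calculation shows the scale of the problem: one full tail-drop buffer adds buffer size divided by link rate of queuing delay at each congested hop, which is roughly 80 times larger in our setting than in the paper’s, so a handful of congested hops on the 6-hop path quickly adds tens of milliseconds.

    # Per-hop queuing delay of one full tail-drop buffer (back of the envelope).
    def queue_delay_ms(buffer_bytes, link_mbps):
        return buffer_bytes * 8.0 / (link_mbps * 1e6) * 1e3

    print(queue_delay_ms(16 * 1024, 1000))   # paper:  16KB at 1Gbps   ~  0.13 ms
    print(queue_delay_ms(128 * 1024, 100))   # ours:  128KB at 100Mbps ~ 10.5 ms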

Steps to Reproduce

One-time steps

  1. In EC2, choose region “US West (Oregon)”.
  2. Launch a fresh “cs244-mininet-mptcp-dctcp” c1.xlarge instance.
  3. Clone code from the repository: git clone git://github.com/bhelsley/stanford-cs244-tcp-outcast.git
  4. Run ‘cd stanford-cs244-tcp-outcast’
  5. Run ‘./INSTALL.sh’.

To reproduce simple topology graphs:

  1. Run ‘sudo ./run_simple_topo.sh’

The graphs are generated as:

  1. results_simple/result.tcpdump.png — the primary result showing the outcast effect.
  2. results_simple/result.blackout.png — histogram of frequency of different chains of successive drops by flow.

To reproduce fat tree topology graphs:

  1. Run ‘./run_openflow.sh’
  2. When you see ‘Ready’ in step 1, in another terminal run ‘sudo ./run_fattree.sh’.

The two graphs are generated as results/final_small.tcpdump.png (Figure 4) and results/final_big.tcpdump.png (Figure 6).