Hedera


Team: Frank Li and Yiam Watanaprakornkul.

Key Result(s): We replicate to some degree the performance benefit of Hedera over ECMP as documented in [1], but in almost all tests this difference is not significant. In particular, we find that the ECMP implementation in [1] closely approximates the results expected of standard Spanning Tree, and that as the number of flows per host increase, the performance gain of Hedera decreases.

Source(s):

[1] M. Al-Fares et al. Hedera: Dynamic Flow Scheduling for Data Center Networks (NSDI ’10)

[2] M. Al-Fares et al. A Scalable, Commodity Data Center Network Architecture (SIGCOMM ’08)

Contacts: Frank Li (frank.li@stanford.edu) and Yiam Watanaprakornkul (yiam@stanford.edu)

Introduction:

ECMP is a common method for routing TCP flows in data center networks. ECMP uses a hash of the TCP  5-tuple to place the flow on a path from sender to receiver. However, long-term hash collisions (such as collisions of large flows) can result in underutilization of the network, even if the topology (such as FatTree) supports full bisection bandwidth [2].

This motivated the initial work on Hedera [1], a centralized flow scheduler that monitors the current traffic matrix and dynamically reassigns network-limited flows to uncongested links. Since TCP’s congestion avoidance mechanisms will throttle flows on congested links, thereby making a flow’s current throughput unrepresentative of its natural demand, Hedera estimates the natural demand matrix by increasing the flow capacity from sources and decreasing the exceeded capacity at receivers until the demands converge. Using these estimated demands, Hedera then applies a hard demand threshold to determine the set of large flows in the network. Each time a new large flow is seen, Hedera tries to place it on a path that will tolerate its natural demand. [1] explores two such placement algorithms: (a) Global First Fit, which makes capacity reservations for each flow and places the flow in the first path whose links have sufficient unreserved capacity, and (b) Simulated Annealing, which assigns energies to different flow assignments and empirically attempts to find the minimum-energy assignment. The central scheduler runs every few seconds, to allow for dynamic monitoring of the network environment.

Methods

M. Al-Fares et al. implemented and tested Hedera on FatTree topologies with k = 4, 16, and 32 over a variety of traffic patterns [1]. We attempt to replicate their physical testbed results (Figure 9) in simulation. We choose only to test Global First Fit because Simulated Annealing is significantly more complex but does not provide much performance gain.

We used Mininet as the simulation testbed, POX as the OpenFlow controller platform, ripl/riplpox for the default ECMP implementation, iperf for flow generation, and bwm-ng for bandwidth monitoring. We constructed a FatTree (k = 4) topology, hooked up the desired OpenFlow controller, and ran each experiment for 60 seconds. Our Hedera controller ran Demand Estimation and Global First Fit every 2 seconds, rerouting flows when necessary. To avoid exceeding Mininet’s maximum simulation bandwidth, we limited the bandwidth of each link to 10 Mbps (as opposed to the 1 Gbps links used in [1]), and we limited the CPU of each host to 4/k^3 = 0.0625 to prevent resource contention.

Like [1], we also ran a control experiment in which all 16 hosts of the k = 4 topology were connected to a single nonblocking switch. This represents the upper bound on expected throughput in our simulation testbed. Note that for our experiments, the maximum theoretical throughput is 160 Mbps, not 16 Gbps as in the experiments in [1].

All experiments were performed on an Amazon EC2 c1.medium instance, with Linux kernel 3.2.0-23.

Results:

We present a side-by-side comparison of the figure in [1] we sought to reproduce (see Figure 1), and our own experimental results (see Figure 2). Like [1] we considered only the middle 40 seconds of each 60-second run. Note that each bar in our results represents the median bandwidth of 5 separate runs on the same traffic matrix, and the error bars denote the 5th and 95th percentile bandwidths observed. Because [1] did not specify the precise setup of the randx and hotspot traffic patterns, we did not run those tests.

Figure 1. Physical testbed results (Figure 9) from [1]

Figure 2. Simulation results, 1 flow per host

In most tests, Hedera slightly edges out ECMP, although the difference is usually within the margin of error. In this sense, we replicated the gist of Figure 1. Nonetheless, there are a number of discrepancies. For example, the six StaggeredProb tests we ran showed low throughput on the control nonblocking switch. This may simply be the result of unfortunate randomness. However, the control benchmarks are otherwise more or less consistent, which leads to a more fundamental difference: ECMP performance in [1] is generally much worse relative to the control than in our results. The Stride tests are particularly telling. The figure in [1] shows ECMP performance degrading in a way expected of Spanning Tree as stride increases; that is, Stride-2 has around 50% of Stride-1 throughput (as the flows from the 2 hosts connected to the same edge switch will travel the same path to the same destination edge switch), and Stride­-4 and Stride-8 has around 25% of Stride-1 throughput (as each flow must traverse through a core router, and only one core router is part of the spanning tree). To validate our hypothesis, we ran the same set of experiments on Spanning Tree and Random routing implementations (see Figure 3).

Figure 3. Simulation results, 1 flow per host

The close correspondence in relative performance between ECMP in [1] and Spanning Tree in our results raises questions about the particular implementation of ECMP tested in [1].

We also noticed that all of the traffic matrices tested in [1] involved only one flow per sender. As ECMP performance is expected to improve under an increased number of flows, we tested also with 2 flows per host as well as 6 flows per host, half of which are randomly dropped (so that hosts have different numbers of flows). The results overwhelmingly show that in realistic settings with more than just one flow per sender, the difference between ECMP and Hedera (and indeed, even the nonblocking optimal) is negligible.

Figure 4. Simulation results, 2 – 6 flows per host

Figure 5. Simulation results, 2 – 6 flows per host

Lessons Learned:

The original version of riplpox we used had a broken ECMP implementation, which hashed flows based on the (src edge switch, dst edge switch) pair as opposed to the full flow 5-tuple. This actually resulted in Spanning Tree-like behavior on the Stride tests and motivated our secondary tests comparing Spanning Tree, Random routing, and ECMP. This bug was subsequently corrected, but the presence of such an error shows how easy it is to misimplement ECMP. While we cannot conclude so without consulting the original source code used for the experiments, we suspect that a similar misimplementation of ECMP in [1] led to Spanning Tree-like performance.

While [1] portrays Hedera in a very positive light, we found some of the testing methodology questionable. Aside from the Spanning Tree-like ECMP results already discussed, we realized that the traffic matrices tested very artificial setups that did not even highlight the dynamic nature of Hedera’s demand estimation. Because each sender sends only one flow, demand estimation will treat every flow as a big flow, and the tests do not show the interplay between mice and elephant flows that would be necessary for consideration in a real-world deployment.

Furthermore, since the testing procedure spawns a single flow per host which lasts for the duration of the experiment, each of the flow placement algorithms (Global First Fit and Simulated Annealing) would effectively run once at the beginning of the test and not update any flow placements during the test. Thus the results in [1] seem not to address the overhead of dynamic flow path assignment.

Finally, while Mininet was very reliable in reproducing results, we often ran into technical limitations that constrained the simulations we could run. For instance, we initially hoped to test on large topologies like the FatTree k = 16 or k = 32 topologies simulated in [1], but we found even k = 6 to be hard to simulate due to resource constraints. Nonetheless, after the initial setup, running the experiments on EC2 posed no problems. A single c1.medium instance sufficed for all of our experiments.

Instructions to Replicate This Experiment:

  1. Launch a new instance in the US West (Oregon) region on EC2, with AMI cs244-mininet-mptcp-dctcp. A c1.medium instance should be sufficient for replicating our results.
  2. When the instance is up, sudo edit the default configuration in /boot/grub/menu.lst. Change line 14 from ‘default 2’ to ‘default 0’.
  3. sudo reboot
  4. After the instance is up, run
    sudo apt-get install -y linux-headers-`uname -r`
    sudo dkms install openvswitch/1.4.0
    sudo service openvswitch-switch restart
  5. Check out the code repository
    git clone https://github.com/strategist333/hedera.git
    cd hedera
  6. Run the test script. Our full results can be replicated using test.sh 5 60, but for brevity we recommend test.sh 1 30. Even so, we recommend using screen to prevent accidental network drops from terminating the experiment early. See the README for more details.
    screen
    sudo ./test.sh 1 30
Advertisements

One response to “Hedera

  1. Christophe and I have successfully followed the instructions to replicate the results. The steps were easy to follow and everything went smoothly. All four graphs were generated and matched the results in the blog post. Great job!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s