Hedera


Team: Nikhil Handigol, Brandon Heller, Vimal Jeyakumar, and Bob Lantz.

Key Result(s): ECMP routing in a fat-tree topology achieves significantly lower aggregate throughput than an ideal “non-blocking” network.

Source(s):

  1. M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI, 2010.
  2. RFC 2991: Multipath Issues in Unicast and Multicast Next-Hop Selection. http://tools.ietf.org/html/rfc2991.

Contacts: Nikhil Handigol (nikhilh@stanford.edu), Brandon Heller (brandonh@stanford.edu), Vimal Jeyakumar (jvimal@stanford.edu), Bob Lantz (rlantz@cs.stanford.edu)

Introduction

In our second example we use Mininet-HiFi to reproduce results that were originally measured on a real hardware testbed by the authors of the Hedera paper (NSDI 2010) [1]. Hedera is designed to replace standard ECMP in datacenter networks. With ECMP, each flow takes a randomly chosen path through the network, selected by hashing fields of the packet header. Hashing ensures that all the packets belonging to a flow take the same path [2], preventing mis-sequencing. The Hedera paper shows that this simple approach leads to random hash collisions between “elephant flows” – flows that consume a large fraction of the link rate – causing the aggregate throughput to plummet. With this result as the motivation, the paper proposes a solution that intelligently re-routes flows to avoid collisions [1] and thus exploits all the available bandwidth.
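To make the collision mechanism concrete, here is a minimal sketch of hash-based path selection. The 5-tuple fields and the use of MD5 are illustrative assumptions for this sketch, not the hash actually used by the testbed switches.

import hashlib

def ecmp_next_hop(flow, next_hops):
    """Pick one of several equal-cost next hops by hashing the flow's
    5-tuple, so every packet of the flow follows the same path (cf. RFC 2991)."""
    key = '%(src_ip)s-%(dst_ip)s-%(src_port)s-%(dst_port)s-%(proto)s' % flow
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

# Two elephant flows that happen to hash onto the same uplink will share
# that link's bandwidth even if another equal-cost uplink sits idle.
uplinks = ['s1-eth1', 's1-eth2']
flow = {'src_ip': '10.0.0.1', 'dst_ip': '10.0.0.9',
        'src_port': 5001, 'dst_port': 5001, 'proto': 'tcp'}
print(ecmp_next_hop(flow, uplinks))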

More specifically, as part of the evaluation, the authors compare the throughput achieved with ECMP against that of an ideal “non-blocking” network (the maximum achievable) for 20 different traffic patterns (Figure 9 in the original paper [1]). They evaluate a hardware testbed with a k = 4 Fat Tree topology and 1 Gb/s links. The main metric of interest is the aggregate throughput relative to the full bisection bandwidth of the network.
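As a quick sketch of the metric (the function name and the example rates below are ours, for illustration only): with full bisection bandwidth every host can receive at line rate, so the ideal aggregate for 16 hosts on 1 Gb/s links is 16 Gb/s, and measured throughput is reported as a fraction of that.

def normalized_throughput(host_rates, link_rate):
    """Aggregate received throughput as a fraction of the ideal aggregate,
    where the ideal has every host receiving at the full link rate."""
    return sum(host_rates) / (len(host_rates) * float(link_rate))

# e.g. 16 hosts on 1000 Mb/s links that together receive 8000 Mb/s score 0.5
print(normalized_throughput([500.0] * 16, 1000))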

We replicate the experiment by emulating the same k = 4 Fat Tree topology in Mininet-HiFi. We use the same traffic generation program used by the Hedera authors and generate the same traffic patterns. We use code from Ripcord to replicate the two routing strategies to the best of our understanding. We set the link bandwidths to 10 Mb/s, and allocate 24% of a CPU core on our eight-core machine to each of the 16 hosts (i.e. a total of 50% load on the CPU). We set the buffer size of each switch to 20 packets per port, our best estimate of the configuration of the switches used in the hardware testbed.
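The sketch below shows how these parameters might map onto Mininet's Python API. The topology class and switch/host naming are ours; the routing logic (ECMP and the non-blocking reference) lives in the Ripcord-based controller used by our scripts and is not shown; and we assume the cpu option is expressed as a fraction of the whole machine, so 24% of one of eight cores becomes 0.03.

from mininet.net import Mininet
from mininet.node import CPULimitedHost
from mininet.link import TCLink
from mininet.topo import Topo

class FatTreeTopo(Topo):
    """k = 4 fat tree: 4 core switches and 4 pods, each pod holding
    2 aggregation switches, 2 edge switches, and 4 hosts."""

    def build(self, k=4, bw=10, queue=20, cpu=0.03):
        link_opts = dict(bw=bw, max_queue_size=queue)  # Mb/s, packets
        count = [0]

        def switch():
            count[0] += 1
            return self.addSwitch('s%d' % count[0])

        cores = [switch() for _ in range((k // 2) ** 2)]
        host = 0
        for pod in range(k):
            aggs = [switch() for _ in range(k // 2)]
            edges = [switch() for _ in range(k // 2)]
            for a, agg in enumerate(aggs):
                # each aggregation switch reaches k/2 distinct core switches
                for c in range(k // 2):
                    self.addLink(agg, cores[a * (k // 2) + c], **link_opts)
                for edge in edges:
                    self.addLink(agg, edge, **link_opts)
            for edge in edges:
                for _ in range(k // 2):
                    host += 1
                    self.addLink(edge, self.addHost('h%d' % host, cpu=cpu),
                                 **link_opts)

if __name__ == '__main__':
    # Requires root and Open vSwitch; flow routing is left to the controller.
    net = Mininet(topo=FatTreeTopo(), host=CPULimitedHost, link=TCLink)
    net.start()
    # ... run the traffic generator across the 16 hosts here ...
    net.stop()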

Results

(a) Benchmark tests from Hedera paper (Part 1).

(b) Benchmark tests from Hedera paper (Part 2).

The above figure shows the normalized throughput achieved by the two routing strategies – ECMP and non-blocking – with Mininet-HiFi, alongside results from the Hedera paper for different traffic patterns. The Mininet-HiFi results are averaged over five runs. The traffic patterns in Figure (a) are all bijective – the flows do not contend for link bandwidth – and they should all achieve maximum throughput on a full bisection bandwidth network. This is indeed the case for the results with the “non-blocking” switch. The throughput is lower for ECMP because hash collisions decrease the overall throughput. The more links a flow traverses, the more opportunities there are for collisions, so we should expect throughput to be worst for flows that cross all three layers (access, aggregation, and core). The experiments show exactly this behavior, as seen in the stride traffic patterns: with increasing stride values (1, 2, 4 and 8), the flows are routed across more layers, and the throughput degrades.
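For reference, the stride pattern is easy to state; here is a small sketch of the sender-to-receiver mapping it generates (zero-based host indexing is our assumption for illustration):

def stride(hosts, i):
    """stride(i): host x sends one flow to host (x + i) mod n.  For the
    16-host fat tree, larger strides push more flows up through the
    aggregation and core layers, where ECMP collisions are more likely."""
    n = len(hosts)
    return [(hosts[x], hosts[(x + i) % n]) for x in range(n)]

hosts = ['h%d' % x for x in range(16)]
for src, dst in stride(hosts, 8):
    print('%s -> %s' % (src, dst))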

The Mininet-HiFi results closely match those from the hardware testbed; in 16 of the 20 traffic patterns they are statistically indistinguishable. For the remaining four traffic patterns (randx2, randx3, randx4, and stride8), the authors explain that the commercial switch in their testbed – which is built from two Broadcom switching chips connected by an under-provisioned link – has lower throughput. To validate these results we would need to know the particular mapping of hosts to switch ports, which was not available.

The above results were obtained on a quad-core server. Below are the results from running the same experiment on a c1.xlarge instance on Amazon EC2.

(a) Benchmark tests from Hedera paper (Part 1).

(b) Benchmark tests from Hedera paper (Part 2).

Lessons Learned

The main takeaway from this experiment is that Mininet-HiFi reproduces the performance results for this set of data-center networking experiments. It appears possible to collect meaningful results in advance of (or possibly without) setting up a hardware testbed. If a testbed is built, the code and test scripts used in Mininet-HiFi can all be reused without change.

Instructions to Replicate This Experiment:

Create an Ubuntu 12.04 c1.xlarge instance running Mininet-HiFi.

cd ~
git clone https://bitbucket.org/nikhilh/mininet_tests.git
cd mininet_tests/hedera

Follow the instructions in the README file there:

https://bitbucket.org/nikhilh/mininet_tests/src/ad08368cf347/hedera/README
