CS244 ‘15: Hedera Flow Scheduling

Team: Anh Truong, Ian Walsh


As more and more applications move to the cloud, businesses are investing heavily in large, privately-owned data centers to handle the massive amounts of network traffic these applications generate. A defining characteristic of most data center networks is routing redundancy: for all (src, dst) node pairs there are many equal-cost paths along which flows could be routed. The key to achieving good performance and high network utilization is to spread the traffic evenly among all available links: collisions, in which multiple large flows share a bottleneck link, can seriously reduce throughput. Good multipath routing algorithms, then, are crucial for data center performance.

In Hedera [1], the authors try to solve the problem of routing large¹ flows in data center networks, in a way that minimizes the number of collisions on bottleneck links. They use a central scheduler that periodically samples statistics from all hosts in the network, calculates good routes for each large flow, and installs the updated routing entries using OpenFlow. They take as a baseline the widely used Equal Cost Multipath (ECMP) algorithm, which statically assigns flows to paths based on a hash of their headers, and try to improve on its performance.

Hedera Experiments

Finding the optimal assignment of flows to paths in an arbitrary topology is an NP-complete problem, so the authors implemented two heuristics for choosing good routes. Global First-Fit (GFF) is a simple greedy algorithm that assigns a flow to the first possible path with enough unreserved capacity for it, while Simulated Annealing (SA) is a more complicated algorithm that uses a probabilistic search to find good paths. The authors demonstrate that both GFF and SA outperfom ECMP on a wide variety of traffic patterns, and that SA almost always outperforms2 GFF as well. We show the relevant results from [1] below:

Screen Shot 2015-05-18 at 7.45.26 PM

Figure 9 from [1]: Hedera physical testbed experiments

    This experiment ran on a physical testbed of 16 hosts connected together in a fat-tree topology with 1 Gbps links, and the Hedera controller ran in a non-blocking 48-port Ethernet switch connected directly to each host. The x-axis labels are various traffic patterns of large TCP flows, and performance is measured by the network bisection bandwidth on the y-axis. The “Non-blocking” bar represents a different topology: all 16 hosts connected directly to a central switch in a star configuration, and representing an idealized network.

We chose to reproduce these physical testbed results using Mininet: setting up the same fat-tree topology, generating the same traffic patterns, and comparing the performance of the ECMP and Global First-Fit algorithms. We chose these results because the physical setup lends itself well to emulation in Mininet; other results in the paper relied on a custom network simulator that we didn’t have access to. We would have liked to reproduce the Simulated Annealing and Non-Blocking results as well, but due to time constraints we were unable to complete them.


Our experimental results are below:


These results were generated on a “c3.2xlarge” Amazon EC2 instance configured as described in our README. The primary observation is that ECMP and GFF appear to give very comparable performance; each performs better on a subset of the traffic patterns we tested, and their overall performance is roughly equal. It is important to note that, with the exception of the ‘stride*’ traffic patterns, all of these patterns incorporate an element of randomness in their generation. Thus, while we have generated our patterns using the same methods as in [1], our own implementation of “stag0_0203” is not equivalent to the authors’ “stag0(0.2, 0.3)”, due to random variation in how the patterns are generated.

Discussion & Challenges

We were surprised to observe such parity between ECMP and GFF; given the results in [1], we were expecting to see GFF perform much better than it did! Trying to understand our results leads us to reflect on several implementation challenges we faced over the course of this project.

We began our implementation on top of RipL and RipL-POX², minimalistic Mininet implementations of a fat-tree topology and multipath-capable controller (developed by a former CS 244 student, actually!). We found these libraries through an entry in the official Mininet FAQ page, and we were encouraged by the fact that previous implementations of this project³ seem to have used them to good effect. We thought this would save us time by not having to re-implement such basic functionality, but in fact the majority of our time was spent grappling with these libraries and trying to get them to work in our (newer) Mininet environment! All told, we had to (i) change the version of Mininet we were using to an official ‘cs244’ branch, (ii) switch to the ‘dart’ branch of POX, which is documented as the only tested version for RipL and RipL-POX, (iii) find a newer version of RipL-POX4 that targeted later Mininet versions, and (iv) fixing numerous small conflicts in the controller and routing code! Even so, we never got the environment working perfectly – intermittent errors remained!

We saw significant fluctuations in our measurements depending on the environment (EC2 instance, Mininet version, CS 244 VM), a smaller amount of variation between successive runs within the same environment, and, surprisingly, large variations between the performance on each of our laptops! We believe almost all of these variations are due to communication errors between the Mininet controller and switches: sometimes the controller will not see a switch that’s up, or it will miss a heartbeat and drop a switch only to bring it back a few moments later, etc. This seems to be due to either misconfiguration or version mismatch between all the different components we are using: Mininet, POX, RipL/RipL-POX, and our own code. Unfortunately, this suggests that reproducibility is severely hampered by version mismatch and is quite brittle to configuration errors between the controller, switches, and routing protocols.

A final challenge we encountered was that our implementation had no way to limit the link speed, the maximum switch queue size, or to use a ‘CPULimitedHost’ – the version of Mininet we used gave an error about being unable to find a required “mounted cgroup” that we were unable to resolve. This meant that we were unable to fine-tune our GFF reservation policy, and had to settle for reserving coarse fractions of the link capacities – a factor that may have affected our final results.


If we were to do this experiment again, we would not have relied on the RipL and RipL-POX libraries: ultimately wrestling with them and trying to make them compatible with newer versions of Mininet cost us more time than we would have saved developing our own controller from scratch on newer, better-supported versions of all components.

When ECMP statically assigns two or more large flows to the same link, they compete with each other for the limited bandwidth and overall throughput suffers, even though there may be unused capacity along other paths in the network. Our experience using ECMP in Mininet suggests that these collisions do represent a significant problem in practice. Though our experimental results for GFF do not closely match those in [1], we attribute the discrepancy to shortcomings in our own implementation, rather than a flaw in the paper’s methodology. As more and more switches become OpenFlow-compatible and SDN approaches become more mainstream, we foresee centralized flow schedulers like Hedera having an increasingly important role to play in data center performance.

Get the code!

Our project is publicly hosted on GitHub at https://github.com/iwalsh/244proj. It contains a README.md file in the top-level directory with all the instructions for installing and running our code on any Mininet-compatible machine; we recommend using a VM from the course web page (http://web.stanford.edu/class/cs244/vbsetup.html) or a CS 244 Amazon EC2 instance (http://web.stanford.edu/class/cs244/ec2setup.html).


1 The authors define a “large” flow as one demanding >= 10% of a link’s capacity. All links in these experiments were 1 Gbps, so “large” flows are those demanding 100 Mbps or more of bandwidth.

2 https://github.com/brandonheller/riplpoxhttps://github.com/brandonheller/ripl



[1] Al-Fares, M. et al., Hedera: Dynamic Flow Scheduling for Data Center Networks (NSDI 2010)

2 responses to “CS244 ‘15: Hedera Flow Scheduling

  1. Comment on Analysis:
    This research topic was interesting. I could easily understand its relevance for DC. It would have been more interesting if there were more analysis of the results. There seemed to be more focus on the challenges. For example, the report stated that the differences between the results and the original authors’ were “due to random variation in how the patterns are generated” and “shortcomings in … implementation.” If it could be shown how any of those factors contributed to the differences in the results, that would make the results more compelling.

    Reproducibility score: 5
    It was easy to reproduce the results, and they matched the results in the blog post very well. Just a few small comments on reproducibility. It would have been nice if the output gave some indication of how much time was remaining. It just started running and went for 30 minutes, and it was difficult to tell whether the simulation was broken or just needed more time. Also, the instructions said to run ‘python plot_results myAwesomePlot.png’ instead of ‘python plot_results.py myAwesomePlot.png’.

  2. Hello everyone here, I am really curious to know about some traffic patterns Hedera is using to generate traffic in the topology i.e., random and random_bij_data. Could anyone here please explain to me how are these communication patterns differ from each other?
    Thank you for any response. Thank you!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s