Fairness of Jellyfish vs. Fat-Tree


Team: Chanh Nguyen and Christophe Chong.

Key Result(s): The Jellyfish topology with k-shortest-paths routing achieves fairness comparable to that of a Fat-Tree with ECMP routing; MPTCP is used on both topologies.

Source(s):

Original paper: Jellyfish: Networking Data Centers Randomly. Ankit Singla, Chi-Yao Hong, Lucian Popa, P. Brighten Godfrey.

Fat-Tree paper: Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. Charles E. Leiserson.

MPTCP paper: Improving Datacenter Performance and Robustness with Multipath TCP. Costin Raiciu, Sebastien Barre, Christopher Pluntke, Adam Greenhalgh, Damon Wischik, Mark Handley.

Contacts: Chanh Nguyen (chanhn@cs.stanford.edu), Christophe Chong (cdchong@stanford.edu)

Introduction:

The authors of the paper introduce Jellyfish, a degree-bounded random graph topology among top-of-rack switches, offered as a more cost-effective alternative to the Fat-Tree. With Fat-Trees, the path lengths between hosts are generally longer than necessary, and it can be difficult to incrementally grow the data center.

In Jellyfish topologies, one creates the graph by repeatedly choosing, at random, two switches that have free ports and are not already connected, and linking them, until no such pairs remain. When a new switch is added, its free ports are incorporated by randomly removing an existing link and then connecting the two freed ports to the new switch. The goal of Jellyfish is to improve hardware utilization and allow more efficient data-center growth while maintaining throughput, fairness, and resilience comparable to a Fat-Tree.
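To illustrate how simple this construction is, here is a minimal Python sketch of the random pairing and the incremental-growth step. It is our own simplification: it ignores host-facing ports and the link-swap fix-up the paper applies when the last free ports end up on the same or already-connected switches, and the function names are ours rather than anything from RipL.

    import random

    def build_jellyfish(num_switches, ports_per_switch, rng):
        """Randomly pair up free switch ports: keep choosing two distinct,
        not-yet-connected switches that both have a free port and link them,
        until no such pair remains."""
        free = dict((s, ports_per_switch) for s in range(num_switches))
        links = set()
        while True:
            candidates = [(a, b) for a in free for b in free
                          if a < b and free[a] > 0 and free[b] > 0
                          and (a, b) not in links]
            if not candidates:
                return free, links
            a, b = rng.choice(candidates)
            links.add((a, b))
            free[a] -= 1
            free[b] -= 1

    def add_switch(new, ports_per_switch, free, links, rng):
        """Incremental growth: while the new switch has at least two free
        ports, remove a random existing link (u, v) whose endpoints are not
        already neighbors of the new switch, then link both u and v to it."""
        free[new] = ports_per_switch
        while free[new] >= 2:
            candidates = [(u, v) for (u, v) in links
                          if (min(u, new), max(u, new)) not in links
                          and (min(v, new), max(v, new)) not in links]
            if not candidates:
                break
            u, v = rng.choice(candidates)
            links.remove((u, v))
            links.add((min(u, new), max(u, new)))
            links.add((min(v, new), max(v, new)))
            free[new] -= 2

    rng = random.Random(0)
    free, links = build_jellyfish(20, 4, rng)  # 20 switches, 4 ports each
    print("%d random switch-to-switch links" % len(links))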

Figure 1. (a) Fat-Tree. (b) Jellyfish. This figure from the original
Jellyfish paper illustrates the number of nodes that are within
2 to 6 hops. Jellyfish has more nodes that are within 4 hops
of each other in this example.

We will be replicating an experiment that compares the fairness of Jellyfish against that of the Fat-Tree. The Jellyfish authors used the simulator developed by the MPTCP paper's authors to study throughput versus flow rank for Jellyfish and the Fat-Tree:

Figure 2. Fairness of Jellyfish and Fat-Tree from original Jellyfish
paper. This is the figure we are trying to replicate. Flows were
sorted by throughput, with Jellyfish having slightly more flows.

Methods:

The Jellyfish paper did not describe the specific transport and routing protocols used in the experiment, so we contacted author Ankit Singla. Ankit explained that we would need to run MPTCP, with ECMP routing for the Fat-Tree and k-shortest-paths routing for Jellyfish. The original experiment used a k=10 Fat-Tree and k=8 shortest paths. However, we are limited by the size of our EC2 instance, so we scaled the experiment down to a k=4 Fat-Tree (a k-ary Fat-Tree has k^3/4 hosts and 5k^2/4 switches, so k=4 gives 16 hosts and 20 switches) and scaled our Jellyfish topology to the same size: 16 hosts, 20 switches, and 4 interfaces per switch.

Instead of using the simulator from the MPTCP paper, we used the Mininet network emulator. We started with the MPTCP-enabled Amazon EC2 AMI provided by the class TAs, Brandon and Nikhil. We used their RipLPox code, which provides a simple network controller, and RipL, which provides the Fat-Tree topology and ECMP routing. We augmented the RipL code to add the Jellyfish topology and k-shortest-paths routing.
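Our k-shortest-paths code lives inside the RipL routing module, but the core computation is small. Below is a hedged sketch of the idea using networkx, which our actual code does not depend on: shortest_simple_paths yields loop-free paths in order of increasing length (Yen's algorithm), so taking the first k gives the k shortest paths.

    from itertools import islice
    import networkx as nx

    def k_shortest_paths(graph, src, dst, k):
        """Return up to k loop-free paths from src to dst, shortest first."""
        return list(islice(nx.shortest_simple_paths(graph, src, dst), k))

    # Toy example; in our experiment the graph is the Jellyfish switch
    # topology and k=4 (scaled down from the paper's k=8).
    g = nx.Graph([(1, 2), (2, 3), (1, 4), (4, 3), (1, 3)])
    for path in k_shortest_paths(g, 1, 3, 4):
        print(path)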

In order to measure fairness, we followed the method described in the original MPTCP paper (as did the Jellyfish authors): generate a random permutation traffic matrix in which every host starts a flow to exactly one other host and no host receives more than one flow. The MPTCP paper provides the following figure demonstrating the MPTCP fairness advantage in a Fat-Tree.

Figure 3. Fairness of MPTCP vs Single Path TCP
from original MPTCP paper. 128 node Fat-Tree.
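Returning to the traffic pattern: here is a minimal sketch of one way to draw such a permutation, using rejection sampling over a shuffled receiver list until no host is paired with itself. The helper name is ours, not something from RipL or the MPTCP harness.

    import random

    def random_permutation_traffic(hosts):
        """Pair each host with a distinct receiver so that every host sends
        exactly one flow and receives exactly one flow."""
        while True:
            receivers = list(hosts)
            random.shuffle(receivers)
            if all(s != r for s, r in zip(hosts, receivers)):
                return list(zip(hosts, receivers))

    hosts = ['h%d' % i for i in range(16)]
    for sender, receiver in random_permutation_traffic(hosts):
        print('%s -> %s' % (sender, receiver))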

As a sanity check, we confirmed that MPTCP was indeed working and providing the advertised fairness benefits:

Figure 4. From our emulation on Mininet, MPTCP shows better
fairness compared to Single Path TCP. We used a Fat-Tree
with 16 hosts, ECMP routing, and 60-second flows. We ran the
experiment 8 times and averaged the results.

We measured throughput with two tools: iperf and bwm-ng. With both tools, we discarded the first and last 5 seconds of the flow transfer rate readings to exclude startup and teardown effects. We found that both tools provided roughly the same readings:

Figure 5. Iperf and bwm-ng provide similar throughput measurements.
Experiment performed with MPTCP on a k=4 Fat-Tree with ECMP routing.

We decided to use bwm-ng to measure throughput for each host, ignoring the first and last 5 seconds of every flow. We used 60-second flows and 15 Mbps links, giving each host 1/16 of the total CPU.
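The trimming and averaging step is straightforward; below is a hedged sketch of our post-processing, assuming bwm-ng's CSV output with a unix timestamp in the first column and the interface name in the second. The column holding the rate is passed in as a parameter, since the exact CSV layout depends on bwm-ng's options, and the helper itself is illustrative rather than our exact script.

    import csv
    from collections import defaultdict

    def average_rates(csv_path, rate_col, trim=5.0):
        """Average per-interface rates from a bwm-ng CSV dump, dropping the
        first and last `trim` seconds to exclude startup/teardown effects.
        Assumed layout: timestamp in column 0, interface name in column 1,
        rate of interest in column `rate_col`."""
        samples = defaultdict(list)  # iface -> [(timestamp, rate), ...]
        with open(csv_path) as f:
            for row in csv.reader(f, delimiter=';'):
                samples[row[1]].append((float(row[0]), float(row[rate_col])))
        averages = {}
        for iface, points in samples.items():
            t0, t1 = points[0][0], points[-1][0]
            kept = [r for t, r in points if t0 + trim <= t <= t1 - trim]
            if kept:
                averages[iface] = sum(kept) / len(kept)
        return averages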

Results:

Figure 6. Fairness of Jellyfish and Fat-Tree flows sorted by throughput.
The green line shows MPTCP on a k=4 Fat-Tree with ECMP routing.
The blue line shows MPTCP on a 20-switch, 16-host Jellyfish topology
with k-shortest paths routing. We used 4-shortest paths since the graph
was so small that exploring more than 4 paths resulted in very long and
inefficient paths, defeating the purpose of Jellyfish. The lines represent
averages over 8 runs for each topology, where the Jellyfish topology and
traffic matrices were randomized, and flows lasted 60s.
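For reference, the plotting step amounts to sorting the averaged per-flow throughputs and plotting them against their rank. This is a sketch of the shape of that step rather than our exact script; the function name and normalization by the 15 Mbps link speed are illustrative.

    import matplotlib
    matplotlib.use('Agg')  # render to a file; our EC2 instance has no display
    import matplotlib.pyplot as plt

    def plot_rank(curves, out='rank.png', link_mbps=15.0):
        """Plot flows ranked by throughput, one line per topology.
        `curves` maps a label to per-flow throughputs in Mbps, already
        averaged across runs; values are normalized by the link speed."""
        for label, rates in sorted(curves.items()):
            ranked = sorted(r / link_mbps for r in rates)
            plt.plot(range(1, len(ranked) + 1), ranked, label=label)
        plt.xlabel('Rank of flow')
        plt.ylabel('Throughput (normalized)')
        plt.legend(loc='lower right')
        plt.savefig(out)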

Our results show that the Jellyfish topology maintains surprisingly fair normalized throughput compared with that of the Fat-Tree. We say "surprisingly" because the original paper used over 200 hosts whereas we use 16 (with a corresponding difference in switch count), and in the original results the two topologies only converge tightly in throughput fairness beyond some minimum rank. That is, the normalized throughput in the original result showed slight but noticeable differences until at least the 100th flow, which is close to where the highest-ranked flows converged to the maximum throughput.

Assuming we have replicated the key results, we have shown that Jellyfish maintains relatively good fairness even with a small number of hosts. This is important in establishing that the Jellyfish topology is desirable not only for its friendly scaling properties, but also for its throughput fairness in small datacenters: a sort of cradle-to-grave panacea for datacenter topology headaches.

Lessons Learned:

Most of the experimental setup was not specified in the paper, so we needed to contact the authors directly. The original paper tested both topologies with hundreds of nodes; however, given our budget, we were limited to a single EC2 instance to run our emulation. We had to limit the number of hosts to approximately 16, otherwise the experiments would not complete in a timely manner (we tried a k=6 Fat-Tree and our instance could not handle it). Even with this scale limitation, the results still indicated good fairness on both topologies.

It was great that the TAs provided an EC2 AMI that had MPTCP running out of the box, as well as code for the Fat-Tree topology and ECMP routing. This made it much easier for us to simply insert a new topology and routing protocol. However, we did spend a lot of time getting MPTCP to demonstrate good fairness on the Fat-Tree. We had to trace the flows to confirm that MPTCP was indeed using multiple subflows, and that setting net.mptcp.mptcp_ndiffports=8 would create 8 subflows. We found Mininet fairly simple to use, although it takes a while to learn all of the debugging tools it provides.
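For anyone repeating this sanity check, the subflow count is controlled by that sysctl on the MPTCP-enabled kernel. Here is a minimal sketch of setting the knob and reading it back from a script; the helper function is ours, and running it requires root, as Mininet experiments typically do.

    from subprocess import check_call, check_output

    def set_mptcp_subflows(n):
        """Tell the MPTCP ndiffports path manager to open n subflows per
        connection, then read the value back as a sanity check."""
        check_call(['sysctl', '-w', 'net.mptcp.mptcp_ndiffports=%d' % n])
        out = check_output(['sysctl', '-n', 'net.mptcp.mptcp_ndiffports'])
        return out.strip().decode()

    print(set_mptcp_subflows(8))  # expect 8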

We found that too low a link bandwidth led to choppy connections, while too high a bandwidth made the emulation take too long to complete, so some tuning was required. In the future, it would be great to have better integration of RipLPox and Mininet so that they can share the same Topo objects and notify each other when they have finished initializing or shutting down. It would also be nice to be able to run multiple Mininet/RipLPox instances at the same time, so that we could run multiple experiments in parallel.

Validation:

We demonstrated that the relatively good fairness was due to MPTCP taking advantage of multiple paths by comparing it to Single Path TCP in Figure 4. Checking interface utilization with dpctl showed that the root switches in the Fat-Tree, for example, were receiving equal traffic. We checked the validity of our measuring tools by confirming that two different flow rate monitoring tools, iperf and bwm-ng, reported roughly the same throughput, as shown in Figure 5. We also made sure that the results were not due to a specific traffic matrix by running the experiment multiple times and averaging the results. For Jellyfish, we randomized the topology on every run to make sure we did not just get lucky with the topology. This increased the running time, since we had to set up and tear down the topology on every run, but it gave us a consistent environment for each run. Although we did not use the same network size as the paper, we matched the size of the Jellyfish topology to our scaled-down Fat-Tree. Finally, we tested our experiment across a range of link speeds and found that link speed did not affect the result much.

Instructions to Replicate This Experiment:

These instructions should create the throughput fairness graph in a file called “rank.png”:

1. Create a new Instance in US West (Oregon) region
2. Use the MPTCP enabled AMI provided by CS244 TA’s by searching for “mptcp.” The AMI is called “cs244-mininet-mptcp-dctcp”
3. Choose “c1.xlarge” as the instance type.
4. After your instance boots up, ssh into it.
5. Run “git clone https://bitbucket.org/chanh/cs244_nguyen_chong.git”
6. Run “sh cs244_nguyen_chong/install.sh”
7. Run “./start-jellyfish.sh”
8. A graph will be created with the name “rank.png”

Acknowledgements:

Big thanks to Brandon and Nikhil for their awesome code and guidance, and big thanks to fellow classmate Josh Valdez for giving us MPTCP advice. And lastly, thank you Ankit Singla for writing the Jellyfish paper and for answering our questions even while traveling.