Background: We based our work on Jellyfish: Networking Data Centers Randomly by Singla et al. This paper examines existing techniques for connecting top-of-rack switches in a data center and presents the surprising result that connecting switches randomly has many desirable properties relative to a purely structured topology.
Ordinarily, switches within data centers are arranged in a carefully planned hierarchical topology. The authors describe several drawbacks to the conventional approach:
- Most topologies have strict structural requirements based on the number of ports on each switch.
- Connecting a switch with an arbitrary number of ports is not possible.
- The topology limits the size of the network.
- Traditional topologies do not support incremental expansion: adding a new switch can often require rewiring (or completely replacing) every switch in the datacenter. The only way to allow for future expansion is to leave ports on switches completely empty, which is a waste of resources.
The Jellyfish design proposed in this paper solves these problems by connecting switches randomly. The authors find that this approach has a number of advantages:
- The Jellyfish design supports switches with an arbitrary number of ports.
- The network can have an arbitrary size.
- Incremental expansion is simple. Adding a new switch requires rewiring just a handful of edge switches.
- Using the same switch equipment, Jellyfish can support 25% more servers than the fat-tree topology (a commonly used rigid topology).
- A Jellyfish network has a shorter average path length than traditional networks. This gives it higher bandwidth and throughput.
Author’s Motivation: Incremental expansion is important. When building a datacenter, it’s difficult for a company to plan out exactly how much capacity it will need. In the current cloud-centric world, data centers need to scale quickly; the authors give the example of Facebook, which went from 30,000 servers in November 2009 to 60,000 servers in June 2010. A rigid topology inhibits expansion by increasing the time and cost of adding new switches.
This observation makes the paper very relevant. The proposed Jellyfish topology makes incremental expansion simple, thus making it much easier for companies to scale up their data centers as demand increases. There is no need to plan the data center topology in advance or worry about rewiring it entirely during evolution. Additionally, the Jellyfish network has higher bandwidth and throughput.
Our Goal: We chose to reproduce Figure 9, shown below:
This chart shows the distribution of the number of distinct paths each link is on. The y-axis shows the number of distinct paths a link is on; the x-axis is an ordered ranking of the links. The graph shows the distribution under three different path selection techniques: 8 Shortest Paths, 64-way ECMP, and 8-way ECMP.
We chose this chart because it is a distillation of the paper’s results. Earlier in the paper, the authors perform many theoretical calculations of how a Jellyfish network has better performance than other networks. This performance gain comes from the fact that there are many paths between nodes, resulting in more links being used fully and therefore higher network bandwidth. Figure 9 demonstrates that there do exist routing techniques that can leverage this property.
A Note on Terminology: To avoid confusion, we have adopted specific terminology we will use throughout this post. A “graph” is a graph in the CS sense, modeling switches and links in a network. A “chart” is a visual representation of numeric data.
Our first step toward reproducing Figure 9 was to generate a model of a Jellyfish network. We created a Python class for generating a Jellyfish-style network graph.
The graph generation algorithm in the paper is: “Pick a random pair of switches with free ports (that are not already neighbors), join them with a link, and repeat until no further links can be added. If a switch remains with ≥ 2 free ports (p1, p2) — which includes the case of incremental expansion by adding a new switch — these can be incorporated by removing a uniform-random existing link (x, y), and adding links (p1, x) and (p2, y).”
Implementing this algorithm seemed simple, but we encountered an issue not discussed in the paper. It is possible to arrive at a state where at least one switch has exactly one open port, it is not the only switch with open ports, and it is already connected to every other switch with open ports. We resolved this situation as follows:
- When all links that can be trivially added have been added:
- Select a random switch A that has an open port.
- If A has only one open port, randomly select one of its links and disconnect it. This is the key step we added to handle the undiscussed case.
- Randomly select two switches B and C that have no open ports, such that A is connected to neither B nor C, but B and C are connected to each other.
- Disconnect B and C.
- Connect A to B, and connect A to C.
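The procedure above can be sketched in Python. This is a minimal sketch of our approach rather than the exact code in our repository; all function and variable names below are our own:

```python
import random

def generate_jellyfish(n_switches, n_ports, seed=None):
    """Randomly wire n_switches switches with n_ports ports each,
    following the Jellyfish construction plus our fix-up steps."""
    rng = random.Random(seed)
    neighbors = {s: set() for s in range(n_switches)}
    free = lambda s: n_ports - len(neighbors[s])

    def connect(a, b):
        neighbors[a].add(b)
        neighbors[b].add(a)

    def disconnect(a, b):
        neighbors[a].discard(b)
        neighbors[b].discard(a)

    while True:
        open_sw = [s for s in neighbors if free(s) > 0]
        # Trivial case: two non-adjacent switches both have a free port.
        pairs = [(a, b) for i, a in enumerate(open_sw)
                 for b in open_sw[i + 1:] if b not in neighbors[a]]
        if pairs:
            connect(*rng.choice(pairs))
            continue
        if sum(free(s) for s in open_sw) < 2:
            break  # at most one stray port remains; nothing more can be done
        # No trivial link can be added: apply the fix-up described above.
        a = rng.choice(open_sw)
        if free(a) == 1:
            # Our extra step: free a second port on `a` by dropping a link.
            disconnect(a, rng.choice(sorted(neighbors[a])))
        # Find a link (b, c) between full switches not adjacent to `a`.
        candidates = [(b, c) for b in neighbors
                      if free(b) == 0 and b != a and b not in neighbors[a]
                      for c in neighbors[b]
                      if free(c) == 0 and c != a and c not in neighbors[a]
                      and b < c]
        if not candidates:
            break  # pathological tiny graphs; give up gracefully
        b, c = rng.choice(candidates)
        disconnect(b, c)
        connect(a, b)
        connect(a, c)
    return neighbors
```

Representing the graph as a dict of neighbor sets keeps the free-port check and each rewiring step cheap, which matters since the outer loop runs once per link added.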
We used the Graphviz library (http://www.graphviz.org/) to visualize the graphs we generated. A sample graph is displayed below. The original paper did not specify how many switches and ports were in the network they used, so we ran our simulation with a variety of sizes, as we will explain later. Each choice was a compromise between network size and simulation runtime, while using a number of ports that is available on commodity switches. Below is an image of 35 switches with 5 ports each; a smaller graph is displayed so that the structure is visible.
Our next step was to run the path selection algorithms on our generated graph. A switch uses a path selection algorithm to decide which path to use to send a packet to the desired destination. Instead of always using the shortest path, switches in data centers take advantage of multiple paths to the destination in order to increase throughput.
The paper compares two path selection algorithms:
- KSP (K-Shortest Paths): Choose a path at random from the k shortest paths from the source to the destination. The paper uses k=8.
- ECMP (Equal-Cost Multi-Path): Choose a path at random from the paths with the lowest cost. The paper tested both 8-way and 64-way ECMP, which choose randomly among at most 8 or 64 of the lowest-cost paths, respectively.
We used YenKSP (https://github.com/Pent00/YenKSP), a Python implementation of the KSP algorithm. Given a graph, a value for k, and source/destination nodes, YenKSP returns the k shortest paths between the source and destination nodes. We added code to record the number of paths each link is on.
While KSP chooses between the k shortest paths regardless of cost, ECMP chooses between only the paths with lowest cost. Therefore, the paths in ECMP are a subset of the paths in KSP. While recording the number of distinct paths under KSP, we could also record the number of paths for ECMP by examining only those paths sharing the lowest cost.
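Our actual scripts build on YenKSP, but the same bookkeeping can be sketched with networkx, whose shortest_simple_paths implements Yen's algorithm and yields paths in order of increasing length. The function name and structure below are ours:

```python
import itertools
from collections import Counter

import networkx as nx

def count_paths_per_link(G, k=8):
    """Count, for every link, how many of the k shortest paths (taken over
    all switch pairs) include it. KSP counts use all k paths; ECMP counts
    use only the paths tied for the minimum hop count, mirroring the
    subset relationship described above."""
    ksp_count, ecmp_count = Counter(), Counter()
    for src, dst in itertools.combinations(G.nodes, 2):
        paths = list(itertools.islice(nx.shortest_simple_paths(G, src, dst), k))
        min_hops = len(paths[0])  # paths arrive sorted by length
        for path in paths:
            for u, v in zip(path, path[1:]):
                link = (min(u, v), max(u, v))  # canonical undirected link
                ksp_count[link] += 1
                if len(path) == min_hops:  # equal cost: also an ECMP path
                    ecmp_count[link] += 1
    return ksp_count, ecmp_count
```

Because the ECMP tally is taken from the same k paths, it is by construction a subset of the KSP tally, exactly as described above.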
While the original paper studies both 8-way ECMP and 64-way ECMP, we only studied 8-way ECMP. Our network graphs are not large enough for 64-way ECMP to make much of a difference. Additionally, 64-way ECMP would have significantly increased the computation time required.
We used matplotlib (http://matplotlib.org/) to create our version of Figure 9, using 84 switches with 12 ports each.
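The chart itself is simple to produce: sort each link's path count in ascending order and plot it against the link's rank. A sketch follows, assuming the path counts arrive as dicts mapping a link to its count; this is not our exact plotting code:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed on a headless machine
import matplotlib.pyplot as plt

def plot_figure9(ksp_counts, ecmp_counts, outfile="figure9.png"):
    """Plot each link's path count against its rank, one line per
    path selection technique, in the style of Figure 9."""
    for counts, label in ((ksp_counts, "8 Shortest Paths"),
                          (ecmp_counts, "8-way ECMP")):
        ranked = sorted(counts.values())  # ascending rank order
        plt.plot(range(len(ranked)), ranked, label=label)
    plt.xlabel("Rank of Link")
    plt.ylabel("# Distinct Paths Link is on")
    plt.legend(loc="upper left")
    plt.savefig(outfile)
    plt.close()
```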
The chart we generated has the same general shape as the original. KSP consistently has a larger number of paths than ECMP, with the difference increasing at higher ranks. Both lines have a relatively flat slope for most ranks, then increase sharply at the end.
The biggest difference between our chart and the original is the scale of the y-axis. In their chart, the KSP line’s sharp increase occurs at a y-axis value of approximately 12 paths. In our chart, it occurs at around 200 paths. We suspect this may be related to the size of the test network. The original authors did not specify the size of the network they used to generate the data for their Figure 9. We suspect it was a large network, which therefore had many possible paths, and so each link was used in relatively few of them. A smaller network has fewer possible paths over which to spread the load, and thus a larger number of paths per link.
The original paper has only one Figure 9, which shows the properties of a single network. However, the networks are being generated randomly, and so it’s very unlikely that another network will ever look exactly the same as the one used by the original authors. We were curious how Figure 9 changes across multiple random networks of the same size.
Therefore, we ran the simulation 20 times, each time generating a random graph of 35 switches with 5 ports each. Below is the path variability chart. As we thought might be the case, the number of distinct paths per link varies widely depending on the graph.
Interested to see what would happen with a bigger graph, we ran the simulation 20 times, each time generating a graph of 80 switches with 20 ports each. Below is the path variability chart. Interestingly, there is little variance between runs.
We suspected this new lack of variance might be due to the lower ratio of switches to ports per switch. To explore this, we ran the simulation 20 times, each time generating a graph of 84 switches with 12 ports each. This maintains the 7:1 switch:(port per switch) ratio we used in the 35 switch, 5 ports per switch run. Below is the path variability chart. As we expected, this restored the variation.
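This experiment can be sketched as follows, using networkx's random_regular_graph as a stand-in for our own generator (a Jellyfish graph is essentially a random regular graph, though the stand-in requires the total number of ports to be even), with sizes shrunk for illustration; all names here are ours:

```python
import itertools
import statistics

import networkx as nx

def ranked_link_counts(G, k=8):
    """Per-link count of the k shortest paths over all switch pairs,
    sorted ascending, i.e. one Figure 9 curve."""
    counts = {}
    for s, t in itertools.combinations(G.nodes, 2):
        for path in itertools.islice(nx.shortest_simple_paths(G, s, t), k):
            for u, v in zip(path, path[1:]):
                e = (min(u, v), max(u, v))
                counts[e] = counts.get(e, 0) + 1
    return sorted(counts.values())

def busiest_link_spread(n_switches, n_ports, trials=20):
    """How much the most-loaded link's path count varies across
    independently generated random graphs of the same size."""
    tops = []
    for seed in range(trials):
        G = nx.random_regular_graph(n_ports, n_switches, seed=seed)
        if not nx.is_connected(G):
            continue  # rare at these sizes; skip rather than patch up
        tops.append(ranked_link_counts(G)[-1])
    return min(tops), statistics.mean(tops), max(tops)
```

A wide gap between the minimum and maximum here corresponds to the run-to-run variation visible in our 35-switch and 84-switch charts.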
This analysis gives an interesting result. In order for a randomly generated graph to have reliably desirable properties, the ratio of switches to ports per switch must be low – much lower than is realistic in a data center. Thus, randomly generating a number of graphs and choosing the best one seems to be a worthwhile exercise.
The provided replication script (instructions below) runs the experiment with each of these parameter sets.
As described previously, we found that a high ratio of switches to ports per switch causes variability in the average number of distinct paths in random networks. In any realistic setting, a data center is going to have many more switches than ports per switch, so variability is inevitable. This experiment does not show whether the worst-case graph is “bad” in terms of throughput. If the worst-case performance is still acceptable, this does not present an issue.
If the worst-case performance is not acceptable, then data center admins must simulate the performance of the network before deploying the Jellyfish topology, generating a new graph if the topology is not sufficient. This makes incremental expansion harder, as the random graph obtained through incremental expansion may have less acceptable performance.
Therefore, while the original paper’s result remains very promising, further research should examine how performance varies between different random graphs.
The original paper lacked many crucial details that made reproducing the results difficult.
As described previously, the graph generation algorithm in the paper was not complete. We had to figure out how to work around an edge case in the algorithm that resulted in an infinite loop.
The original paper also did not specify the experiment setup used to generate Figure 9. They say only that they used a setup of 686 servers, but do not specify the number of switches used or the number of ports available in each switch. Therefore, we can only guess at the correct parameter when running our own simulation.
We were also concerned with the runtime of our simulation. For each graph, we needed to run KSP N(N-1)/2 times, where N is the number of nodes in the graph. The KSP algorithm was already roughly O(N^2), and the implementation we used takes significantly longer as the number of ports increases. Finally, we repeated these steps for 20 graphs. The resulting simulation takes about an hour to run.
Our simulation is a series of Python scripts. Reproducing Figure 9 relies only on standard graph algorithms and does not require anything specific to networking, so we chose to minimize the number of dependencies. We wrote the graph generation module ourselves and built on top of the YenKSP library for the KSP and ECMP algorithms. We used the matplotlib library to generate the charts.
Steps to Reproduce
1. Create a new EC2 instance. c3.large and c3.2xlarge are both okay. Use our AMI, “CS244-Jellyfish-Grodman-Silver”, which can be found by searching in “Community AMIs” in US West (Oregon). Follow the instructions on the CS244 website (http://www.stanford.edu/class/cs244/ec2.html). Be sure to follow the instructions for setting up a security group that allows SSH access and web server access.
2. Once the instance is running, SSH into it.
3. Go to the ~/jellyfish directory.
4. Run the command: ./runme.sh
5. This will run the entire simulation, which takes about an hour. When it is finished, the results will be in ~/jellyfish/output_35_5/, ~/jellyfish/output_80_20/, and ~/jellyfish/output_84_12/
6. To analyze the output images, it may be useful to run “python -m SimpleHTTPServer”. Then, you can open [machine public DNS]:8000 in a web browser and navigate to ~/jellyfish.
7. When interpreting the results in the output directories, there are three main things to look at:
7a) graph[i].png contains a visualization of graph number i
7b) graph[i]_figure9.png contains figure 9 (discussed above) for graph i
7c) variability.png contains a chart of the average number of paths in the 20 graphs generated
Our source code is available at: https://github.com/jgrodman/jellyfish