What Did We Do? Why Did We Do It?
Data centers contain racks of servers. Each rack contains a group of servers, with a link run from each server to a single top-of-rack (TOR) switch, thus interconnecting the servers of a rack. The TOR switches are then interconnected, thus interconnecting the racks. Modern data centers interconnect the TOR switches via highly-structured topologies, like the ubiquitous fat-tree. At NSDI 2012, a group from UIUC presented Jellyfish: Networking Data Centers Randomly, suggesting that these structured data center topologies can be outperformed, at scale, by the simplest topology possible: a random one.
A data center topology aims for four objectives:
- High bandwidth across all server-to-server paths
- High utilization of links
- Robustness to link failure
Jellyfish argues that the objective of extendability is proving very important in the modern data center, as companies building and operating data centers can see rapid increases in service demand. Jellyfish further argues that existing topologies do not adequately cater to extendability, resulting in expansion:
- being overly coarse (e.g. requiring servers to be added in quanta of thousands),
- requiring extensive reconfiguration of existing infrastructure (e.g. replacing existing switches with higher port-count switches, or substantial rewiring),
- requiring up-front over-provisioning or a priori expansion targets (e.g. using switches with an excess number of ports, leaving those ports for future use)
The Jellyfish Random Toplogy: To address this problem, Jellyfish proposes a random graph at the TOR switch layer. This random graph is created by picking a random pair of switches that have free ports and are not already connected, connecting them with a single link, then repeating until no further links can be added. Now, we can expand the network a single rack at a time: randomly select a link in the existing graph, remove it, and connect its two nodes instead to the TOR switch being added, and repeat this process until there are no more (or just a single) unmatched ports.
What the Random Topology Achieves: So extendability (Objective #4) is almost ideally achieved (expansion is continual, requires no replacement of existing equipment, and requires no initial planning, although rewiring is still necessitated). Jellyfish then discusses its suitability in terms of Objectives #1-3. To meet Objectives #1 and #2, two things must be achieved: the random graph topology must present high path capacity to server-to-server connections, and routing and congestion control protocols must combine to properly utilize the available capacity—i.e. the protocols must do a strong job of load balancing over the graph.
The Importance of Routing Protocol: Jellyfish determines through simulation that its random graph topology can achieve per server average throughput that is better than that achieved by a fat-tree topology (when both are under a random permutation traffic matrix). This better-than-fat-tree performance is contingent on the routing protocol used: Jellyfish uses a multi-path TCP (MPTCP) congestion control protocol, and pairs it with a k-shortest paths routing algorithm. If Jellyfish pairs MPTCP with an equal cost multi-path (ECMP) routing algorithm, its per server average throughput is notably less than that of fat-tree (see Table 1 in the paper).
The k-shortest paths routing algorithm takes k MPTCP subflows and the set of k shortest paths from sender to destination, and assigns each path to carry a single subflow. The ECMP routing algorithm sends the k subflows down the m paths with lowest cost; if m < k, then each of the k subflows is uniformly randomly assigned to a path.
Why Is k-Shortest Paths Essential to Throughput? In an effort to explain why k-shortest paths is better than ECMP for the random graph topology, Jellyfish applies a random permutation traffic pattern to a random topology of 686 servers, then calculates a plot showing how many paths each link is on. This plot—Fig. 9 in the paper, shown below—is meant to demonstrate that an 8-shortest paths protocol provides more path diversity than an ECMP protocol. The authors then argue that this path diversity underpins the throughput discrepancy between using k-shortest paths and using ECMP. We replicate this result, Fig. 9 from the paper, then discuss why we think the measure is misleading, and propose what we believe to be a more appropriate measure, which we implement.
How Did We Go About Replicating the Result?
Generating the Topology: First, we needed to generate a Jellyfish random topology between a collection of TOR switches. The caption to Fig. 9 indicates that it was generated with “a typical Jellyfish of 686 servers (built using the same equipment as a fat-tree supporting 686 servers)”. In a fat-tree topology comprised of k-port switches, there are k3/4 servers and 5k2/4 switches (k pods, each containing k/2 “edge” switches and k/2 “aggregation” switches, plus k2/4 “core” switches; see A Scalable, Commodity Data Center Network Architecture, SIGCOMM ’08). Therefore, we construct a topology using 686 servers and 245 switches, where each server has k=14 ports. (In later sections, we’ll show results from topologies constructed with k=10 and k=12 ports).
Jellyfish is explicit on how switches are connected: “…pick a random pair of switches with free ports (for the switch-pairs that are not already neighbors), join them with a link, and repeat until no further links can be added”. But Jellyfish is silent on how servers are connected to switches. So we make some inferences. First, we note that in the Jellyfish topology, the randomness of the links implies that no matter what two servers we select, we cannot know a priori the shortest path between them (i.e. we must search the path space to find the shortest path); this is contrasted to a structured topology, in which we can know a priori the shortest path. This suggests that the pool of servers should be evenly distributed across the switches (or as close to evenly as possible).
Therefore, we create two different topologies. In one topology, we cycle through each server and randomly assign it to a switch; this will distribute the servers across the switches in an even fashion probabilistically, meaning there will be some switches with more servers than others, and may be some switches with very many servers (we check to ensure there is no switch with all of its ports attached to servers). In the other topology, we cycle through each server and assign it to switch# = server# % (#_of_switches), thus distributing the servers across the switches in an even fashion deterministically.
Finding Number of Paths Per Link: After generating our two topologies (the probabilistically even and the deterministically even), we implement a random permutation traffic matrix (each server sends at its full output link rate to a single other server, and receives from a single other server, and this permutation is chosen uniform-randomly), and count the number of paths on each link, for both ECMP and k-shortest paths.
Our implementation is in Python—which we chose for ease of implementation—and uses three classes:
- Node –> keeps a list of edges connected to it, and allows you to add an edge
- Edge –> keeps a tuple of the nodes it connects (left node, right node), and a tuple of the number of paths that pass through it (paths moving right-to-left, paths moving left-to-right)
- Path –> keeps the start and end node, and all the edges that constitute the path
Because we are doing graph processing rather than network emulation/simulation, our results should be highly reproducible. We provide the graph topology we generate our plots from; our code also permits ready generation of another topology, which should provide matching results.
For each (send, receive) server pair in the random permutation traffic matrix, we iteratively run breadth-first search (see code for nuances) to find either the m equal-cost paths or the k shortest paths. We then step through each edge in the graph and count the number of left-to-right paths and right-to-left paths it sustains; thus, each edge produces two data points in the plot, which is consistent with the caption in Fig. 9 (“Each network cable is considered as two links one for each direction”).
We did not run into any unusual challenges in the graph processing implementation (graph processing always requires diligent coding….we got stung by an integer-division bug!), but we did run into a problem running it on EC2: the run for k=14 takes about a day, even with high compute resources! So we had to plan our runs carefully before launching them. We did not optimize the code (in order to avoid bugs), but an evaluation of the path search algorithm indicates it cannot be pruned much anyways—a problem it would appear the Jellyfish authors grappled with as well, as their simulations involving multiple path exploration had servers numbered in the hundreds or low thousands.
Did We Manage to Replicate the Result?
Differences Between Probabilistically/Deterministically Even Topologies: As expected, the probabilistically even topology had slightly more links experiencing few paths relative to the deterministically even topology (expected because the probabilistically even topology will do a slightly worse job of spreading servers over the switches than the deterministically even topology). This can be seen in a comparison of the two plots below, for k=10 ports/switch (the fewer the ports, the greater the discrepancy, per the law of large numbers). Because the deterministically even topology represents the optimal outcome of the probabilistically even topology, we only present figures for the deterministically even topology from this point forward.
Replication of the Result: Below is our plot of the paths-per-link distribution for k=14 ports/server, adjacent to the Fig. 9 result of Jellyfish. The distribution we found matches very well with the distribution found in Jellyfish.
PLOT OF K=14 COMPARED TO FIG.9
Now that we have replicated the result, we discuss why we think the measure presented in the result is misleading, and we implement what we think is a more appropriate measure.
Is the Result the Right Thing to Measure?
Keep the Measure, Change the Axis?: The authors state “ECMP performs poorly for Jellyfish, not providing enough path diversity”, and point to Fig. 9 as a demonstration of this, highlighting that ECMP links support fewer paths than the k-shortest paths links. But we think this overlooks a critical point: if each pair of sender–>receiver servers results in 8 paths for k-shortest paths, and a number of paths strictly less than 8 for ECMP, then the cumulative number of paths is not equal between the two cases. It remains valid to compare how paths are distributed across the edges in the k-shortest paths and ECMP cases, but it seems misleading to compare those distributions on an absolute scale of # of Distinct Paths link is on.
Instead, the scale should be normalized. But how to normalize? Not only does k-shortest paths result in more paths cumulatively in the graph, it also causes paths to be longer, and a longer path is counted by more links than a shorter path. So to normalize, we count the total number of path counts in the graph (i.e. we initialize total_count=0, iterate through the edges, and for each edge perform total_count+=num_paths_on_link), then divide each link’s path count by total_count. We perform this scale normalization for k=12 ports; the plot is shown below.
The normalization allows us to see that the paths-per-link distributions of ECMP and k-shortest paths are actually very similar. It remains true that “ECMP provides less path diversity than k-shortest paths”—the k-shortest paths distribution better approximates a uniform distribution, which represents optimal path diversity—but the discrepancy is quite small!
Change the Measure: Examining paths-per-link really serves as a proxy for the byte throughput being asked of a link—i.e. the more paths through a link, the more flows contending for its use. But as we’ve seen, this proxy presents issues when trying to compare the k-shortest paths and ECMP cases. Instead, we propose measuring the byte throughput being asked of each link directly, which would allow us to compare the two cases without interpretation issues. This requires us to make a number of assumptions:
- All links in the network operate at the same rate.
- All server pairs aim to transmit the same amount of data.
- MPTCP assigns the same amount of data to each of the paths provided it.
We suspect the Jellyfish authors avoided measuring byte throughput—and instead measured number of paths—because they wanted to avoid making such assumptions, which can be readily questioned; e.g. data center workloads typically are not uniform, so the server pairs would not all transmit the same amount of data. But we suggest that because the purpose of the measure is to provide a relative comparison between k-shortest paths and ECMP, as long as our assumptions are not far removed from the actual operating environment of the data center, and any distortions introduced by the assumptions bias both cases equally, then our throughput measure provides better support for choosing k-shortest paths over ECMP than the path number measure.
With these assumptions in place, we do the following:
- Iterate through all server pairs in the traffic matrix; for each server pair, do the following:
- Determine the n paths connecting sender–>receiver (n=k for k-shortest paths, and n=# of equal-cost shortest paths for ECMP)
- Iterate through all n paths; for each path, do the following:
- Step through every link along the path; for each link, add 1/n to the throughput being requested of the link
This provides a direct measurement of the byte throughput being asked of each link, and can be compared directly between the k-shortest paths and ECMP cases. We calculate such a plot for k=12 ports; the plot is shown below.
The k-shortest paths curve is closer to a uniform distribution than is the ECMP curve, meaning using an ECMP routing protocol will result in more links being highly contested, causing congestion and reduced throughput. Thus, this plot again confirms the thesis presented by the Jellyfish authors that “ECMP provides less path diversity than k-shortest paths”—but we believe it does so in a more rigorous way than the paths-per-link measure.
Jellyfish proposes a random graph at the TOR switch layer, permitting excellent extendability relative to the structured topologies (e.g. fat-tree) prevalent in the modern data center. Jellyfish claims such a random topology, when paired with k-shortest paths routing and MPTCP congestion control, actually outperforms a fat-tree topology in terms of per server average throughput. MPTCP is typically paired with ECMP routing, but Jellyfish found that using ECMP on its random topology caused per server average throughput less than that of fat-tree. In an effort to explain this, Jellyfish presented Fig. 9, showing the paths-per-link distribution of k-shortest paths versus ECMP, and concluded that “ECMP provides less path diversity than k-shortest paths”.
We wrote a Python program that replicated Fig. 9. We then highlighted why we felt this result was an improper comparison, and performed a normalization to permit a proper comparison; this proper comparison still supported the conclusion that “ECMP provides less path diversity than k-shortest paths”, but the discrepancy between the two was much reduced. We then went one step further, and argued that the paths-per-link metric was sub-optimal, and proposed a better metric: throughput requested of a link. Using this metric, we presented a different plot. This plot again supported the conclusion that “ECMP provides less path diversity than k-shortest paths”, but did so in what we consider to be a more rigorous, logical manner.
To run our code, access the Github repo at https://github.com/kggriswold/CS244-15-The-Intuition-Behind-Why-a-Randomly-Networked-Data-Center-Works. Download the random-graph.py file, and execute it according to the README.