CS244 ‘17: DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers


Team: Saachi Jain and Mitchell Dumovic

Key Result: DCell is a recursive, scalable network topology that recovers quickly from link and server failures.

Source(s): Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and Songwu Lu. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In Proceedings of ACM SIGCOMM, 2008. https://www.microsoft.com/en-us/research/publication/dcell-a-scalable-and-fault-tolerant-network-structure-for-data-centers/

Contacts: saachi@cs.stanford.edu, mdumovic@cs.stanford.edu

I. Introduction

DCell is a network structure, designed in 2008, that was built to improve on the data center network (DCN) structures in use at the time. In particular, DCell was built with three main goals in mind: scaling, fault tolerance, and high network capacity. When DCell was written, the predominant network structures were tree-based, and improving on this tree-based structure for data centers was important for two main reasons. First, the size of data centers was rapidly increasing, and the tree structures of the time could not support the scale needed by large corporations like Google and Microsoft. Second, the same tree structure was unable to meet increasing demands for network bandwidth driven by the growing popularity of high-bandwidth applications and algorithms like MapReduce.

The defining characteristic of DCell is its recursive structure. DCell0 is the basic building block, composed of a small set of servers and a single mini-switch. A DCell1 is formed by connecting many of these DCell0 blocks together, a DCell2 by connecting many DCell1 blocks, and so on. Figure 1 below shows the example topology for a DCell1 graph. This is the topology used in the experiments described later and the one we replicate.

[Figure 1: The DCell1 topology (n = 4, 20 servers) used in the experiments, from the original paper]
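To make the recursive construction concrete, here is a minimal Python sketch (our own illustration, not code from the paper or from our repository) that enumerates the links of a DCell built from n-server DCell0 blocks with l levels; dcell_links(4, 1) yields the 20-server DCell1 shown in Figure 1.

# Illustrative sketch of the recursive DCell construction (not the paper's code).
# Servers are named by index tuples; e.g. (4, 3) is server 3 in DCell_0 number 4.

def dcell_links(n, l):
    """Return the list of links in a DCell_l built from n-server DCell_0s."""
    t, g = [n], [1]            # t[k]: servers per DCell_k; g[k]: DCell_(k-1)s per DCell_k
    for k in range(1, l + 1):
        g.append(t[k - 1] + 1)
        t.append(g[k] * t[k - 1])

    def uid_to_tuple(uid, level):
        """Expand a flat server id within a DCell_level into per-level indices."""
        if level == 0:
            return (uid,)
        return (uid // t[level - 1],) + uid_to_tuple(uid % t[level - 1], level - 1)

    links = []

    def build(prefix, level):
        if level == 0:
            for i in range(n):                 # DCell_0: n servers on one mini-switch
                links.append((('sw',) + prefix, prefix + (i,)))
            return
        for i in range(g[level]):              # recursively build the sub-DCells ...
            build(prefix + (i,), level - 1)
        for i in range(t[level - 1]):          # ... then join every pair of them with one link
            for j in range(i + 1, g[level]):
                links.append((prefix + (i,) + uid_to_tuple(j - 1, level - 1),
                              prefix + (j,) + uid_to_tuple(i, level - 1)))

    build((), l)
    return links

print(len(dcell_links(4, 1)))                  # 20 server-to-switch links + 10 level-1 links = 30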

DCell’s recursive structure allows individual servers to be interconnected using only mini-switches, which brings several benefits. The absence of expensive core switches and routers saves money and prevents the network from having bandwidth bottlenecks at those high-level switches. DCell also incorporates a novel routing algorithm and a fault-tolerant routing scheme called DFR. DFR distributes network traffic evenly among servers and links in DCell, so DCell does not suffer from single points of failure the way the usual tree structure does. Additionally, the recursive structure is extremely scalable: even a low-level DCell can support on the order of millions of servers without using expensive core switches.

II. Evaluation and Results Summary

The authors evaluate their proposed topology in both simulated and real environments. In the simulations, they create two DCell topologies (one with 176,820 nodes and the other with 865,830 nodes). They then have a randomly selected node ping all of the other nodes under randomly generated failures, calculating both the path failure ratio and the average path length. Over 20 simulation runs, they compare the performance of DFR to shortest-path-first (SPF) routing. They find that DFR performs almost identically to SPF under node failures, and performs better than SPF for larger numbers of nodes. Moreover, the difference in path length between DFR and SPF is relatively small. The simulation results generally show that DFR is robust at a low cost in path length.

The authors test DCell’s robustness to failure and its throughput using two separate experiments on a real testbed. In the first experiment, which tests robustness, a 20-server DCell topology is built and a TCP connection is set up between two servers in different DCell0 clusters. The throughput over this connection is measured over time. To test link-failure behavior, a link is manually unplugged and later re-plugged. Similarly, to test robustness to server failure, an intermediate server is shut down and later restarted. The authors find that DCell recovers quickly from both failures because it can detect and adapt to them quickly. Moreover, DCell adapts more quickly to link failures than to server failures because of the medium sensing used to detect neighbor states. The TCP throughput between two hosts whose normal path traverses the failed link or server is plotted to show DCell’s robustness to failures. The original figure from the paper is displayed in Figure 2.

[Figure 2: TCP throughput under link and server failures, from the original paper]

In the second experiment, the authors test the throughput of the system using 20 servers arranged in either a DCell or a two-level tree structure. Each server establishes a connection to each of the other 19 servers (an all-to-all traffic pattern) and sends 5GB of data over each connection. The aggregate throughput is then measured for each topology. The authors find that DCell achieves roughly twice the aggregate throughput of the tree topology. The original results from the paper are displayed in Figure 3.

[Figure 3: Aggregate throughput of DCell versus the two-level tree under all-to-all traffic, from the original paper]

We chose to reproduce Figure 2 and the DCell curve of Figure 3 for this assignment. We believe these two experiments test the robustness of DCell and its behavior under congestion, both of which are key to its usefulness as a scalable data center topology.

III. Methods

We ran our two experiments on a Google Cloud instance running Linux. We used Mininet to construct the DCell topology displayed in Figure 1, which is the same topology used in both experiments. We chose Mininet because it allowed us to rapidly develop and simulate the topology without needing multiple servers. It also integrates well with POX, which we used to program the controller.

One limitation of Mininet is that it does not allow hosts to forward packets, so we modified the original DCell topology slightly by attaching a switch to each individual host. This switch plays the same forwarding role that the servers play in the original paper, and routes packets to its attached host when needed. A diagram of this replicated topology is displayed in Figure 4. We chose to use 100Mbps links instead of the original paper’s 1000Mbps links because Mininet could not handle the extreme amount of traffic of experiment 2 at the higher rate. As a result, our results are scaled down by approximately a factor of 10 compared to the original paper’s.

[Figure 4: Our replicated DCell1 topology in Mininet, showing the original path from h00 to h43 and the rerouted path after the link/server failure]
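Below is a minimal sketch of how this modified topology can be expressed in Mininet. The class and node names (DCellTopo, h{i}{j}, s{i}{j}, m{i}) are illustrative conventions rather than our exact source, and the bandwidth limit assumes the network is built with TCLink.

# Rough sketch of the modified DCell_1 topology (n = 4): each server from the
# paper becomes a Mininet host plus a small switch that forwards on its behalf.

from mininet.topo import Topo

N = 4           # servers per DCell_0
G = N + 1       # number of DCell_0s in the DCell_1

class DCellTopo(Topo):
    def build(self):
        opts = dict(bw=100)                             # 100 Mbps links, scaled down from 1 Gbps
        for i in range(G):
            mini = self.addSwitch('m%d' % i)            # the DCell_0 mini-switch
            for j in range(N):
                host = self.addHost('h%d%d' % (i, j))
                sw = self.addSwitch('s%d%d' % (i, j))   # stands in for the server's forwarding
                self.addLink(host, sw, **opts)
                self.addLink(sw, mini, **opts)
        for i in range(N):                              # level-1 links: one per pair of DCell_0s
            for j in range(i + 1, G):
                self.addLink('s%d%d' % (i, j - 1), 's%d%d' % (j, i), **opts)

# The network is then built with something like
#   Mininet(topo=DCellTopo(), link=TCLink, controller=RemoteController, ...)
# so the bw option takes effect and our POX controller handles the routing.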

We implemented the DCellRouting and Local-reroute and proxy sections described in the paper using a custom POX controller running over OpenFlow. There was no need to implement the Broadcast section of the paper, because the OpenFlow controller has access to the global state of the network and discovers link failures immediately. Additionally, there was no need to implement the Jump-up for Rack Failure section, because the first experiment tests only link and server failure, not rack failure.
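For reference, here is a condensed sketch of the routing logic for the l=1 case we implemented; the function names (get_link, dcell_route) and the tuple naming of servers are our own illustration, not the controller code itself.

# Illustrative sketch of DCellRouting for the l = 1, n = 4 topology we use.
# Servers are (a, b) tuples: server b in DCell_0 number a.

def get_link(i, j):
    """The (node in DCell_0 i, node in DCell_0 j) endpoints of the single
    level-1 link that joins DCell_0 i and DCell_0 j."""
    if i < j:
        return (i, j - 1), (j, i)
    n2, n1 = get_link(j, i)
    return n1, n2

def dcell_route(src, dst):
    """Servers visited on the DCellRouting path from src to dst."""
    if src == dst:
        return [src]
    if src[0] == dst[0]:
        return [src, dst]                  # same DCell_0: one hop via the mini-switch
    n1, n2 = get_link(src[0], dst[0])      # the link joining the two DCell_0s
    return dcell_route(src, n1) + dcell_route(n2, dst)

# Example: the pre-failure path used in experiment 1.
print(dcell_route((0, 0), (4, 3)))         # [(0, 0), (0, 3), (4, 0), (4, 3)]

# Local-reroute/proxy replaces a failed (n1, n2) hop with a detour through a
# working level-1 link into a neighboring DCell_0; we omit that logic here.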

To reproduce experiment 1, we started an iperf connection from h00 to h43. At 34s, we brought the link between s03 and s40 down, and at 42s we brought it back up; this tested link failure. To replicate server failure, we bring down all links attached to the dropped server (thus removing it from the topology); as described in the paper, we shut down server s0 at 104s. Figure 4 shows both the original path and the rerouted path after the server/link failure. The original paper uses a heartbeat message once per second to broadcast link failures, so link failures are only detected after a full second has passed, and host failures are only detected after a five-second link timeout has occurred. The OpenFlow controller, however, detects link failures immediately. To model the delayed detection, we introduce an artificial delay of one second for link failure and five seconds for node failure; without it, the OpenFlow controller would react much more quickly to link and node failures.
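The sketch below shows the shape of our experiment 1 driver (the helper names are illustrative, not our exact script): a long iperf flow from h00 to h43, the s03-s40 link downed at 34s and restored at 42s, and a server failure at 104s simulated by downing every link on that server's switch.

# Rough sketch of the experiment-1 driver; assumes a Mininet `net` built from
# the DCell topology above, with host/switch names h00, h43, s03, s40, ...

from time import sleep

def fail_server(net, sw_name, status='down'):
    """Simulate a server failure by downing every link attached to its switch."""
    sw = net.get(sw_name)
    for intf in sw.intfList():
        if intf.link:                                         # skip interfaces with no link
            peer = intf.link.intf2 if intf.link.intf1 == intf else intf.link.intf1
            net.configLinkStatus(sw_name, peer.node.name, status)

def run_experiment_1(net, failed_server):
    h00, h43 = net.get('h00'), net.get('h43')
    h43.cmd('iperf -s &')                                     # receiver
    h00.cmd('iperf -c %s -t 160 -i 1 > exp1.txt &' % h43.IP())
    sleep(34); net.configLinkStatus('s03', 's40', 'down')     # link failure at 34s
    sleep(8);  net.configLinkStatus('s03', 's40', 'up')       # link restored at 42s
    sleep(62); fail_server(net, failed_server)                # server failure at 104s
    sleep(56)                                                 # let the flow finish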

To reproduce experiment 2, we started an iperf connection from every host to every other host at the same time. To measure total throughput, we summed the throughput reported by each host in each one-second interval. While the paper sent 5GB per connection, we sent 250MB because Mininet cannot handle such a large load.
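A rough sketch of the all-to-all driver is shown below (again with illustrative names rather than our exact script): every host runs an iperf server, then opens a 250MB client connection to each of the other 19 hosts, writing per-second reports to files that we later sum.

# Rough sketch of the experiment-2 driver (all-to-all traffic on 20 hosts).

def run_experiment_2(net):
    hosts = net.hosts
    for h in hosts:
        h.cmd('iperf -s &')                            # one iperf server per host
    for src in hosts:
        for dst in hosts:
            if src is not dst:
                # -n 250M: send 250 MB; -i 1: report throughput once per second
                src.cmd('iperf -c %s -n 250M -i 1 > %s_to_%s.txt &'
                        % (dst.IP(), src.name, dst.name))
    for h in hosts:
        h.cmd('wait')                                  # block until all clients finish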

IV. Results

Figure 5 shows a side-by-side comparison of our results with Figure 2, which displays the original results for experiment 1 in the paper. As in the paper, we see two drops in throughput, at the 34-second and 104-second marks, when the link and node failures occur. Throughput recovers quickly once the artificial detection delay has elapsed and our controller updates the flow tables to reroute packets. The overall throughput is scaled down by approximately a factor of 10, which is expected given the lower-bandwidth links we used in Mininet.

[Figure 5: Side-by-side comparison of our experiment 1 throughput with the paper’s original results (Figure 2)]

To show that our routing algorithm performs as expected, we also looked at the number of packets passing through the alternate route over time. Our results are displayed in Figure 6. As expected, packets begin flowing through the alternate route at around 34 seconds when the link is dropped, stop using it after the link is restored, and start using it again once the server is dropped at 104 seconds, continuing that way until the end of the experiment.

[Figure 6: Packets traversing the alternate route over the course of experiment 1]
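For reference, per-link packet counts like these can be sampled in Mininet by polling the kernel's interface counters once per second; the helper below is an illustrative sketch, not the exact monitoring code in our repository.

# Illustrative sketch: sample the tx_packets counter of the interface on
# switch sw_a that faces switch sw_b, once per second.

from time import sleep

def sample_link_packets(net, sw_a, sw_b, seconds, out_file='alt_link.txt'):
    a, b = net.get(sw_a), net.get(sw_b)
    intf = a.connectionsTo(b)[0][0]        # sw_a's interface on the sw_a <-> sw_b link
    counter = '/sys/class/net/%s/statistics/tx_packets' % intf.name
    with open(out_file, 'w') as f:
        for t in range(seconds):
            f.write('%d %s\n' % (t, a.cmd('cat ' + counter).strip()))
            sleep(1)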

Figure 7 shows a side-by-side comparison of our results with Figure 3, which displays the original results for experiment 2 in the paper. In our experiment we only reproduced the curve for the DCell topology. Our plot shows an expected decrease in overall throughput, due to the fact that we reduced the amount of data sent between each pair of hosts from 5GB to 250MB.

[Figure 7: Our DCell aggregate throughput for experiment 2 (compare with the paper’s Figure 3)]

V. Challenges

In the paper, the authors present formalisms for identifying switches and pseudo-code for both building the DCell topology and implementing routing. However, while the DCell topology is meant to be flexible enough to allow any n (number of hosts in one DCell0) and any l (number of levels of DCells), many of the protocol descriptions are specific to l=1. For example, when specifying routing in Section 4.1.1, the authors define the GetLink function in terms of switches with two-part IDs, which is only the case when l=1. While these discrepancies did not pose a serious issue in reproducing the results, because the original experiments were run with n=4 and l=1, following these hard-coded descriptions reduced the flexibility of our code.

On a more practical level, our greatest challenge was implementing the POX controller. The POX documentation is sparse, and there is no comprehensive list of event handlers, flags, and state descriptors, so we often had to dig through the POX source code to find the handlers and flags we needed. Moreover, POX is relatively finicky when run with Mininet, especially on VirtualBox. Once we switched to an instance on Google Cloud, our results became significantly more consistent.
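For anyone attempting something similar, the skeleton below shows the basic POX handler pattern we relied on (it is an illustrative skeleton, not our full controller): register listeners on core.openflow and react to ConnectionUp and PortStatus events, the latter being how the controller learns that a link has gone down.

# Minimal POX component skeleton (illustrative, not our full controller).
# Run with: ./pox.py <this_module_name>

from pox.core import core

log = core.getLogger()

class DCellController(object):
    def __init__(self):
        core.openflow.addListeners(self)   # wires up the _handle_* methods below

    def _handle_ConnectionUp(self, event):
        log.info("switch %s connected", event.dpid)
        # install the static DCellRouting flow entries for this switch here

    def _handle_PortStatus(self, event):
        # a port-status change reported by a switch is how we learn a link failed
        log.info("port %s changed on switch %s", event.port, event.dpid)
        # recompute affected paths and install local-reroute flow entries here

def launch():
    core.registerNew(DCellController)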

VI. Conclusion

Overall, while scaled down, our results matched the trends reported in the paper. Specifically, we found that DCell recovers quickly from link and server failures (experiment 1) and that congestion on a DCell topology does not stall all flows at once (experiment 2).

VII. Reproducing Our Experiment

  1. Create a Google Cloud instance running Ubuntu/Linux (we used n1-standard-4, with 4vCPUs and 15GB memory).
  2. Install Mininet (we found that the easiest way to do this is to install from source, following the instructions at http://www.brianlinkletter.com/how-to-install-mininet-sdn-network-simulator/)
  3. Install matplotlib (run sudo apt-get install python-matplotlib)
  4. Clone our repository (run git clone https://github.com/MitchellDumovic/cs244-assignment3.git)
  5. Navigate into the repository (run cd cs244-assignment3)
  6.  In the home directory of the repository is a script called run_experiment.sh. This script will run both experiments and plot the results. Note that in order to run this experiment, you need root access for the instance. To run the script, run:
    1. chmod +x run_experiment.sh
    2. ./run_experiment.sh
  7. The script will generate 3 figures into the repository’s home directory. The figures are:
    1. throughput.png: (reproduction of Figure 5 above)
    2. alternative-throughput.png: (reproduction of Figure 6 above)
    3. experiment_2_throughput.png: (reproduction of Figure 7 above)

One response to “CS244 ‘17: DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers”

  1. Reproducibility score: 5
    Besides the alternative-throughput (fig6) plot that was slightly different after it booted up at the end (past the 104s mark), the main plots match pretty much exactly in both scale and shape. Some reproduction directions suggestions: perhaps put all of the setup steps (installing mininet and matplotlib) into a script file to make it easier on the reviewer, and while it is nice that the program gives periodic status updates on what’s going on and the runtime is not very long, it would’ve been nice to be provided a rough estimate of the total runtime of the program so I know what to expect in advance. Your project was interesting and definitely shows both the high throughput and resilience to failure aspects of the DCell design.

    – Melanie Ren
