CS244 ’13: DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers


Team: Jim Liu and Michael Chang

Key Result: The DCell topology allows quick recovery from link and node failures.

Source: http://research.microsoft.com/pubs/75988/dcell.pdf

Contacts: Jim Liu (jimliu at stanford dot edu), Michael Chang (mchang91 at stanford dot edu)

Introduction

One of the most difficult challenges in data center networking is how to efficiently connect an exponentially increasing number of end hosts. DCell proposes a distinctive network structure for data center networks. The structure is recursive: large DCells are constructed from smaller DCells. At each level the sub-DCells are fully connected, with exactly one link between each pair of DCells at the same level. This structure provides a large amount of fault tolerance and supports routing over paths that are close to shortest.
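To make the level-1 wiring concrete, the small sketch below (our own illustration, not the paper’s pseudocode) enumerates the cross-DCell links of a DCell1 built from (n+1) DCell0s of n servers each, following the paper’s rule that server [i, j-1] in DCell0 #i is wired to server [j, i] in DCell0 #j:

    def dcell1_links(n=4):
        """Enumerate the level-1 links of a DCell1 with (n+1) DCell0s of n servers each."""
        return [((i, j - 1), (j, i))
                for i in range(n + 1)
                for j in range(i + 1, n + 1)]

    # For n = 4 (a 20-server DCell1) this yields 10 links, one per pair of
    # DCell0s, including ((0, 3), (4, 0)).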

For our project, we replicated Figure 11 from the DCell paper by Guo et al. In the paper, the authors used a physical testbed and manually disconnected links and nodes to demonstrate the robustness of a DCell network. Instead of a physical testbed, we used Mininet to create the topology and emulate link/node failures, with RipL-POX implementing the routing between different DCells. We evaluated whether an emulated topology running in Mininet on EC2 can maintain robust throughput after links and nodes are removed, and how much throughput it achieves.

Goals and Motivation

DCell is motivated by several design goals for data center networking (DCN): scalability, fault tolerance, and high network capacity. Data centers are growing exponentially, and their high bandwidth requirements mean that existing tree structures have difficulty meeting these goals. The paper proposes DCell as a recursively defined network structure for data centers and proceeds to show its efficacy through experimental results.

DCN as a whole is an interesting problem because of the data center’s growing importance in the Internet landscape. Cloud computing companies hold more of our data than ever before, and existing tree-based structures run into bandwidth bottlenecks at the top-of-rack and core switches. To sustain a tree structure, progressively more expensive, higher-capacity switches are needed as the number of servers grows exponentially. Finally, tree structures inherently have a single point of failure at the core switch level. All of these issues motivate the authors to propose DCell as a means of alleviating the challenges of DCN.

DCell topology

Results

The authors of DCell analyzed several aspects of the topology and the routing protocol. One result compares DCell Fault-tolerant Routing (DFR) with Shortest-Path routing (SPF) under both node and link failures. For node failures, the path failure ratios of the two routing protocols are very close. Under link failures, however, SPF maintains a near-zero path failure ratio, while DFR’s path failure ratio increases with the failure rate. This is because DFR makes only local rerouting decisions and is not optimal at the global level.

Experiments were also run on an operational testbed of 20 nodes in a DCell1 configuration. A TCP connection was set up between nodes [0,0] and [4,3], with the initial path consisting of [0,0], [0,3], [4,0], [4,3]. Link failure was simulated by manually unplugging the link ([0,3], [4,0]) and replugging it several seconds later. An intermediate server on the path was then shut down to assess the impact of node failure. In both cases, the resulting path is [0,0], [1,0], [1,3], [4,1], [4,3]. The throughput plot presented in the paper indicates a very quick recovery after the link failure, followed by a slightly longer recovery after the node failure.

TCP throughput with node and link failures

Subset Goals and Motivation

For our project, we chose to reproduce Figure 11 from the paper, which demonstrates TCP throughput under node and link failure. The main reason was to validate the robustness of the DCell topology and the local-reroute protocol. Additionally, we were interested in seeing how Mininet/POX responds to traffic on a fairly large topology. Because it would be infeasible to acquire 20 physical servers to run the experiment, we used Mininet to emulate the hosts. For our topology and routing, we used RipL-POX, since it provides an easy means of interfacing with the POX OpenFlow controller and a good introduction to SDN as a whole. The graph from the paper indicates a very quick recovery, so it is also interesting to validate this performance in an emulated environment.

Subset Results

Our results were similar to those in the paper. Throughput drops for both the link and node failures and recovers quickly, at 34s and 108s respectively. There were, however, differences between the paper’s environment and ours, which we describe in the Platform section below.

Experimental throughput with link and node failure

Experimental throughput for two different links as link state changes

Platform

Our DCell setup under Mininet closely resembles the description in the paper, with a few important differences. First, because hosts in a DCell also act as routers, we represent each host as a host/switch pair. This allows us to implement the routing algorithm in OpenFlow rather than as software on the host. Also, due to limitations in Mininet, we noticed that TCP traffic could not achieve line rate when links were configured with a bandwidth of 1 Gbps (it generally reached about 500 Mbps), so we reduced link speeds to 100 Mbps, except for the links directly connecting a host to its switch.
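A minimal sketch of how such a host/switch-pair topology can be expressed in Mininet appears below; our actual topology is defined through RipL, so the class and node names here are illustrative only:

    from mininet.topo import Topo

    class DCellHostSwitchTopo(Topo):
        """Illustrative sketch: each DCell server is modeled as a host/switch
        pair so that per-server routing can live in OpenFlow rules."""
        def build(self, cells=5, servers=4):
            for c in range(cells):
                for s in range(servers):
                    # 1-based names so Mininet derives a nonzero DPID.
                    host = self.addHost('h%d%d' % (c + 1, s + 1))
                    switch = self.addSwitch('s%d%d' % (c + 1, s + 1))
                    # Host-to-switch links are left unrestricted.
                    self.addLink(host, switch)
            # The DCell0 mini-switches and the 100 Mbps inter-switch links
            # (within and between DCell0s) would be added here with
            # addLink(..., bw=100); bw takes effect when the network is
            # built with link=TCLink.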

We implemented the routing algorithm using RipL and RipL-POX. RipL provides a simple interface to define routes between any two nodes in a network, and RipL-POX acts as an interface between RipL and POX, a controller for OpenFlow switches. This allows us to write our routing algorithm assuming a high-level view of the network, without considering each individual switch and what entries need to be added to its flow tables.

In addition to the DCellRouting algorithm described in the paper, we implement local-reroute to handle link and node failures. Because we have a centralized view of the network, we do not need to implement the link state algorithm to update each switch’s view of the network.
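For reference, here is a compact sketch of DCellRouting specialized to the two-level DCell we emulate; node IDs are (cell, server) tuples, and the simplification is ours rather than the paper’s general pseudocode:

    def dcell_route(src, dst):
        """DCellRouting for a two-level DCell (DCell1). Nodes in the same
        DCell0 reach each other through the DCell0 mini-switch; otherwise
        the path crosses the single level-1 link (i, j-1) <-> (j, i) that
        joins the two DCell0s."""
        if src == dst:
            return [src]
        if src[0] == dst[0]:                # same DCell0: one hop
            return [src, dst]
        i, j = sorted((src[0], dst[0]))
        n1, n2 = (i, j - 1), (j, i)         # endpoints of the level-1 link
        if src[0] == j:                     # orient from src's cell to dst's cell
            n1, n2 = n2, n1
        return dcell_route(src, n1) + dcell_route(n2, dst)

    # dcell_route((0, 0), (4, 3)) -> [(0, 0), (0, 3), (4, 0), (4, 3)],
    # the initial path in the paper's experiment.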

Also, Mininet does not seem to easily support directly failing a node or switch (we return to this under Challenges). Therefore, we model node failure by failing the link connecting the switch to the network (failing one link is sufficient to eliminate the path) and waiting out the 5-second “link state timeout” used in the paper. This timeout represents the time it would take for neighboring nodes to realize that the failed node is no longer responding and update their state in response.
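In Mininet this amounts to something like the sketch below; configLinkStatus() is the standard Mininet call for taking a link down, and the node names are hypothetical:

    import time

    def fail_node(net, switch, neighbor, link_state_timeout=5):
        """Emulate a node failure by downing the link that carries its traffic
        and waiting out the 5-second link-state timeout from the paper."""
        net.configLinkStatus(switch, neighbor, 'down')
        time.sleep(link_state_timeout)   # time for neighbors to "notice" the dead node

    # e.g. fail_node(net, 's14', 's51')   # hypothetical switch names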

Challenges

We encountered a number of challenges while attempting to reproduce the results of the DCell paper. First, in terms of the algorithm itself, we were slightly confused by the paper’s description of local reroute. The paper states, “Now assume an intermediate link (n1, n2) has failed. Local-reroute is performed at n1 […].” However, in the description of the new path after the link ([0, 3], [4, 0]) has failed, the paper says that [0, 0] will send the packet to [1, 0] directly. It seems that local reroute would have to be performed at [0, 3], meaning the packet would go from [0, 0] to [0, 3], then back to [0, 0] to be routed to [1, 0]. We chose to implement the path described in the experiment, as we were concerned that [0, 0] -> [0, 3] -> [0, 0] would cause a routing loop.
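A simplified sketch of the reroute behavior just described (reusing dcell_route() from the earlier sketch; this is a two-level simplification, not our exact implementation):

    def reroute_around(src, dst, failed_link, cells=5):
        """Detour around a failed level-1 link by picking a proxy DCell0 that
        is not an endpoint of the failed link and routing src -> proxy -> dst."""
        bad_cells = {failed_link[0][0], failed_link[1][0]}
        for proxy in range(cells):
            if proxy in bad_cells or proxy == src[0]:
                continue
            # Entry point into the proxy cell over the level-1 link from src's
            # cell, per the construction rule [i, j-1] <-> [j, i].
            entry = (proxy, src[0]) if src[0] < proxy else (proxy, src[0] - 1)
            return dcell_route(src, entry) + dcell_route(entry, dst)[1:]
        return None

    # reroute_around((0, 0), (4, 3), ((0, 3), (4, 0)))
    #   -> [(0, 0), (1, 0), (1, 3), (4, 1), (4, 3)], matching the rerouted
    #   path reported in the paper.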

The most obvious implementation challenge we faced was in simulating node failures. We could not find an obvious way to disable a host or switch in Mininet, even though it seemed that the OpenFlow controller would be able to easily respond to such an event.

In addition to this, most of our implementation challenges involved figuring out how to adapt RipL and RipL-POX for our needs. RipL-POX appears to be implemented assuming certain characteristics about the topology, such as a notion of “layers.” While this makes sense for the Fat Tree topology implemented in RipL, it wasn’t immediately obvious how this model could be adapted for our topology. (In the end, we identified precisely the functionality that RipL-POX required from our topology and implemented only this functionality.)

Another issue that we did not anticipate is keeping POX and Mininet synchronized. Since POX runs in a separate process, it maintains a separate instance of the topology, which must be kept up to date with Mininet’s view of the topology. We solved this problem by synchronizing the two processes using files: when an event (like a link failure) was about to occur, one process would create a file, and the other would wait for the file to become readable. This allows events to occur in Mininet and POX almost simultaneously.
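The mechanism is as simple as it sounds; a minimal sketch, where the file path and polling interval are arbitrary:

    import os
    import time

    SYNC_FILE = '/tmp/dcell_sync'        # arbitrary path shared by both processes

    def signal_event():
        """Called by the process that triggers the event (e.g. the Mininet side)."""
        open(SYNC_FILE, 'w').close()

    def wait_for_event(poll_interval=0.05):
        """Called by the other process (e.g. the POX controller side)."""
        while not os.path.exists(SYNC_FILE):
            time.sleep(poll_interval)
        os.remove(SYNC_FILE)             # reset for the next event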

Our biggest implementation challenge, and the one that took the longest to resolve, was making sure that switches correctly responded to failures. Because the current implementation of RipL-POX does not check for link failure events, we needed a way to inform the controller that links went up or down. We decided to handle this in the topology, rather than explicitly writing a handler for OpenFlow events in the controller. When the experiment asks the topology to disable a link, the topology informs the controller, which should inform the switches.

We thought that we could simply flush the switch flow tables to handle this, but that turned out to be insufficient: packets already in the network would not be routed correctly. This was especially a problem for link-up events. With a link down, packets would simply disappear when they traversed the failed link, but when paths needed to change after all links came back up, packets seemed to get stuck in loops, completely stalling the network. We finally resolved this issue by keeping track of all of the flow entries we installed into the switches: after clearing the flow tables, we reinstalled all of these entries with information about the new route.
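In POX terms, the clear-and-reinstall step looks roughly like the sketch below, which uses POX’s standard ofp_flow_mod messages; the helper names and the bookkeeping of saved entries are simplified relative to our implementation:

    from pox.core import core
    import pox.openflow.libopenflow_01 as of

    switches = {}   # dpid -> Connection, populated from ConnectionUp events

    def _handle_ConnectionUp(event):
        switches[event.dpid] = event.connection

    def launch():
        core.openflow.addListenerByName("ConnectionUp", _handle_ConnectionUp)

    def reinstall_routes(new_entries):
        """Clear every switch's flow table, then push the recomputed entries.
        new_entries maps dpid -> list of (ofp_match, output_port) pairs."""
        for dpid, connection in switches.items():
            # Delete all existing flows on this switch.
            connection.send(of.ofp_flow_mod(command=of.OFPFC_DELETE))
            # Reinstall the entries for the new routes.
            for match, out_port in new_entries.get(dpid, []):
                msg = of.ofp_flow_mod()
                msg.match = match
                msg.actions.append(of.ofp_action_output(port=out_port))
                connection.send(msg)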

Analysis of Results

Our results confirm the authors’ argument that DCell is a fairly simple topology that can provide a high degree of fault tolerance in the event of link and node failures. Of course, these benefits must be traded off against the costs of DCell, including potentially higher latency due to multi-hop paths and the increased CPU load on hosts acting as routers.

One interesting implication of our emulation is the potential benefit of SDN for DCell. Because of the centralized view of the network that SDN provides, the DCell routing algorithm could be simplified, as link-state information would no longer need to be propagated. Furthermore, switches would not have to wait out the 5-second timeout to detect node failures; they could respond almost instantaneously.

Steps to Reproduce our Results

These steps should be performed on the CS244 Mininet EC2 instance. We have successfully performed this experiment on a c1.medium instance.

1. git clone https://bitbucket.org/seally_1186/dcell.git
2. cd dcell
3. git submodule update --init
4. ./run.sh
5. Wait until “Done with experiment. Please type quit().” is displayed. (This may be followed by “Killed.”, which can safely be ignored.)
6. Type “quit()”.
7. The plots are generated in plot.png and plot2.png.


One response to “CS244 ’13: DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers”

  1. We ran the experiment as detailed in the “Steps to Reproduce our Results” section and were able to obtain the graphs generated by the project team. In some of our runs, the network does not recover from the initial link failure, showing a transmission rate of 0 Mbps for approximately 70 seconds (between 45s and 115s). In all but one of our remaining runs, the network does recover to the equilibrium transmission rate, but the rate frequently oscillates with an amplitude of 3-5 Mbps around an equilibrium rate of 95 Mbps. These graphs closely resemble the graph from the original paper. Only one run (out of ten) shows the stable transmission rate that matches the group’s plot.

    The experiment’s instructions were very easy to follow and run. One recommendation would be to use the --no-cli flag when starting the POX controller, and then kill it programmatically when the experiment ends.

    Reproducibility score: 3.5

    – Ryan Wilson and Brian O’Connor
