DCell : A Scalable and Fault-Tolerant Network Structure for Data Centers

Team: Camille Lamy, Eric Mibuari, and Omid Mashayekhi

Key Results:

The DCell topology is resilient to link and node failures and provides higher bandwidth under heavy load than a tree topology, which leads to shorter flow completion times.

Original Paper: DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and Songwu Lu. SIGCOMM 2008.

Contacts (all email addresses are at stanford dot edu):

Camille Lamy: clamy, Eric Mibuari: mibuari, Omid Mashayekhi: omidm


DCell was introduced in 2008 as a structure to efficiently connect a large number of servers in a data center. As the number of servers in data centers grows exponentially, this problem has become increasingly important. The DCell authors address it with a distributed, recursive structure. Distributing the routing means there is no single point of failure, which makes the system more resilient to link and node failures, while the recursive, hierarchical construction provides scalability. Moreover, the structure lends itself to a recursive routing algorithm whose paths are close to shortest paths.

A DCell topology for 5 Cells of level 0, each containing 4 servers

The DCell structure is composed of basic units called DCells. Starting at level 0, n hosts are connected to a switch to form a level 0 cell. Each host is then connected to a host in a different level 0 cell. In this way n+1 level 0 cells are interconnected, and together they form a level 1 cell. Cells of higher levels are then constructed recursively. This structure is illustrated in the diagram above, taken from the original DCell paper.
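This recursive construction makes the server count grow doubly exponentially with the level. A minimal sketch of the recurrence (our own helper, not code from the paper):

```python
def dcell_servers(n, k):
    """Number of servers in a DCell_k built from n-server level-0 cells.

    A DCell_k is made of (t_{k-1} + 1) copies of DCell_{k-1}, where
    t_{k-1} is the server count of a DCell_{k-1}, so t_k = t_{k-1} * (t_{k-1} + 1).
    """
    t = n  # a DCell_0 is n servers on one switch
    for _ in range(k):
        t = t * (t + 1)
    return t

# With n = 4 as in the figure above, a DCell_1 holds 4 * 5 = 20 servers,
# and a DCell_2 already holds 20 * 21 = 420 servers.
print(dcell_servers(4, 1))  # 20
print(dcell_servers(4, 2))  # 420
```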
To validate their design, the DCell authors ran two experiments that we set out to replicate. The first experiment (results in Figure 11 of the paper) shows DCell's resilience to link and node failures. In this experiment, a link is first brought down and later brought back up; later still, a node is brought down. In each case the impact on throughput is minimal, and the topology recovers from both types of failure very quickly.
The second experiment (Figure 12) compares flow completion time and throughput between DCell and a standard tree topology. Each of 20 hosts initiates a connection to every other host and transmits 5 GB of data (380 flows and 1.9 TB shuffled in total). The same pattern is then run on a tree topology. The throughput of each connection is plotted as a function of time: DCell achieves twice the throughput of the tree topology and takes only half as long to transmit all the data.


We used Mininet to replicate these two experiments. We built the topology from level 0 DCells as building blocks. To get connectivity, we had to develop our own OpenFlow controller: we used POX as the base and wrote a new Python module implementing DCell's routing algorithm. The controller follows the method described in the paper and can detect link and node failures and reprogram the switches with new routing information. If the two communicating servers are in the same DCell, the message is simply forwarded through that DCell's switch. Otherwise, we find the host that has a link to the destination DCell and send it the packet, and that host forwards the packet into the destination DCell. This is exactly the algorithm described in the paper.
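For a single level, the routing step above can be sketched as follows. This is our own simplified illustration of the paper's routing for a DCell_1, with servers named by (cell, index) pairs; the function name and representation are ours, not the paper's pseudocode:

```python
def dcell1_route(src, dst):
    """Sketch of one-level DCell routing in a DCell_1.

    Servers are (cell, index) pairs. Per the paper's wiring, server
    (i, j-1) links to server (j, i) for i < j. If src and dst share a
    cell, the level-0 switch forwards directly; otherwise the path goes
    through the single cross link between the two cells.
    """
    (sc, _), (dc, _) = src, dst
    if sc == dc:
        return [src, dst]             # same DCell_0: one switch hop
    a, b = min(sc, dc), max(sc, dc)
    left, right = (a, b - 1), (b, a)  # endpoints of the cross link
    if sc > dc:
        left, right = right, left     # orient the link src -> dst
    path = [src]
    if src != left:
        path.append(left)             # reach the gateway in src's cell
    path.append(right)                # traverse the inter-cell link
    if dst != right:
        path.append(dst)
    return path

print(dcell1_route((0, 2), (3, 1)))   # [(0, 2), (3, 0), (3, 1)]
```

Higher levels apply the same step recursively: route to the link joining the two sub-cells, then recurse within each sub-cell.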

Implementation of DCell with a switch attached to each host

Because hosts cannot forward packets in Mininet, we had to attach a switch to each host to handle the forwarding. Apart from this per-host switch, we implemented the exact topology described in the paper. During the second experiment, though, we had to use 4 DCells of 3 servers each instead of 5 DCells of 4 servers each, because our controller could not handle the load. Since the DCell construction is general, with n+1 interconnected DCells of n servers forming a building block, we believe this smaller size should not change the shape of the results.
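As a sketch of how this modified topology is wired, the following helper enumerates the links of a DCell_1 with a companion switch per host. The naming scheme is hypothetical; in Mininet, a Topo subclass would create the same links with addHost, addSwitch, and addLink:

```python
def dcell1_links(n):
    """Enumerate the links of a DCell_1 topology with per-host switches.

    Since Mininet hosts cannot forward packets, each host h<c>_<i> gets
    a companion switch hs<c>_<i> that does the forwarding; s<c> is the
    level-0 switch of cell c. (All names here are illustrative.)
    Returns (host_switch_links, cell_links, cross_links).
    """
    # each host attaches to its companion switch
    host_sw = [('h%d_%d' % (c, i), 'hs%d_%d' % (c, i))
               for c in range(n + 1) for i in range(n)]
    # every companion switch hangs off its DCell_0 switch
    cell = [('hs%d_%d' % (c, i), 's%d' % c)
            for c in range(n + 1) for i in range(n)]
    # cross links: host (i, j-1) to host (j, i) for i < j, via companions
    cross = [('hs%d_%d' % (i, j - 1), 'hs%d_%d' % (j, i))
             for i in range(n + 1) for j in range(i + 1, n + 1)]
    return host_sw, cell, cross

hs, cell, cross = dcell1_links(4)
print(len(hs), len(cell), len(cross))  # 20 20 10
```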
The tree topology uses Mininet's default L2 behavior. Since there is no loop in this topology, we could rely on this mechanism, which floods the network with packets whenever the output port for a destination MAC address is not yet known. Our tree topology is a two-level tree with one root switch and four second-level switches, each connected to four hosts.
For the second experiment, we set up iperf receiver and sender processes at each of the hosts and then we started transmitting data packets from each of the 12 hosts to the other 11 hosts in the network. We therefore had 132 simultaneous connections.
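The traffic pattern can be sketched as follows; the commands are illustrative (iperf's -s starts a receiver, -n fixes the number of bytes to send) and the helper name is ours, not our exact script:

```python
def allpairs_iperf(hosts, size_mb=5):
    """Sketch of the all-to-all pattern: every host sends size_mb MB
    to every other host. Returns the receiver commands, the flow list,
    and the sender commands (all illustrative)."""
    servers = ['iperf -s &' for _ in hosts]        # one receiver per host
    flows = [(src, dst) for src in hosts for dst in hosts if src != dst]
    senders = ['iperf -c %s -n %dM &' % (dst, size_mb)
               for _, dst in flows]
    return servers, flows, senders

hosts = ['10.0.0.%d' % i for i in range(1, 13)]    # 12 hosts
_, flows, _ = allpairs_iperf(hosts)
print(len(flows))  # 132 simultaneous connections
```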

Link and Node Failures – measuring resilience:

The goal of our first experiment was to observe the behavior of DCell when faced with link and node failures. The diagram below shows our result side by side with the result of the original paper. As you can see, the controller detects the failures and updates switches with new routing information. The process from detection to rerouting is fast enough that the throughput fluctuation is almost negligible.

DCell behavior (throughput) when faced with failures. On top is the original paper’s result and on the bottom is ours.

Aggregate Throughput – measuring performance:

The goal of our second experiment was to compare the aggregate throughput of DCell and Tree, with 12 hosts each transmitting 5 MB of data over 1 Mb/s links.

To aggregate the throughput measurements, we needed a fair way to compare Tree and DCell. The raw measurements are written to a text file, and on inspecting it we found that the sampling is not uniform over time across the interfaces of a switch. To aggregate the measurements, we summed the throughput values over every host input port in each topology and plotted the two sums on the same graph.
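A simplified version of this aggregation, assuming each port's samples arrive as (timestamp, throughput) pairs; the function and names are ours, not our exact plotting script:

```python
def aggregate_throughput(samples, bin_s=1.0):
    """Sum per-port throughput samples into uniform time bins.

    `samples` maps a port name to a list of (timestamp, mbps) pairs.
    Because the monitor does not sample every interface at the same
    instants, each sample is bucketed into a bin of width bin_s seconds
    and bins are summed across ports, giving one aggregate time series.
    """
    bins = {}
    for port, series in samples.items():
        for t, mbps in series:
            b = int(t // bin_s)
            bins[b] = bins.get(b, 0.0) + mbps
    return [(b * bin_s, total) for b, total in sorted(bins.items())]

demo = {'hs0_0-eth0': [(0.1, 0.9), (1.2, 0.8)],
        'hs0_1-eth0': [(0.4, 0.7), (1.1, 0.6)]}
print(aggregate_throughput(demo))
```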

The results are shown in the graphs below.

Comparison of DCell and Tree aggregate throughput
The original graph (top) uses the paper's parameters: 20 hosts, 1 Gb/s links, 5 GB transmitted per connection. Our graph (bottom) uses: 12 hosts, 1 Mb/s links, 5 MB of data per connection.

Based on our results, we believe that we have managed to reproduce the general behavior that was described in the original paper, and that our experiments support the findings of the DCell authors. The scale is different because we did not manage to support the load of the n*(n-1) flows in Mininet. In the following section we explain this issue further.

Lessons learned:

We found the paper very clear and easy to read: the topology was well explained, and so was the routing algorithm. The whole design rests on a decentralized routing algorithm, which is what gives the system its scalability. Given the time constraints, however, we had to implement routing in a single controller, written in Python using POX. This scaled very poorly: in our experiments the controller was unable to cope with the load of 380 simultaneous flows and crashed when run with the experimental parameters of the original DCell experiment.

The main reason for this problem was that the controller could not run fast enough: the excessive latency in its communication with the switches caused false switch-failure detections. Out of the dozens of times we ran the experiment, the controller handled all 380 flows on only two or three occasions. We tried to find the most reliable parameters for the experiment, but the smoothness of the curves still depends on how congested the machine running the experiment is.
To accommodate the controller, we reduced the number of nodes and the link bandwidth in our experiments. We had first considered reducing the size of the data exchanged relative to the bandwidth, but this did not work. In our design, all hosts flood the controller with route requests; because the controller is written in Python, serving all of these requests takes long enough that the flows effectively start over an extended period rather than all at the same moment. If each flow carries too little data, some flows finish before others have even started. With 380 flows, we observed flow start delays as large as 20 s.

Because of these controller latency issues, we had to cut link bandwidth by a factor of 1000 so that the controller could handle the load, with a matching factor-of-1000 reduction in transmitted data size. We also reduced the number of flows to roughly a third of the original experiment's (380 down to 132). The end result: 12 hosts generating 132 flows in total, each flow transmitting 5 MB of data over 1 Mb/s links. These parameters are on a scale far smaller than anything in today's data centers, but the reduction allowed us to complete our project in a timely manner. With more time, implementing a fully decentralized controller in Mininet would probably have produced better results than a centralized controller running in POX.

Replication instructions :

All experiments were tested using the Ubuntu 12.04 AMI provided by the course staff, running in Amazon's Oregon region. If you are using it, please disable DCTCP and MPTCP using the following commands:

sudo sysctl -w net.ipv4.tcp_dctcp_enable=0

sudo sysctl -w net.ipv4.tcp_ecn=0

sudo sysctl -w net.mptcp.mptcp_enabled=0

Please download and install the pox controller and riplpox module as follows:
== Install ==
RipL, POX, and setuptools are the dependencies.

# Build RipL:
sudo apt-get install -y python-setuptools
git clone git://github.com/brandonheller/ripl.git
cd ripl
sudo python setup.py develop

# Building Ripcord-POX
cd ~/
git clone git://github.com/brandonheller/riplpox.git
cd riplpox
sudo python setup.py develop

# Building POX:
cd ~/
git clone git://github.com/noxrepo/pox.git

== Running on POX ==

A tarball containing our code can be found at https://github.com/clamy/cs244

The main file that contains our code is “cs244-12-dc”. We modified riplpox to make it work with the DCell topology. The repository also contains scripts that run the experiments, as well as scripts in the utility folder for plotting the figures.

Please replace the old riplpox module with ours:

cd ~/cs244-12-dc

mv riplpox.py ~/riplpox/riplpox/riplpox.py

1) Part one – Link and node failures:

– Before anything else, please run sudo mn -c.

– Run the script sudo ./dcell-exp1 in one window. Mininet will set up the topology and then ask you to press return once all the switches have connected. Do NOT press it yet!

– Open a second window and run ~/pox/pox.py riplpox.riplpox. The controller should detect all the switches; when it is done, it prints “Woo all switches are up”.

– When you see this message, go back to the experiment window and press return. Wait for the experiment to complete; it should take around three minutes.

– The result is saved as “rate.png” in the created folder.

2) Part two – DCell versus Tree rates:

– Before anything else, please run sudo mn -c.

– Run the script sudo ./dcell-exp2 in one window. Mininet will set up the topology and then ask you to press return once all the switches have connected. Do NOT press it yet!

– Open a second window and run ~/pox/pox.py riplpox.riplpox. The controller should detect all the switches; when it is done, it prints “Woo all switches are up”.

– When you see this message, go back to the experiment window and press return. Wait for the DCell experiment to complete; it should take around seven minutes. It will then ask you to press return after terminating the controller. Do NOT press return yet!

– In the controller window, press Ctrl-D to terminate the process and wait for it to shut down.

– Then, in the experiment window, press return and wait for the experiment to complete. It takes about another 14 minutes.

– The results are saved as “rate.png” in the created plotresults-* folder.

Note: please be patient; the experiments take some time to complete. Also, if you see errors about tcp_probe, do not panic: they are NOT a bad sign.


4 responses to “DCell : A Scalable and Fault-Tolerant Network Structure for Data Centers”

  1. The instructions were pretty clear and detailed, so I could get the setup running. I was able to replicate the graphs shown in the blog. It took me about an hour to set up, run the experiments, and collect and analyze the results.

  2. At this point “run the script sudo ./dcell-exp1 ”
    i get
    sudo: ./dcell-exp1: command not found

    Any help highly appreciated
