Team: Jitendra Nath Pandey and Raman Subramanian.
Key Result(s): Demonstrate the fault tolerant routing scheme of DCell by reproducing Figure 11 from the original DCell paper.
Original Paper: DCell: A scalable and fault tolerant network structure for data centers. Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, Songwu Lu. SIGCOMM 2008.
Related presentation: DCell: A scalable and fault tolerant network structure for data centers.
Contacts: Jitendra Nath Pandey (firstname.lastname@example.org) and Raman Subramanian (email@example.com)
Data center networks must scale to a large number of servers and allow for incremental expansion. The network must not only tolerate failures but also meet the high bandwidth demands of large-scale applications such as distributed file systems and map-reduce clusters.
DCell is a novel network structure that addresses these requirements. It has three components: the DCell network structure, a distributed fault-tolerant routing protocol, and an incremental upgrade scheme that allows gradual expansion. In DCell, a server is connected to several other servers and a mini-switch via bidirectional links. Higher-level DCells are recursively constructed from lower-level DCells. DCells at the same level are fully connected with one another. DCell uses a decentralized fault-tolerant routing protocol (DFR) that effectively handles various failures.
Figure 1 shows the DCell network topology used in the paper to generate its results. The small circles represent nodes (also referred to in this blog as servers or hosts), which do some of the routing as well. The bigger circles represent the higher-level DCell. Each DCell contains a mini-switch represented by the rectangular box.
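The recursive construction can be made concrete with a short sketch. It assumes the construction rule from the DCell paper: a DCell_k is built from t_{k-1} + 1 copies of DCell_{k-1}, and in a DCell_1 server [i, j-1] is linked to server [j, i] for every pair i < j. The function names `dcell_size` and `dcell1_links` are ours, chosen for illustration; this is not code from dcellpox.

```python
from itertools import combinations

def dcell_size(n, k):
    """Number of servers in a DCell_k built from n-server DCell_0s."""
    t = n
    for _ in range(k):
        # A DCell at the next level holds (t + 1) sub-DCells of t servers each.
        t = t * (t + 1)
    return t

def dcell1_links(n):
    """Server-to-server links of a DCell_1: [i, j-1] <-> [j, i] for i < j."""
    return [((i, j - 1), (j, i)) for i, j in combinations(range(n + 1), 2)]

links = dcell1_links(4)
print(dcell_size(4, 1))            # 20 servers: 5 DCell_0s with 4 nodes each
print(len(links))                  # 10 links, one per pair of DCell_0s
print(((0, 3), (4, 0)) in links)   # True
```

With n = 4 this gives the 5 x 4 topology used in the experiment, and the link between 0.3 and 4.0 is the one that connects the first and last DCell_0.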
The graph to replicate is shown below in Figure 2. In this experiment, a TCP flow is set up between nodes 0.0 and 4.3 (refer to Figure 1 for the topology) and the throughput seen at node 0.0 is plotted. The graph demonstrates that DCell’s fault-tolerant routing quickly recovers the original throughput after a link failure. Throughput also recovers to the original level after a node failure, but only after a delay equal to the link state timeout value in the switches. The switches broadcast their link state to all other switches within the local DCell, so a link failure is detected and recovered from quickly. A node failure is detected only when the link state timeout stored in the switches expires, which causes a delay of roughly 5 seconds. In the paper all links were set up with 1000 Mbps bandwidth.
We implemented dcellpox as our OpenFlow controller using the POX framework in Python. We implemented exactly the same topology as used in the paper, but in Mininet. In DCell the hosts do the routing as well. In Mininet, however, it was cumbersome to set up the hosts for routing, so we modeled every node of the paper as a host-switch combination in which the switch associated with the host does the necessary routing. The resulting topology is shown in Figure 3. There are 5 DCells and each cell has 4 nodes. In each host-switch combination the server is represented by the oval and the associated switch by the rectangular box. Each DCell has one mini-switch connecting its 4 host-switch combinations.
The paper uses 1000 Mbps links, but Mininet cannot sustain such high bandwidth across many links, so we scaled the link bandwidth down to 100 Mbps. As expected, the results show that the throughput was scaled down by a factor of 10 as well compared to the graph in the paper.
The DCell paper describes a basic routing protocol and a fault-tolerant routing protocol. We implemented both in Python in our OpenFlow controller dcellpox. Specifically, we implemented the local re-route mechanism that is used when a link failure is encountered. We did not implement link state broadcast because our controller has access to the global state. We also did not implement jump-up handling for rack failure because this experiment needed only link failure and node failure.
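To illustrate the two routing modes, here is a simplified sketch for a two-level DCell with 5 cells of 4 nodes, using the paper's node numbering (cell, index). It is not the dcellpox code and not the paper's full DFR: it assumes the inter-cell link rule [i, j-1] <-> [j, i] from the DCell paper, and on failure it simply tries each other cell as a proxy from the source. All function names are hypothetical.

```python
def inter_link(a, b):
    """The single link between cells a and b, as (end in a, end in b)."""
    i, j = min(a, b), max(a, b)
    low_end, high_end = (i, j - 1), (j, i)
    return (low_end, high_end) if a < b else (high_end, low_end)

def dedupe(path):
    """Drop consecutive repeats, e.g. when src is already the link endpoint."""
    out = []
    for node in path:
        if not out or out[-1] != node:
            out.append(node)
    return out

def dcell1_route(src, dst):
    """Basic routing: stay in the cell, or cross the one link joining the cells."""
    if src[0] == dst[0]:
        return dedupe([src, dst])  # same cell: one hop via the mini-switch
    n1, n2 = inter_link(src[0], dst[0])
    return dedupe([src, n1, n2, dst])

def usable(path, failed_links, failed_nodes):
    if any(node in failed_nodes for node in path):
        return False
    return all(frozenset(hop) not in failed_links for hop in zip(path, path[1:]))

def route_with_failures(src, dst, failed_links=(), failed_nodes=(), n=4):
    """Basic route if it is healthy, else the first healthy route via a proxy cell."""
    failed_links = {frozenset(link) for link in failed_links}
    failed_nodes = set(failed_nodes)
    path = dcell1_route(src, dst)
    if usable(path, failed_links, failed_nodes):
        return path
    for proxy in range(n + 1):
        if proxy in (src[0], dst[0]):
            continue
        a, b = inter_link(src[0], proxy)
        candidate = dedupe([src, a, b] + dcell1_route(b, dst))
        if usable(candidate, failed_links, failed_nodes):
            return candidate
    return None

print(dcell1_route((0, 0), (4, 3)))
print(route_with_failures((0, 0), (4, 3), failed_links=[((0, 3), (4, 0))]))
print(route_with_failures((0, 0), (4, 3), failed_nodes=[(0, 3)]))
```

For the flow from 0.0 to 4.3, the basic route crosses the link between 0.3 and 4.0. With that link or node 0.3 marked as failed, the sketch detours through the second cell, entering it at node 1.0.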
iperf was used to send traffic from node 1_1_1 to node 1_5_4. Node 1_1_1 corresponds to 0.0 in the paper and 1_5_4 corresponds to node 4.3 in the paper. The normal route is shown by the green line in Figure 3. As done in the paper, we bring the link from 0_1_4 to 0_5_1 down at the 34th second. The alternate route is shown by the red line in the same figure. We bring the link back up at the 42nd second and the original route is restored. At the 104th second we emulate the failure of node 0_1_4 by bringing down all the links connected to it. The route shown by the red line is again set up as the alternate route. As in the paper, the node is not brought back up.
With an OpenFlow controller, a link failure or node failure is detected quickly (through the PortStatus messages that switches send to the controller), but in the actual DCell implementation in the paper there is a delay. To emulate this delay we added a sleep of 1 second for link failure in the controller. A sleep of 5 seconds was used for node failure, which corresponds to the link state timeout used in the paper. We later re-ran the experiment without these delays to demonstrate that routing using an OpenFlow controller gives much better results.
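The effect of the emulated delays on the timeline can be sketched as follows. This is a back-of-the-envelope model, not the controller code: it assumes the alternate route is installed exactly one emulated delay after the failure and ignores controller processing and flow installation time. The names `DETECTION_DELAY_S` and `reroute_install_time` are hypothetical.

```python
# Emulated detection delays in seconds, as described in the text.
DETECTION_DELAY_S = {"link": 1.0, "node": 5.0}

def reroute_install_time(failure_time_s, kind, emulate_dcell=True):
    """Time at which the alternate route would be installed in this model."""
    delay = DETECTION_DELAY_S[kind] if emulate_dcell else 0.0
    return failure_time_s + delay

print(reroute_install_time(34, "link"))                        # 35.0
print(reroute_install_time(104, "node"))                       # 109.0
print(reroute_install_time(104, "node", emulate_dcell=False))  # 104.0
```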
Figure 4 shows our reproduction of Figure 11 from the paper. We plot the throughput as seen at the source node. The comparison between the two figures is shown in Figure 6. The first big drop in throughput, at the 34th second, corresponds to our link failure, and the recovery was fast, as in the original paper. The second big drop, at the 104th second, corresponds to the node failure. The recovery takes around 5 seconds, matching the link state timeout. The maximum throughput is around 90 Mbps, a 1/10 scaling compared to the original graph. This is expected because we used 100 Mbps links instead of the 1000 Mbps links used in the paper. Thus our results are very similar to those in the DCell paper.
To demonstrate that the alternate routing was indeed triggered, we also plot in Figure 5 the throughput seen at switch 0_2_1, which is on the alternate path. Figure 5 clearly shows that on link failure the alternate route was used for around 8 seconds, which is about how long the link was down. The traffic went back to the original route once the link was up. On node failure, the alternate path was again installed for the remaining duration of the experiment.
With an OpenFlow controller, any link or node failure can be detected easily because switches report port status to the controller. In DCell, however, a node failure is detected only when the link state timeout expires in the other nodes of the DCell. We repeated the above experiments without introducing any delay in our controller. The goal was to see whether routing using an OpenFlow controller improves the recovery time. Significant improvements are expected because the OpenFlow controller maintains a global view and state of the topology. The experimental results confirm this, as shown in Figure 7. There is a drop in throughput, but recovery is much faster than in the original DCell design. Figure 8 is similar to Figure 5 and confirms that the alternate route was indeed picked.
The DCell paper used links with 1000 Mbps bandwidth. Mininet was not able to provide such high bandwidth for a sustained period of time, so we had to scale the bandwidth down to 100 Mbps in our experiments. A ‘c1.medium’ EC2 instance was sufficient to reproduce Figure 4. We also tried a ‘c1.large’ instance, and the plot we got was much smoother than the one in Figure 4.
DCell uses a distributed routing algorithm while we used a centralized controller. This was a key difference between the setup used in the paper and our setup. The use of an OpenFlow controller significantly simplified the routing implementation because the global topology and link state were available at the controller. The original DCell implementation instead uses a distributed routing algorithm with broadcasts so that neighboring switches are aware of node states. The OpenFlow controller not only simplified the implementation but also significantly improved the performance of fail-over routing, as shown in Figures 7 and 8.
The OpenFlow APIs in the POX framework were very easy to use. However, the timeouts associated with the installed flows needed to be tuned to reduce unnecessary traffic to the controller.
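For readers who want to compute a throughput curve like this themselves, here is a minimal sketch that turns cumulative byte counters sampled at an interface into Mbps. It is an assumption about how such a plot can be derived, not the plotting script from our repository, and the sample values are synthetic, chosen only to make the arithmetic easy to follow.

```python
def rates_mbps(samples):
    """Convert (time_s, cumulative_bytes) samples into (time_s, Mbps) points."""
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        bits = (b1 - b0) * 8.0
        rates.append((t1, bits / (t1 - t0) / 1e6))
    return rates

# Synthetic counters: 11,250,000 bytes per second is 90 Mbps; the flat
# last interval models a second with no traffic on the interface.
samples = [(0, 0), (1, 11250000), (2, 22500000), (3, 22500000)]
print(rates_mbps(samples))  # [(1, 90.0), (2, 90.0), (3, 0.0)]
```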
Instructions to Replicate This Experiment
1. Create an EC2 instance using AMI: cs244-mininet-mptcp-dctcp (ami-a04ac690) (on West Coast(Oregon)). A c1.medium instance will suffice.
2. Log in to the EC2 machine.
3. Make sure there is no file or directory named ‘pa3’ or ‘pox’ in the home directory.
4. git clone git://github.com/noxrepo/pox.git
5. git clone https://bitbucket.org/spraman/pa3.git
6. cd pa3
7. cd dcell
8. sudo python setup.py develop
9. cd ../dcellpox
10. sudo python setup.py develop
11. cd ..
The experiment takes around 5 minutes to run. The output is created in a directory with a name of the form MonDD-hh-mm (e.g. Jun04-02-00). There will be two png files in this directory:
- dcell_rate_0_1_1-eth1.png : This is the reproduction of the original graph depicting the throughput at source node amidst failures (Figure 4)
- dcell_rate_0_2_1-eth3.png: This shows the throughput in alternate path. (Figure 5)