Diveesh Singh, Jean-Luc Watson
Original Paper: Liu, Vincent, et al. “F10: A Fault-Tolerant Engineered Network.” NSDI. 2013.
In this project, we reproduce the results published in F10: A Fault-Tolerant Engineered Network by Liu et al., namely that a switch topology co-designed with fault recovery protocols can robustly maintain connectivity even after many switch failures. Modern data centers form the backbone of cloud services and thus require high availability at minimum cost. A common switch topology that largely addresses these concerns is the “FatTree”: the resulting network is highly scalable, cost-efficient, and contains many redundant links that can provide fault tolerance when a switch or link fails. Unfortunately, the F10 authors identify a number of scenarios in which a FatTree, designed without serious consideration for fault tolerance, delivers suboptimal network performance. Specifically, because the topology is symmetric, a switch that needs to route down through a failed child cannot simply detour through one of its other children: those children would in turn attempt to route through the same faulty switch. This forces the affected data center to fall back on expensive, long rerouting paths, and such a system may not be able to respond quickly enough to prevent connection loss.
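To make the symmetry argument concrete, here is a minimal sketch in Python. It assumes the conventional k-ary FatTree wiring, in which core switch (group, index) connects to the aggregation switch at position `group` in every pod. That wiring, and every function name below, is an illustrative assumption rather than code from the F10 paper or from our reproduction. The sketch checks which core switches a short detour (one hop down to a sibling, one hop back up) can reach, and whether any of them still has a working link into the affected pod.

```python
# Illustrative sketch only: assumes the conventional k-ary FatTree wiring.
# Switch identifiers are hypothetical tuples:
#   core switch        -> (group, index)
#   aggregation switch -> (pod, position)

def core_children(k, group, index):
    """Aggregation switches directly below core switch (group, index)."""
    # A core switch in `group` has one child per pod, always at position `group`.
    return {(pod, group) for pod in range(k)}

def agg_parents(k, pod, position):
    """Core switches directly above aggregation switch (pod, position)."""
    # An aggregation switch at `position` connects to every core in that group.
    return {(position, index) for index in range(k // 2)}

def cores_via_short_detour(k, core, failed_agg):
    """Cores reachable by going down to a live child and straight back up."""
    group, index = core
    reachable = set()
    for child in core_children(k, group, index):
        if child != failed_agg:
            reachable |= agg_parents(k, *child)
    return reachable

def has_working_link_into_pod(k, core, pod, failed_agg):
    """True if `core` has a live child inside `pod`."""
    group, index = core
    return any(
        child[0] == pod and child != failed_agg
        for child in core_children(k, group, index)
    )

if __name__ == "__main__":
    k = 4
    failed = (0, 1)   # aggregation switch at position 1 in pod 0 has failed
    core = (1, 0)     # a core switch whose only link into pod 0 was `failed`
    detour = cores_via_short_detour(k, core, failed)
    usable = [c for c in detour if has_working_link_into_pod(k, c, 0, failed)]
    print("cores reachable by short detour:", sorted(detour))
    print("of those, cores with a live link into pod 0:", usable)
```

Under this wiring, every core reachable by the short detour belongs to the same group as the original core, so each of them depends on the same failed aggregation switch to enter the pod. Cores in other groups do have live links into the pod, but reaching them requires a longer path that descends further before climbing back up.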