Team: Wendy Mu and Bridget Vuong
In datacenter networking, it is hard to ensure good performance across a variety of flow sizes, especially when latency-sensitive short flows are mixed in with lower-priority long flows. In “Deconstructing Datacenter Packet Transport,” the authors present pFabric, a simplistic datacenter fabric that optimizes for flow completion time. The key insight of the paper is to decouple flow prioritization from rate control by explicitly encoding a priority number into every packet and using switches that implement a priority, rather than first in, first out, queue. The priority of each packet is dictated by the remaining flow size, meaning that latency-sensitive short flows will be prioritized over long flows. With this fabric design, datacenters can achieve near ideal performance with only small buffers and minimal congestion control .
Many datacenters today simultaneously have latency-sensitive short flows and longer background flows. For example, in a typical search datacenter, each user query will generate many short flows. Additionally, there are longer flows running in the background, performing tasks such as indexing. While it is important for the short flows to complete quickly in order to deliver search results to the user, in practice, these short flows compete with long flows from background tasks like indexing, resulting in poor completion times.
The authors found that pFabric with minimal or no rate control achieved average flow completion times that were close to ideal across all flow sizes. For short flows, the average flow completion time grew minimally with load with pFabric, a significant improvement over regular TCP or DCTCP. However, for long flows, the average flow completion time grew similarly with load for pFabric, DCTCP and TCP . These results demonstrate that pFabric significantly improves the performance of latency-sensitive short flows without significantly affecting the longer background flows of data centers.
4. Subset Goal
We attempt to reproduce the results of Figures 2, 3, and 4 of the paper, which graph the flow completion time versus load for data mining and web search workloads. Figure 2 compares the overall average normalized flow completion time under TCP, DCTCP, minTCP with pFabric, line rate with pFabric, and the ideal theoretical values across typical data mining and web search workloads. Figures 3 and 4 are expansions of Figure 2, breaking down the statistics across different flow sizes. We will only be reproducing TCP and a variant of minTCP with pFabric in our experiments, however.
5. Subset Motivation
We only replicate the figures for TCP and a variant of minTCP with pFabric mainly because of time constraints. Figuring out DCTCP is a project in itself (in fact, another group is doing this project), as is writing a simulation to calculate the ideal flow completion time using Shortest Remaining Processing Time. We believe that the comparison between TCP and minTCP with pFabric represent the main findings of the paper anyway since it demonstrates the effectiveness of pFabric at achieving close-to-ideal flow completion times.
6. Subset Results
Above, we show our reproductions of figures 2, 3, and 4 from the paper, respectively. All experiments were run with link speeds of 100Mbps with no added delay over a 15-host star topology. The graphed flow completion times are all normalized to the best possible completion times (assuming no congestion) as in the pFabric paper. For the web search workload, we ran the experiment under each load for 3 minutes. For the data mining workload, we ran the experiment under each load for 5 minutes. The results are as we expected, and pretty similar to the results in the paper. From the graphs, we can see that for short flows, minTCP+pFabric performs significantly better than regular TCP, but for the largest flows, performance is about the same, if not worse, for minTCP+pFabric. This makes sense because pFabric prioritizes short flows over long flows. In general, the experiments with the web search workload seemed to have smoother and more reasonable results than the experiments with the data mining workload. This could be because the data mining workload is much more skewed, with flow sizes ranging from 1 to 666667, whereas the web search flow sizes only ranged from 1 to 20000. One main difference between our reproduced graphs and the graphs in the paper is in the magnitude of the normalized flow completion times; this could be explained by the difference in topologies or by the difference in number of priorities. In the paper, nodes were connected in a FatTree topology where TCP was used in conjunction with packet spraying, but for our experiments, we used a star topology with only a single path from one node to any other node. We were also limited to a maximum of 16 priority buckets for the PRIO queueing discipline, while the paper assumed an unlimited number of priorities, one for each integer value of remaining flow size. For our experiments, the limited number of priorities meant that a flow with remaining size 2 may not necessarily be serviced before a flow of remaining size 3 because they map to the same bucket, thus increasing the average normalized flow completion time. Additionally, the graphs for TCP were not as smooth as the graphs in the paper, but this could be due to differences in run times of the experiments. We believe that if we had run the experiments for a lot longer, the graphs would look smoother. However, with 3 and 5 minutes per load, we could already see the large difference between regular TCP and minTCP+pFabric.
7. Implementation and Challenges
We broke down the implementation of this experiment into 4 steps.
- Implement priority queueing at the switch. We added support for using the PRIO queueing discipline with 16 priority bands via tc commands in Mininet’s link.py. We added tc filters to match the encoded priorities in our packets and also to filter all other packets such as acknowledgements to the highest priority.
- Generate web search and data mining workloads that write priorities into packets. We wrote simple client and server programs using Python sockets that send and receive flows according to a Poisson process based on either a data mining or web search workload distribution. For each flow, the client spawns a new subprocess that sends the packets, with each packet’s priority based on the remaining flow size. Each server spawns a new thread for each incoming connection to process the packets for that flow. We run these programs on each host in our Mininet topology.
- Configure TCP to use the minimal congestion control and RTO estimation features. We used sysctl parameters to change our congestion control algorithm from Cubic to Reno and to disable any other advanced TCP features. We were not be able to replicate the minTCP that the paper uses, however, due to the limited parameters that sysctl exposes.
- Create a datacenter topology over which to run experiments. While the paper ran experiments over a k=6 fat-tree topology, we ran our experiments on a simplified star topology with all hosts connected to a single switch, mainly due to time constraints. We do not believe that this simplified topology should affect our results too much, however, since from a host’s perspective, a fat-tree topology can be abstracted as a single, large, high-bandwidth switch to which all other hosts are connected.
The biggest challenge for us was to pick good parameters for running our experiment. Since we had to scale down in multiple dimensions, it was hard to determine how many hosts to use, how much we should scale down the workload distributions, and how long to run the experiment for at each load level. We found that our final results graphs depended a lot on these choices partially due to the memory and hard drive space limits of the machine. In the end, we decided not to scale down the distributions but to only run with up to 15 hosts.
We found that the main thesis of the paper, that priority queueing using pFabric provides a simple way to decrease average flow completion times especially for small flows and without much impact on large flows, does hold true. However, the amount of improvement is highly sensitive to the parameters of our experiments, i.e. the number of hosts and the distribution of priorities. One assumption that the paper made was that there was essentially an unlimited number of priorities available for priority queueing. However, in practice, such large numbers of priorities are not always available. Since the main performance gain of pFabric depends on priority queuing, fitting 666667 (the maximum number of packets in the data mining workload) priorities into 16 buckets will inevitably lose some of the flow ordering that would have resulted had we had an unlimited number of priorities. We tried splitting the priorities in several different ways (e.g., linearly, logarithmically, based on workload distribution) to achieve the best performance. This difference could explain the difference in normalized flow completion times for our runs versus the original paper’s runs.
Although the paper uses ns2 to conduct its experiments, we decided to use Mininet with OpenFlow/Openvswitch running on Linux. Although ns2 offers more control, Mininet allows for more realistic simulation. The experiment in the paper used minTCP, a variant of TCP with all advanced features such as fast retransmission and RTO estimation turned off. Additionally, their topology consists of 54 hosts, 10Gbps links with 12us RT delays between hosts, and essentially an unlimited number of priorities in its priority queues. In Linux, we could not easily turn off many features of TCP such as fast retransmit, duplicate ACKs, etc. Instead, we opted for using TCP Reno and turning off the advanced features of TCP that were available via system control variables in Linux. We had trouble running Mininet with 54 hosts due to memory constraints, so we scaled it down to 15 hosts. Mininet could not emulate 10Gbps links, so we scaled down our link bandwidth to 100Mbps. The priority queueing discipline did not work in conjunction with netem, which Mininet uses to emulate link delay and queue sizes, so we ran our experiment with no added delay on the links and modified txqueuelen to set the maximum queue sizes. The PRIO qdisc only has a maximum of 16 priority classes, so in order to partition our traffic into 16 classes, we scaled the remaining flow sizes down logarithmically into the 16 buckets.
10. Reproducing Experiments
- Create a c1.xlarge EC2 instance using the CS244-Win13-Mininet ami.
- git clone https://github.com/bvuong/CS244PA3.git
- Run sudo ./setup.sh in the ~/CS244PA3/ directory to set up the environment (You may have to run chmod a+x setup.sh to set the appropriate execute permissions)
- To run our experiments, run sudo ./run.sh in the ~/CS244PA3/ directory. Inside the two directories (named <workload>_<num hosts>h_<runtime per load>s/) created, the plots total.png and breakdown.png show the results for all flows and for bucketed flow sizes, respectively.
Note: Our experiments take approximately 2.5 hours to run. Running the experiment again will delete previous results unless the previous results folder is renamed. If you keep around many directories of previous results, be warned that you may run out of disk space part of the way through the experiment!
1. M. Alizadeh, S. Yang, et al. Deconstructing Datacenter Packet Transport. In Hotnets 2012.