CS244 ’17 pFabric: Deconstructing Datacenter Packet Transport


Team: Saleil Bhat & Melanie Ren

Goals & Motivation

Datacenter transport design is a complex problem because it must satisfy two competing constraints simultaneously. High fabric utilization is essential, but it cannot come at the expense of flow prioritization, which directly impacts quality of service. In the status quo, latency-sensitive parallel short flows often get stuck waiting behind resource-intensive long flows in switch queues. This is highly undesirable for a datacenter operator trying to provide fast service to its customers. Much of the prior work on this problem handles flow prioritization explicitly in novel congestion control algorithms. In this paper [1], the authors take a completely different approach. They argue that by changing the behavior of the fabric switches themselves, one can use a very minimalistic congestion control algorithm and still achieve better results. Their proposed switch design, called pFabric, uses a priority queue rather than a standard FIFO queue to schedule flows, with the goal of reducing latency for parallel short flows by serving them first while still keeping fabric utilization high and preventing congestion collapse.

Each flow is encoded with a single priority value based on the number of bytes remaining in the flow. Buffers are kept very small (about one bandwidth-delay product), and when a buffer is full, an incoming packet is dropped unless it has higher priority than some packet already in the queue, in which case it replaces the lowest-priority queued packet. The other key point of the design is that, since flow scheduling is handled by the switches, congestion control can be kept to a minimum: flows are sent at line rate, and senders back off only under high and persistent packet loss approaching congestion collapse.
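To make this concrete, the following minimal sketch (in Python, which we use for all examples in this post) models the enqueue/dequeue logic described above. It is a conceptual model of the scheme, not the actual switch implementation:

    # Conceptual model of one pFabric port queue. A lower 'prio' value
    # means higher priority (fewer bytes remaining in the flow).
    class PFabricQueue(object):
        def __init__(self, capacity):
            self.capacity = capacity      # in packets; ~1 BDP of buffering
            self.pkts = []                # list of (prio, packet) pairs

        def enqueue(self, prio, pkt):
            if len(self.pkts) < self.capacity:
                self.pkts.append((prio, pkt))
                return True
            # Buffer full: drop the arrival unless it beats the worst
            # (lowest-priority) packet currently queued.
            worst = max(range(len(self.pkts)), key=lambda i: self.pkts[i][0])
            if prio < self.pkts[worst][0]:
                self.pkts[worst] = (prio, pkt)
                return True
            return False                  # incoming packet dropped

        def dequeue(self):
            if not self.pkts:
                return None
            best = min(range(len(self.pkts)), key=lambda i: self.pkts[i][0])
            return self.pkts.pop(best)[1]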

Results

The authors evaluated their design with flow completion time (FCT) as the primary benchmark. They simulated two classes of dynamic workloads common in datacenters (web search and data mining) and compared the performance of pFabric against two alternatives: regular TCP and DCTCP, a protocol that uses explicit congestion notification to improve on regular TCP. Their work showed that pFabric, whether paired with a minimalistic congestion control algorithm or used with no congestion control at all, greatly reduces FCT as network load increases, except for very large web search flows. The results of their analyses are summarized in Figures 2-4 of [1].

[Figure: main results from the pFabric paper]

[Figure: subset results from the pFabric paper]

Subset Goals & Motivation

We attempted to reproduce the results in Figures 2, 3, and 4 for minTCP+pFabric and regular TCP (+DropTail). We chose this subset so as to contrast the newly proposed architecture with a “control.” We believe the most interesting question is whether the new solution outperforms the current status quo (regular TCP) and, therefore, whether datacenter operators should change their behavior. We excluded the DCTCP results because it was not feasible for us to implement both DCTCP and pFabric in the limited time we had. Furthermore, given that (as far as we are aware) DCTCP is not widely deployed, the DCTCP results are the least immediately useful for a datacenter operator. We originally proposed testing the LineRate+pFabric results as well, but technical limitations of our chosen platform prevented us from doing so [see Implementation Challenges].

Subset Results

[Figure 1: Overall average FCT, normalized to the best possible FCT per flow size]

[Figure 2: Data mining workload, normalized FCT across different flow sizes]

[Figure 3: Web search workload, normalized FCT across different flow sizes]

We used Mininet to perform our experiments. For our setup, we used a 28-host star topology rather than the paper's 54-host fat-tree topology. Additionally, we scaled the link rate down to 100 Mbps from the paper's 10 Gbps, with a delay of 6 µs on each link. We used the same empirical workload distributions for data mining and web search traffic as the paper, and ran traffic at each load for 240 seconds. To implement pFabric, we used the Linux traffic control PRIO queuing discipline.
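A topology along these lines takes only a few lines of Mininet's Python API. The sketch below is illustrative (host names and parameters are ours for exposition, not excerpts from our scripts):

    from mininet.net import Mininet
    from mininet.topo import Topo
    from mininet.link import TCLink

    class StarTopo(Topo):
        "n hosts hanging off one switch; 100 Mbps links with 6 us delay."
        def build(self, n=28):
            switch = self.addSwitch('s1')
            for i in range(n):
                host = self.addHost('h%d' % (i + 1))
                self.addLink(host, switch, bw=100, delay='6us')

    net = Mininet(topo=StarTopo(), link=TCLink)  # TCLink enforces bw/delay
    net.start()
    # ... install PRIO qdiscs, generate traffic, measure FCTs ...
    net.stop()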

Our results show that for both the data mining and web search workloads, MinTCP + pFabric generally performs better than TCP + DropTail (except for the large web traffic flows, matching the paper's result). However, our data is considerably noisier. The four major divergences from the paper are as follows: 1) Our 99th percentile data does not follow the trend of the average and is quite erratic for pFabric. 2) TCP outperforms pFabric at the lowest web traffic loads, but falls behind as the load increases. 3) We do not observe an increase in FCTs for pFabric as the load increases in the data mining experiment. 4) The differences between the average FCTs and best-case FCTs were much larger in our experiments than in the paper's.

We hypothesize that 2) is caused by the fact that the queues in the pFabric case are quite small, so packets are dropped even at low loads. However, as the load increases, the price of a dropped packet becomes small compared to the price of waiting in a bloated queue, and pFabric begins to outperform TCP. It is possible that 4) results from the fact that Mininet is an emulator that shares resources with the host OS, and as such there could be much higher variance in processing delay.

Implementation Challenges

Topology: We initially tried to mimic the paper by setting up a fat-tree topology. After some research, we discovered that Mininet does not support topologies with multiple routes or potential loops, and even the fat-tree topology code with special controllers recommended by the Mininet documentation did not work. In the end, we decided to use a simple star topology, since a good datacenter topology (such as a fat-tree) is, in the abstract, simply one giant switch connecting all hosts. Additionally, an updated version of the paper [2] runs the pFabric experiments over a leaf-spine topology, suggesting that the specific topology is not crucial to the design. We cut the number of hosts from 54 to 28 because of memory limitations; the number of hosts does not seem to play a big role in the overall trend of the results.

pFabric priority queue implementation: While the pFabric scheme assumes an effectively unbounded (or at least fairly large) number of priorities that can be assigned to packets, Linux's PRIO queuing discipline supports only 16 bands. Because of this, we had to bin our flow sizes into 16 bins, so small flows that are close in size are treated as the same priority rather than receiving distinct priorities. We experimented with a logarithmic scale for assigning flow priorities (rather than a linear one) in order to increase the resolution for small flow sizes, but that made pFabric perform worse. Separately, it is impossible to change the order in which packets are dropped in a Linux queuing discipline, so we could not guarantee that pFabric's “lowest-priority packet is dropped first” property was being met.
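A sketch of the linear binning, assuming an illustrative maximum flow size (the constants and helper names here are ours for exposition, not from our actual scripts):

    import socket

    NUM_BANDS = 16          # PRIO qdisc maximum
    MAX_FLOW_BYTES = 3e7    # illustrative upper bound on flow size

    def flow_band(bytes_remaining):
        # Linear binning: band 0 (served first) for the smallest flows,
        # band 15 (served last) for the largest.
        band = int(bytes_remaining * NUM_BANDS / MAX_FLOW_BYTES)
        return min(band, NUM_BANDS - 1)

    def tag_socket(sock, bytes_remaining):
        # SO_PRIORITY sets skb->priority; with an identity priomap on a
        # 16-band PRIO qdisc, priority k lands in band k. Values above 6
        # need CAP_NET_ADMIN, which Mininet's root-owned hosts have.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY,
                        flow_band(bytes_remaining))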

Link Bandwidth: We had to scale our link bandwidth down to 100 Mbps from the paper's 10 Gbps for two reasons. First, Mininet simply does not support link rates above 1000 Mbps. Second, the inter-arrival time of flows at a sender is a function of the link bandwidth. At higher link speeds, a call to sleep() is not precise enough to achieve the required wait time, making higher link speeds infeasible to emulate in Mininet.

FCT Measurements: We tried two different methods for measuring the flow completion time. First, we tried doing it entirely on the sender side by making use of the tcp_info struct, which has a field tcpi_unacked. We took a flow to be complete when all the bytes were sent and the value of tcpi_unacked was zero (all packets had been acknowledged). Second, we tried having the sender note the time of the first byte sent and the receiver note the time the last byte was received, then reconstruct the flow completion times at the end. We found that both methods produced similar results.  
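A sketch of the first (sender-side) method follows. Note that the byte offset of tcpi_unacked inside struct tcp_info depends on the kernel's header layout, so the offset below is an assumption to verify against /usr/include/linux/tcp.h:

    import socket
    import struct
    import time

    # Byte offset of the u32 tcpi_unacked field within struct tcp_info
    # on our kernel (after the one-byte state fields and tcpi_rto
    # through tcpi_rcv_mss); check against linux/tcp.h.
    TCPI_UNACKED_OFFSET = 24

    def unacked_segments(sock):
        info = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
        return struct.unpack_from('I', info, TCPI_UNACKED_OFFSET)[0]

    def wait_until_acked(sock, poll_interval=0.001):
        # The flow is complete once every sent byte has been ACKed.
        while unacked_segments(sock) > 0:
            time.sleep(poll_interval)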

TCP/congestion control setup: We used sysctl to modify the TCP settings for our experiments. For the TCP + DropTail scheme, we used the system's default TCP (Cubic) settings. For MinTCP, we manually disabled duplicate ACKs, fast retransmission, and other newer TCP mechanisms, but we were unable to turn off timeout estimation since we could not locate it among the sysctl parameters; this was the only MinTCP setting from the paper we could not replicate. We wanted to use TCP Tahoe, since it has the least complex behavior, but the system did not recognize it, so we chose TCP Reno as the base congestion control algorithm instead.
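The sketch below approximates this configuration; the exact set of knobs is illustrative, and knob availability varies with kernel version:

    import subprocess

    # Approximate MinTCP-style settings (kernel 3.x-era knob names).
    MINTCP_SYSCTLS = {
        'net.ipv4.tcp_congestion_control': 'reno',  # simple base algorithm
        'net.ipv4.tcp_sack': '0',                   # no selective ACKs
        'net.ipv4.tcp_dsack': '0',
        'net.ipv4.tcp_fack': '0',
        'net.ipv4.tcp_timestamps': '0',
        'net.ipv4.tcp_early_retrans': '0',          # no early retransmit
    }

    def apply_mintcp():
        for key, val in MINTCP_SYSCTLS.items():
            subprocess.check_call(['sysctl', '-w', '%s=%s' % (key, val)])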

Line rate + pFabric: For line rate, we attempted to use datagram sockets to avoid congestion control altogether, but this proved infeasible in Mininet. Again, sleep() does not have high enough precision to make the sender wait for an interval on the order of microseconds, which would be necessary to send at exactly the 100 Mbps line rate.
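To see why: at 100 Mbps, back-to-back 1500-byte packets must be spaced 120 µs apart, at or below the practical granularity of sleep() on a non-realtime kernel. A sketch of the kind of pacing loop this would require (illustrative, not our code):

    import time

    LINK_BPS = 100e6     # 100 Mbps
    PKT_BYTES = 1500

    def send_at_line_rate(sock, data):
        gap = PKT_BYTES * 8 / LINK_BPS    # 120 us between packets
        for i in range(0, len(data), PKT_BYTES):
            sock.send(data[i:i + PKT_BYTES])
            time.sleep(gap)  # in practice overshoots 120 us badly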

Load: We had to adjust how we loaded the network relative to the original paper, since we used a different topology. For an NxN switch in which each port has bandwidth B, 100% utilization corresponds to N*B bits per second. If flow arrivals are a Poisson process with rate lambda, we want the following equation to hold: E[flow size] * lambda = load * N * B.

Therefore, lambda = load * N * B / E[flow size]. This is the total arrival rate over the whole fabric. We initially assumed each sender should use lambda = load * B / (E[flow size] * N), i.e., that we should divide the per-sender share by the number of hosts once more. However, we obtained better results when we did NOT perform that extra division, i.e., with lambda = load * B / E[flow size] per sender. In hindsight, this is the consistent choice: splitting the total rate load * N * B / E[flow size] evenly across the N senders yields exactly load * B / E[flow size].
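Concretely, each sender draws a flow size and waits an exponentially distributed interval between flow starts. In the sketch below, sample_flow_size and start_flow are hypothetical callbacks standing in for our empirical-CDF sampler and flow launcher:

    import random
    import time

    LINK_BPS = 100e6  # per-port bandwidth B, in bits per second

    def run_sender(load, mean_flow_bytes, sample_flow_size, start_flow):
        # Per-sender arrival rate in flows/sec; the factor of 8 converts
        # E[flow size] from bytes to bits. This is the variant that does
        # NOT divide by N.
        lam = load * LINK_BPS / (mean_flow_bytes * 8.0)
        while True:
            time.sleep(random.expovariate(lam))  # Poisson inter-arrivals
            start_flow(sample_flow_size())       # draw from empirical CDF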

Critique

The main point of the paper that we explored was whether the single change to a priority queue (rather than a FIFO queue) that prioritizes the delivery of short flows is enough to remove the need for complex congestion control. Our experiments affirmed this general result. However, we also found that the system is very sensitive to minor changes. For example, dividing lambda by N (as discussed above) completely changed the trends for both TCP and pFabric. Similarly, switching to a logarithmic priority assignment made pFabric perform very differently. As such, we cannot definitively conclude that pFabric would achieve similar results in a production system, where there would be a vast number of uncontrolled variables.

We believe that limitations of our platform were the main cause of the divergence between our results and those of the original paper. There are many potential sources of error (e.g., the constraints of PRIO queuing, the inability to use a truly minimal congestion control algorithm) that could have contributed to the difference. Despite all these issues, we still see a general trend of pFabric outperforming TCP in most cases, which lends credence to the paper's thesis.

Platform

The platform used in the paper was ns-2, but we chose Mininet on Linux instead: we wanted to verify that the design works well regardless of platform, and Mininet is a popular network emulator. Unfortunately, we found that Mininet is not a particularly good platform for this experiment (as detailed in the previous section); in hindsight, we probably should have used ns-2. In terms of the reproducibility of our project, the main factor that affects the plots is the runtime per load. A long experiment duration is necessary because the flow size distribution is heavily skewed, so a short run will not be representative of all flow sizes. Additionally, since each load runs for a fixed time and is then terminated, large flows will not have a chance to complete if the run is too short. Our experiment is configured to run for 240 seconds per load; we recommend running it for no less than 180 seconds per load.

Reproduction

Follow the steps below to reproduce our experiment. Note that you may have to disable sleep and the screensaver on your local machine while running the code so that the SSH session does not disconnect midway (the run takes about 2.5 hours).

  1. Set up a Google Cloud VM running Ubuntu 14.04 LTS with 2vCPUs and 7.5GB memory.
  2. Clone our experiment repository onto the VM and cd into it (the scripts below assume it is checked out at ~/pfabric_v2).
  3. Run sudo ./setupenv.sh to install the required dependencies (Mininet, etc.). You will have to answer yes to some prompts during this but it will not take long.
  4. Run sudo ./run.sh to run the experiment. The experiment will take roughly 2.5 hours to complete and plots will be in the outputs directory generated at the end. Repeating the run will delete and overwrite the previous outputs directory.
  5. To retrieve the files to local machine for viewing:
    1. Set up an SSH keypair for the Cloud VM (or follow the instructions in Google Cloud's documentation):
      1. SSH: ssh-keygen -t rsa -f ~/.ssh/my-ssh-key -C [USERNAME], where USERNAME matches the username for the instance
      2. Restrict access to the key: chmod 400 ~/.ssh/my-ssh-key
      3. Go to the project's metadata page, click on “SSH Keys”, then click “Edit”.
      4. Use this command in a terminal to print the key contents: cat ~/.ssh/my-ssh-key.pub
      5. Copy and paste the output as a new item in the “SSH Keys” list and save.
    2. Download from the VM with the following syntax: scp -r -i ~/.ssh/my-ssh-key USERNAME@VM_EXTERNAL_IP:~/pfabric_v2/outputs YOUR_LOCAL_DIRECTORY

Feedback

Mininet is sorely lacking in configurability and power; as the Implementation Challenges section shows, the majority of our problems revolved around setting up the proper network environment. Google Cloud VMs are also frustrating to work with because they sometimes start hanging midway through an experiment after a few runs and must be shut down and restarted before they work again. This problem is unique to the Google Cloud VMs; we did not see this behavior when running the experiments on our local machines. Also, SSH sessions to a Google Cloud VM would often disconnect for no reason when accessed from the browser rather than from a terminal.

References

[1] M. Alizadeh, S. Yang, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker, “Deconstructing Datacenter Packet Transport,” in Proc. ACM HotNets, 2012.

[2] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker, “pFabric: Minimal Near-Optimal Datacenter Transport,” ACM SIGCOMM Computer Communication Review, vol. 43, no. 4, pp. 435-446, 2013.

One response to “CS244 ’17 pFabric: Deconstructing Datacenter Packet Transport”

  1. Reproducibility: 4.5/5
    We easily reproduced the graphs presented in the blog post by following the steps provided by the authors.

    However, the graphs do not match the ones in the blog post exactly. The 99th percentile data is quite noisy, as the authors already noted. We also observe a big discrepancy in the CDF of the average completion time for the web workload: in our results, TCP and pFabric have similar performance, instead of pFabric clearly outperforming TCP at large loads as the authors claim.

    To conclude, it was very easy to reproduce the results following the authors’ instructions. However, the fact that tail latency is a very volatile metric, together with the selection of Mininet as the emulation platform, leads to unstable results. The problem could potentially be alleviated by running more iterations.
