CS244 ’16: PCC Shallow Queues and Fairness


Authors: Evan Cheshire and Wyatt Daviau

Key Results

PCC performance degrades in virtual environments under certain conditions.  However, even in these environments, multiple PCC flows converge more stably and share bandwidth more fairly than typical TCP flows.

Sources

[1] J. Jiang, V. Sekar, and H. Zhang. Improving Fairness, Efficiency, and Stability in HTTP-based Adaptive Video Streaming with FESTIVE. Proc. CoNEXT, December 2012.
[2] Emulab
[3] Mininet
[4] PCC Repository
[5] Previous CS 244 PCC blog post, Shin and Dia
[6] ipfw dummynet documentation

The PCC paper: M. Dong et al. PCC: Re-architecting Congestion Control for Consistent High Performance. Proc. NSDI, May 2015.

Introduction

The goal of the PCC paper is to develop a transport protocol with significantly better performance than TCP while maintaining practical deployability. This performance improvement is measured in terms of throughput in a variety of network settings, along with various measures of fairness.  The authors claim that the hardwired mapping of network events to specified reactions necessarily makes faulty assumptions about networks that are increasingly complex.  They then propose that the next generation of transport protocols should rely on a more general real-time view of network conditions to make sending-rate decisions.

PCC accomplishes this by running A/B tests during execution: the sender transmits data at different rates and moves toward the rate that performs best with respect to some utility function.  The utility function can be set by the user and in general can take into account measured throughput, loss rate, and latency; the implementation tested in the paper considers only throughput and loss rate.  Strikingly, with this utility function, PCC outperforms TCP by nearly 10x in a large variety of network conditions.  The implementation has also been shown to converge far better than TCP and to exhibit inter-PCC fairness that surpasses the fairness between competing TCP flows.  PCC's friendliness and its implementation may break down in virtualized settings, a subject analyzed by previous CS 244 students [5].  PCC is built on top of UDP and so requires no new hardware or header support.
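
To make the control loop concrete, here is a minimal Python sketch of utility-driven rate selection.  The functional form and constants below are our own illustrative assumptions, loosely modeled on the throughput/loss utility described in the paper; they are not the authors' exact implementation.

import math

def utility(throughput_mbps, loss_rate, alpha=100.0):
    # Illustrative loss-based utility: reward throughput, but penalize
    # loss sharply once it exceeds roughly 5%.  The form and constants
    # here are assumptions for illustration only.
    sigmoid = 1.0 / (1.0 + math.exp(alpha * (loss_rate - 0.05)))
    return throughput_mbps * sigmoid - throughput_mbps * loss_rate

def choose_rate(rate_a, rate_b, measure):
    # A/B test: send at two candidate rates, measure (throughput, loss)
    # for each, and move toward whichever yields the higher utility.
    u_a = utility(*measure(rate_a))
    u_b = utility(*measure(rate_b))
    return rate_a if u_a >= u_b else rate_b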

Motivation

We found this paper exciting.  The problem of improving network performance is at the heart of computer networking.  To address it, the authors provide a lucid look at the current paradigm of congestion control used by TCP and come up with an alternative that seems more reasonable.  If the authors’ evaluation is fair, then this alternative proves itself over and over to be a better way to mitigate network congestion.  What really makes PCC stand out is the variety of conditions under which it succeeds, such as satellite, lossy, or rapidly changing links.

Authors’ Results

The authors test PCC performance in many different settings, across many different variables, against many different strains of TCP.  They emulate highly lossy links and satellite links, where PCC outperforms the TCP variants designed for those systems.  PCC is also shown to achieve higher throughput than TCP on shallow-buffered links and rapidly changing networks.  In addition to performance, the paper evaluates the protocol in terms of convergence and fairness, showing evidence that PCC flows are better behaved when competing for resources among themselves than TCP flows are.

Our Subset Goals and Motivation

We decided to focus on two different aspects of PCC: its performance characteristics, and its stability and fairness properties.  Groups in the past have succeeded in reproducing PCC’s performance gains in two hostile network settings: satellite and artificially lossy links.  We wanted our contribution to test a different network feature.  The shallow-buffer experiment suggests that deploying PCC could mitigate the bufferbloat problem by giving flows high throughput even when buffers are kept small.  Specifically, our goal is to recreate Figure 7 from the PCC paper, shown below, measuring throughput vs. buffer size.

[Figure 7 from the PCC paper: throughput vs. bottleneck buffer size.]

After talking with Mo Dong we became interested in the improved convergence and fairness properties of multiple PCC flows as compared with TCP.  These properties have not yet been reproduced, and they make PCC appealing for video streaming applications, which rely on fairness and stability when sending multiple flows [1].  We decided to assess PCC’s multi-flow convergence by recreating Figure 12 of the original paper, in order to explore whether PCC could be applied to such use cases.

[Figure 12 from the PCC paper: convergence and fairness of multiple PCC flows sharing a bottleneck.]

Our Results

Platform and Setup

We decided to use Mininet because it was familiar and a simple tool for setting up different topologies.  We ran our experiments on a variety of AWS EC2 instances so that we could run an Ubuntu machine to support Mininet and PCC while keeping the flexibility to change the specifications of our machine.  Avoiding a local VM was also convenient for our development and testing speed.  We had to move from the free AWS tier to larger instances because the smaller instances could not handle the load of 4 separate PCC flows and would crash.  Our current understanding is that the instance type matters a great deal for the results; in particular, for the multi-flow experiment the instance should have at least 8 cores.  Our best results to date ran on the c4-2XL instance type.

Additionally, we used the Emulab network testbed to run our experiments on bare-metal machines, as was originally done in the PCC paper.  This allowed us to better understand the effects of virtualization on our results, which previous CS244 experiments have warned can be significant [5].  Emulab emulates network conditions by assigning dedicated machines as bridges that connect hosts using ipfw’s dummynet functionality [6].

Our two experiments required separate topologies.  We ran the shallow-buffer experiment on a simple two-host network with a switch in the middle, on both Mininet and Emulab.  One link was set to 1000 Mbps and the bottleneck to 100 Mbps, and the queue size ranged from 5 to 275 packets.  For the fairness topology, we created a wishbone network with a separate host for each sender and receiver and two switches in the middle.  The bottleneck was the link between the two switches, set to 100 Mbps.  We chose a queue size that could hold the bandwidth-delay product so that TCP could achieve maximum throughput; we calculated this to be 250 packets.  We again used both Mininet and Emulab for the wishbone network.  However, due to limited resources, we were only able to acquire a two-sender, two-receiver wishbone network on Emulab, which limited the number of flows we could send.
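
For reference, the wishbone topology can be expressed in Mininet’s Python API roughly as in the sketch below.  This is a simplified illustration, not our exact script: the class name is our own and the 1000 Mbps access links are an assumption.

from mininet.topo import Topo
from mininet.net import Mininet
from mininet.link import TCLink

class WishboneTopo(Topo):
    # n sender/receiver pairs hang off two switches; the switch-to-switch
    # link is the 100 Mbps bottleneck with a 250-packet queue.
    def build(self, n=4, queue=250):
        sw1 = self.addSwitch('switch1')
        sw2 = self.addSwitch('switch2')
        self.addLink(sw1, sw2, bw=100, max_queue_size=queue)
        for i in range(1, n + 1):
            self.addLink(self.addHost('s%d' % i), sw1, bw=1000)
            self.addLink(self.addHost('r%d' % i), sw2, bw=1000)

if __name__ == '__main__':
    net = Mininet(topo=WishboneTopo(), link=TCLink)
    net.start()
    net.pingAll()  # sanity-check connectivity before launching flows
    net.stop()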

In both environments we measured PCC throughput using reports from the provided executables [4], appclient and appserver, and TCP throughput with iperf.  In both cases throughput was averaged over 60 seconds, as was done in the original work.  We used a PCC build with the utility function evaluated in the paper.  Both experiments used TCP Cubic as the TCP variant for comparison, because Cubic appears in both original figures we reproduced and is the default congestion control in most operating systems.
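
The 60-second averaging itself is simple bookkeeping; a sketch is below.  The helper name and the (timestamp, Mbps) input format are our own, not part of the provided tools.

def average_per_window(samples, window=60.0):
    # samples: iterable of (timestamp_seconds, throughput_mbps) pairs,
    # e.g. per-second reports parsed from iperf or appclient output.
    # Returns the mean throughput within each `window`-second bin.
    bins = {}
    for t, mbps in samples:
        bins.setdefault(int(t // window), []).append(mbps)
    return [sum(v) / len(v) for _, v in sorted(bins.items())]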

Shallow Queue Results:

The results for the Emulab and Mininet Queue Sweeps are shown below:

[Figure: shallow queue sweep, throughput vs. queue size, on Mininet and Emulab.]


Based on the Mininet graph, PCC underperforms in the Mininet setup: its throughput does not climb with buffer size at the rate displayed in the original shallow-queue graph.  This is not the case in the Emulab setup, where PCC holds up to expectations.  Note that TCP Cubic’s results are virtually the same between the two environments.  After talking with Mo Dong and reading the previous CS244 PCC experiment, we believe that the degradation is due to the PCC implementation’s reliance on accurate timing for pacing.  A previous CS244 reproduction observed similar degradation when running PCC in a virtualized environment over an emulated network with shallow buffers [5].  The current hypothesis is that PCC relies on sending packets at specific times, but because the host OS of an EC2 instance is periodically interrupted, PCC’s even pacing degrades into sending larger batches of packets when the host OS wakes back up.  A small-buffered network by necessity drops more packets in such a burst than it would have if all packets had been sent out at their intended times.  This hypothesis is consistent with our results, which show PCC performance approaching the expected value as we increase the bottleneck buffer size.
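
To make the hypothesis concrete, the toy model below (our own construction, not taken from the paper or the PCC code) pushes the same 1000 packets per second through a 1000 packet/s bottleneck with a 10-packet buffer, once evenly paced and once in 50-packet batches.  The paced traffic sees essentially no loss, while the batched traffic overflows the shallow queue on every burst.

def count_drops(arrivals, queue_size, service_time):
    # Toy single-server FIFO: each packet takes `service_time` seconds to
    # transmit and at most `queue_size` packets may be queued at once.
    departures = []          # scheduled departure times of accepted packets
    dropped = 0
    for t in sorted(arrivals):
        departures = [d for d in departures if d > t]   # already transmitted
        if len(departures) >= queue_size:
            dropped += 1                                # buffer full: drop
        else:
            start = departures[-1] if departures else t
            departures.append(max(start, t) + service_time)
    return dropped

paced = [i / 1000.0 for i in range(1000)]          # one packet every 1 ms
bursty = [(i // 50) * 0.05 for i in range(1000)]   # 50-packet batch every 50 ms
print(count_drops(paced, queue_size=10, service_time=0.001))   # no drops
print(count_drops(bursty, queue_size=10, service_time=0.001))  # hundreds of drops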

Note: we tried running this experiment on a dedicated host to mitigate these problems (plot not shown), but we actually saw more degradation, even at larger buffer sizes.  In retrospect this was possibly because our dedicated host was a single-core machine, which forces the PCC sender and receiver to interrupt each other and degrades pacing further.

 

Convergence and Fairness Results:

The following images display our results in both Emulab and Mininet for running concurrent PCC and TCP flows:

[Figures: Emulab PCC fairness run; Mininet PCC 2-flow and 4-flow convergence.]

[Figure: Mininet TCP 4-flow convergence.]

[Figures: 8-core Mininet PCC and TCP multi-flow convergence (pccmulti, tcpmulti).]

 

Although our Emulab setup only included two senders and two receivers, the two flows converged to the same value quickly and experienced little variation during the run, true to the original paper’s experiment.  We ran the Mininet version on several different instances and got different results.  On an m3-large (2 vCPUs), shown in the “Mininet multi-flow” plots above, the flows are more stable and fair than TCP flows but still significantly more variable than those in Figure 11 of the original paper.  Running on a c4-2XL instance with 8 virtual CPUs, we achieved our final result, the “8-core Mininet multi-flow” plots seen above.  The stability and fairness of the flows in this setting are comparable to those seen by the PCC authors.  We have included a plot of a 2-flow Mininet run (on the m3-large) to provide a better visual comparison with the Emulab data.

To quantify PCC’s improvement over TCP we recorded statistics from the 4-flow experiment.  We measured the average Jain’s fairness index, a standard measure which approaches 1 as multiple flows share throughput more evenly, over 100 s intervals.  We also include the average standard deviation of each flow’s throughput as a measure of stability, where a higher standard deviation corresponds to greater instability.  We averaged the statistics over every interval in which the recorded number of flows was active; for example, we took 3-flow statistics on throughput data from t = 1000–1500 s and t = 2000–2500 s.
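
For reference, the two statistics we report are computed as in the sketch below (the per-interval bookkeeping from our scripts is omitted).

def jain_index(throughputs):
    # Jain's fairness index: (sum x)^2 / (n * sum x^2).
    # Equals 1 when all flows receive identical throughput.
    n = len(throughputs)
    return sum(throughputs) ** 2 / (n * sum(x * x for x in throughputs))

def throughput_stddev(samples):
    # Standard deviation of one flow's throughput samples within an
    # interval; a higher value means a less stable flow.
    mean = sum(samples) / len(samples)
    return (sum((x - mean) ** 2 for x in samples) / len(samples)) ** 0.5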

[Figure: Jain’s fairness index and throughput standard deviation for the 4-flow PCC and TCP runs.]

The data shows that multiple PCC senders achieve better stability and fairness than TCP over the course of our experiment.  The run on the c4-2XL shows fairness comparable to the results reported in the original paper, in which all JFIs at 100 s intervals are between 0.98 and 0.99.

 

Challenges

  • We originally thought running our simulations on t2 micro and nano instances would be sufficient; however, without fail these machines would crash during longer runs of the PCC convergence tests whenever we sent 3 or more flows for over 30 seconds.
  • It was difficult to get time on the Emulab testbed to run our reproduction in the paper’s original environment.  Because of limited resources we were only able to use 2 senders and 2 receivers, and so could not reproduce the convergence test in full.  A big thanks to Mo for helping us get access and set up the Emulab topology.
  • Probably the most difficult challenge to overcome was the unpredictability of the effects of virtualization on our results.  We witnessed variation in PCC throughput that, to the best of our knowledge, is related to the size of the instance.  However, some of our observations were paradoxical and difficult to explain; for instance, an m3-medium on a dedicated host showed greater shallow-queue performance degradation than an on-demand t2-medium.  We also had a lot of trouble getting good fairness and convergence from our Mininet setup of the multi-flow experiment.  For example, on an m3-medium instance the flows were fair but their maximum throughput was only 30% of the allowed maximum.

Critique and Conclusions

From our experiments we’ve drawn two major conclusions:

  • When running in a virtualized environment, the PCC implementation breaks down worst when network buffers are very shallow.  It is at these small queue sizes that the original paper (corroborated by our Emulab data) demonstrates PCC’s greatest advantage over TCP, so this result lessens the impact of Figure 7 in the paper.  Virtual servers are overtaking their bare-metal counterparts across the internet today, and deploying PCC would do little to improve the performance of communicating with these machines over such shallow-buffered links.  To be fair, although the authors do not mention this in the original paper, the PCC GitHub repository is informative on the issue [4].  Additionally, the authors claim that the pacing issue can be fixed in their implementation; if this can be shown, we believe it would significantly improve the evaluation.
  • Running PCC in virtualized environments still provides users with significantly better convergence and fairness properties than TCP Cubic, as long as the network buffers are big enough and enough processing power is dedicated to the senders and receivers.  We saw significant loss of these properties when running 4 flows on machines with fewer than 8 cores, but results similar to the original paper when running on an 8-core instance.  Our hypothesis, supported by the PCC repository’s “Known Issues” section, is that each sender and receiver ought to be on a different core to prevent performance-breaking interrupts that damage pacing accuracy and hence fairness and convergence.  We do not know why these effects extend to networks with queue sizes as large as 250 packets, where the adverse effects of virtualization alone are attenuated.  The authors believe that this is also an implementation problem and are currently looking at SENIC as a way to manage pacing without “eating one core per flow.”  However, if enough hardware is dedicated to the current implementation, PCC can provide better flow stability and fairness than TCP Cubic to all users, virtualized or not.

 

Running Our Experiment

Using our Provided Amazon AMI (Recommended)

We have provided an Amazon AMI which you can use to launch an EC2 instance with our code already installed and ready to run.  To get set up, begin by going to your AWS EC2 management console and clicking “Launch Instance.”  Make sure that your region is set to “Oregon” using the region tab at the top right of your screen.  In the gray box on the left of the screen, click on the “Community AMIs” tab.  Next, copy our AMI code 21e21b41 into the “Search community AMIs” box.  Select this AMI and choose the c4-2XL instance type to launch the instance.

To log in to the instance, return to your EC2 management console and wait for the instance to finish initializing.  Use the public DNS of the form ec2-xx-yy-zzz-aaa.us-west-2.compute.amazonaws.com to ssh into the instance as user “ubuntu”.  Assuming you’ve kept your key in key.pem, your command will look like: ssh -i key.pem ubuntu@ec2-xx-yy-zzz-aaa.us-west-2.compute.amazonaws.com.

Once you have gained access, type the sequence of commands below to run the experiments and plot the data.  Note that our experiments take about 3 hours total to run (1 hour PCC multi-flow, 1 hour TCP multi-flow, 1 hour shallow-queue sweep), so please be patient.  We’ve included Mininet cleanup calls in between runs to ensure smooth sailing.  The commands:

cd cs244_researchpcc
sudo ./small-queue.sh
sudo mn -c
sudo ./multi-flow.sh

This will run both experiments, generating the TCP and PCC 4-flow convergence tests and the queue sweep.  The output plots of interest are pccmulti.png, tcpmulti.png, and Shallow_Queue.png.  These plots will be saved in your current directory and can be copied back to your host for viewing using scp.  Additionally, the multi-flow experiment’s statistics are reported to standard output.

If you are interested in recreating our PCC 2-flow convergence test, feel free to modify multi-flow.sh to have 4 hosts with a flow_time of 1000 each and then rerun.  However, this plot was mainly included for a clear visual comparison with the Emulab results; it is a subset of the 4-flow convergence test and therefore need not be reproduced for CS244 purposes.  Similarly, we included a convergence test on a machine with fewer cores to highlight our experimental process, but it is not the core result we are presenting.  If you are interested in reproducing those plots, simply run the AMI on an m4-large instance and repeat the process above.

Mo Dong graciously helped us get access to Emulab, but he cannot extend the same courtesy to everyone.  If you want to reproduce the Emulab portion of the results, you can contact the Emulab admins and apply for time based on their rules [2].

Starting from Scratch

If you would like to run these experiments without the machine image, we have a public git repository that can be used to set up a new instance from scratch.  Launch and log in to a new c4-2XL Ubuntu instance and run the following commands:

sudo apt-get update
sudo apt-get install git
git clone https://bitbucket.org/ZenGround0/cs244_researchpcc.git
cd cs244_researchpcc
sudo ./start_me_up.sh

At this point Mininet [3] and the PCC source [4] are installed, along with matplotlib.  To build PCC with the utility function used in the paper, edit pcc/sender/app/cc.h by commenting out line 303 and uncommenting line 302.  Then return to cs244_researchpcc and run our simple install script to make pcc:

sudo ./install_pcc.sh

At this point your instance should be in the same state as our AMI.  See the above section for running the experiments and generating plots.


One response to “CS244 ’16: PCC Shallow Queues and Fairness”

  1. Reproducibility Score: 3/5

    Reproducibility: We had trouble reproducing the results. For the first script, ./small-queue.sh, we ran it on two different EC2 machines. On one of the machines, we attempted it twice and it either didn’t finish or printed an error message:

    Traceback (most recent call last):
      File "plot_throughputs.py", line 78, in <module>
        plot_pcc_throughputs(axPlot)
      File "plot_throughputs.py", line 18, in plot_pcc_throughputs
        avg = sum(throughputs)/len(throughputs)
    ZeroDivisionError: integer division or modulo by zero

    On the other machine, we were able to successfully run the script and reproduce the results.

    For the second script, ./multi-flow.sh, we ran it multiple times on two machines. For all attempts and all machines, we ran the script for 2+ hours only to see the following output at best:

    ubuntu@ip-172-31-24-244:~/cs244_researchpcc$ sudo ./multi-flow.sh
    topo made
    r1 r1-eth0:switch2-eth1
    r2 r2-eth0:switch2-eth2
    r3 r3-eth0:switch2-eth3
    r4 r4-eth0:switch2-eth4
    s1 s1-eth0:switch1-eth1
    s2 s2-eth0:switch1-eth2
    s3 s3-eth0:switch1-eth3
    s4 s4-eth0:switch1-eth4
    switch1 lo: switch1-eth1:s1-eth0 switch1-eth2:s2-eth0 switch1-eth3:s3-eth0 switch1-eth4:s4-eth0 switch1-eth5:switch2-eth5
    switch2 lo: switch2-eth1:r1-eth0 switch2-eth2:r2-eth0 switch2-eth3:r3-eth0 switch2-eth4:r4-eth0 switch2-eth5:switch1-eth5
    sending round1
    launching tcp flow
    sending round2
    launching tcp flow
    sending round3
    launching tcp flow
    sending round4
    launching tcp flow

    It seemed to hang on the round4 TCP flow. We even re-pulled the repository and tried again at different times of the day with no success.

    Sensitivity Analysis: We resonate with the comments about the unpredictability of the effects of virtualization on the results. It was unclear why some instances worked and others didn’t, but we did our best to give the instructions multiple attempts on multiple machines. Why it worked on one machine and not on another, we have no idea. We found the conclusion and critique of the blog post to be good. The graphs look promising and the analysis seems to agree; it could be helped by greater quantification of results to explain the conclusions.
