HULL: High Bandwidth, Ultra Low Latency


Team Members: Ben Shapero and Vaibhav Chidrewar.

Key Results:
HULL, using software-emulated phantom queues, shows significant reductions in packet latency with only a small drop in throughput, even without strict flow pacing.

References:

Less Is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center. Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. NSDI 2012.

Data Center TCP (DCTCP). Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. SIGCOMM 2010. http://www.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf

Random Early Detection Gateways for Congestion Avoidance. Sally Floyd and Van Jacobson. IEEE/ACM Transactions on Networking, Vol. 1, No. 4, pp. 397-413, August 1993.

Contacts:
Ben Shapero – bshapero@stanford.edu, Vaibhav Chidrewar – chirewarvaibhav@yahoo.co.in

Introduction:

Standard TCP over drop-tail queues, while very effective at fully utilizing the available bandwidth of a link, pays a cost in latency: senders keep switch buffers full, so every packet waits behind a long queue, and packet drops trigger retransmissions that inflate latency further. In data centers, where many services are latency-sensitive, this behavior is unacceptable. DCTCP was created to address this problem. While DCTCP does achieve lower latency than regular TCP, it still buffers packets in order to maintain maximum bandwidth.

HULL expands upon DCTCP, sacrificing a small amount of bandwidth in order to keep the queuing buffer essentially empty, resulting in latencies many times lower than even DCTCP's. It does this through two mechanisms: traffic pacing and phantom queues. A phantom queue is a simulated queue attached to a link that "drains" at slightly less than the actual link rate but never actually buffers packets; when its simulated occupancy crosses a threshold, packets are marked so that DCTCP senders back off before the real queue builds up. Because congestion is signaled early, packets do not accumulate in the switch, essentially eliminating switch queuing latency. This project tests the effectiveness of implementing phantom queues in the emulated network platform Mininet. We found that emulated phantom queues succeed in greatly reducing latency with only a small loss of throughput.

Methods:

In Mininet, programmers can specify arbitrary network topologies and configure individual nodes (hosts and switches), links, and interfaces. As in the HULL paper, we tested a simple star topology: a single switch connecting n hosts, with separate runs for n = 3 to 10. Host 1 serves as the receiver, while the other n-1 hosts each send a TCP iperf flow to it. All links in the topology are set to 50 Mbps, as Mininet does not have the capacity to emulate that many 1 Gbps links; the packet size of these flows is 1500 bytes. During each test we collect throughput and queue occupancy on the switch-to-receiver link and latency on the sender-to-switch interfaces.
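For concreteness, below is a minimal Python sketch of this star topology and iperf setup using Mininet's standard API. It is not our actual experiment script (which also installs the modified RED qdisc and collects the measurements described above), and names such as StarTopo are purely illustrative.

# Minimal sketch of the star topology described above; run with sudo.
from time import sleep

from mininet.topo import Topo
from mininet.net import Mininet
from mininet.link import TcLink


class StarTopo(Topo):
    "Single switch connecting n hosts; every link limited to 50 Mbps."
    def build(self, n=3):
        switch = self.addSwitch('s1')
        for i in range(1, n + 1):
            host = self.addHost('h%d' % i)
            self.addLink(host, switch, bw=50)  # 50 Mbps, enforced via tc


if __name__ == '__main__':
    net = Mininet(topo=StarTopo(n=3), link=TcLink)
    net.start()
    h1 = net.get('h1')                 # h1 acts as the receiver
    h1.cmd('iperf -s &')
    for sender in net.hosts:
        if sender is not h1:
            # the remaining n-1 hosts each send a TCP flow to h1
            sender.cmd('iperf -c %s -t 30 &' % h1.IP())
    sleep(35)                          # let the flows finish
    net.stop()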

Implementing phantom queues required modifying the RED implementation in the Linux kernel. Normally, RED estimates the queue size by averaging its length over time and probabilistically drops or marks packets when that average is between min and max. We added two fields to the RED parameters: hull_ctr, which counts the number of bytes in the phantom queue, and a timestamp of the last packet arrival. On every packet arrival, instead of the regular queue-size calculation, we reduce hull_ctr by the number of bytes that could have been sent at the drain rate since the previous timestamp, then increase the counter by the size of the arriving packet (1500 bytes). If hull_ctr exceeds min, RED is told to mark the packet. No other modifications to RED were made or required. Mininet uses the Linux kernel's tc qdisc implementation, so inserting our compiled red-hull module into the kernel is the only installation needed.
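The actual change is a small patch to the kernel's RED qdisc (C code). The Python sketch below only illustrates the per-packet bookkeeping described above; the class and parameter names are hypothetical, and clamping the counter at zero is our assumption about how the drained amount is bounded.

import time

class PhantomQueue(object):
    """Illustrative model of the phantom-queue counter added to RED.

    drain_rate_bps is the phantom drain rate in bits per second
    (e.g. 95% of the 50 Mbps line rate); min_thresh_bytes plays the
    role of RED's 'min' threshold.
    """
    def __init__(self, drain_rate_bps, min_thresh_bytes):
        self.drain_bytes_per_sec = drain_rate_bps / 8.0
        self.min_thresh = min_thresh_bytes
        self.hull_ctr = 0                  # simulated backlog in bytes
        self.last_arrival = time.time()

    def on_packet_arrival(self, pkt_len=1500):
        now = time.time()
        # Drain the phantom queue at the configured rate since the last
        # arrival, never letting the counter go below zero (assumption).
        drained = (now - self.last_arrival) * self.drain_bytes_per_sec
        self.hull_ctr = max(0, self.hull_ctr - drained)
        self.last_arrival = now
        # Add the arriving packet to the phantom queue.
        self.hull_ctr += pkt_len
        # Mark the packet if the phantom backlog exceeds the min threshold.
        return self.hull_ctr > self.min_thresh

# Example: phantom queue draining at 95% of a 50 Mbps link, marking once
# its simulated backlog exceeds 3000 bytes (an example threshold).
pq = PhantomQueue(drain_rate_bps=0.95 * 50e6, min_thresh_bytes=3000)
should_mark = pq.on_packet_arrival(1500)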

Results:

Our results confirm the findings summarized by Figure 8 in the HULL paper. The average latency for a single flow was lower for DCTCP with phantom queues than for DCTCP without them, and both were significantly lower than for simple drop-tail TCP. Throughput is slightly lower for DCTCP with phantom queues, but the drop is small: with the phantom queue draining at 95% of the 50 Mbps line rate, throughput remained above 90% of the line rate.

Phantom queues also succeed in keeping the switch buffer essentially empty during the entire lifetime of a flow. The plot below shows that the number of packets in the queue was always between 0 and 2. Because we do not use pacing to smooth variations in packet arrival times, this small backlog is not surprising, and the queue drains again very quickly.

Conclusions:

HULL, even with no pacing and only phantom queues, keeps buffer occupancy extremely low; good pacing, as described in the original paper, would likely reduce it even further. In our experiments the queue held only one to two packets during most of the test. This small occupancy can be explained by a packet arriving while the previous packet is still being transmitted, so the new packet waits in the buffer until the old one has completely left the switch. The bandwidth sacrificed is small relative to the drop in latency: we configured the phantom queues to drain at 95% of the link speed, yet average throughput was still over 90% of TCP's maximum bandwidth. Mininet, while limited in its capacity, is more than capable of replicating these kinds of experiments. In particular, the original HULL paper implemented phantom queues in hardware, whereas Mininet runs entirely in software, and this change did not alter the overall results. By sacrificing a small amount of bandwidth, switch latency can be essentially eliminated.

Acknowledgements:

Special thanks to Prof. Nick McKeown, Nikhil, Brandon, Bob, Vimal, and Mohammad (author of the HULL paper) for guiding us throughout.

Instructions to Replicate This Experiment:


To replicate:
1. Launch a c1.xlarge instance in the Amazon EC2 Oregon data center using the AMI CS244_HULL_Vai_Ben.
2. Log in to the instance as user ubuntu.

(Mac, Linux) Use the -X option with the ssh command to enable X11 forwarding.

(Windows) Enable X11 forwarding for the SSH connection. In PuTTY this is under Connection -> SSH -> X11; check the 'Enable X11 forwarding' box.

3. Go to the home directory:
cd /home/ubuntu
4. Run the following command:
sudo ./vrun.sh
5. The experiment takes about half an hour to run and generates the following two graphs:

1) /home/ubuntu/cs244-12-bs/all-latency.png
2) /home/ubuntu/cs244-12-bs/all-throughput.png
