CS244 ’16: QJUMP – Controlling Network Interference


Eli Berg and Gus Liu

Introduction

QJump was presented by Matthew P. Grosvenor et al. at NSDI 2015 as a scheme for mitigating network congestion caused by cross-application interference in datacenters. It enforces a trade-off between high throughput and low latency guarantees by coupling the two in a software rate limiter. The underlying idea is that a latency-sensitive application, such as PTPd, should be willing to accept stricter rate limits in exchange for a higher packet priority, while high-throughput applications such as Hadoop would make the opposite trade-off. Their discussion of “maximum fan-in” argues that the worst-case latency into a single end host in any given epoch can be bounded by assuming that every other host on the network attempts to send to it simultaneously. This abstraction forms the basis for their rate limiting equation.

In a single virtual-output queued switch, the authors note that in the worst case the number of packets that arrive concurrently equals the maximum fan-in of the switch, or the number of input ports on the switch. Thus, a packet must wait behind at most (max fan-in − 1) ≈ n other packets before it is serviced. Because a packet of size P takes P/R seconds to transmit at link rate R, the maximum interference delay is bounded by n·(P/R) + ε, where ε accounts for processing overheads. To cover the mesochronous case, in which senders’ epochs may be arbitrarily out of phase with one another, the network epoch is doubled to 2·n·(P/R) + ε. Thus, if hosts are limited so that they can send only one packet each network epoch, then no packet will take more than one network epoch to be delivered, even in the worst case.
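
Restated compactly in LaTeX (with n the maximum fan-in, P the packet size, R the link rate, and ε the cumulative processing overhead; the doubling in the epoch accounts for the mesochronous phase offset described above):

    \text{interference delay} \;\le\; (n - 1)\,\frac{P}{R} + \epsilon \;\approx\; n\,\frac{P}{R} + \epsilon

    \text{network epoch} \;=\; 2\,n\,\frac{P}{R} + \epsilon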

QJump is designed to be simple and immediately deployable, requiring no changes to hardware or application code. It works by assigning 802.1Q priorities to application traffic, which is then rate-limited at the network interface by a kernel traffic control (TC) module. The implementation consists of a Linux TC module, which handles rate limiting, and an application utility, which wraps unmodified application binaries so that their outgoing traffic is tagged with a specified 802.1Q priority.
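
For illustration only (the authors’ utility wraps unmodified binaries rather than requiring code changes like this), a minimal sketch of how outgoing traffic can be tagged from userspace on Linux via the SO_PRIORITY socket option; the host, port, and priority value below are hypothetical:

    import socket

    # SO_PRIORITY is Linux-specific; fall back to its numeric value (12) if this
    # Python build does not expose the constant.
    SO_PRIORITY = getattr(socket, "SO_PRIORITY", 12)

    def open_tagged_socket(host, port, priority):
        """Open a TCP socket whose outgoing packets carry the given skb priority.

        On a VLAN interface, the kernel maps skb priorities to 802.1Q PCP bits
        according to the interface's egress map. Priorities above 6 require
        CAP_NET_ADMIN.
        """
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, SO_PRIORITY, priority)
        s.connect((host, port))
        return s

    # Hypothetical usage: a latency-sensitive client requesting a high priority level.
    # sock = open_tagged_socket("10.0.0.2", 5000, priority=6)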

Results

The authors found that QJump exhibits better interference control than existing schemes such as ECN or DCTCP. With QJump enabled, the variance in Hadoop and PTPd performance is close to that of the uncontended ideal case, and memcached’s is slightly better than it. Moreover, QJump provides excellent overall average and 99th-percentile flow completion times (FCTs). It performs best on short flows, where it achieves similar or better average and 99th-percentile FCTs than pFabric; on large flows, however, the results are mixed.

Subset of Results Reproduced

We aimed to reproduce the experiment shown in Figure 5 of the paper, in which PTPd and memcached are run both over an otherwise empty network and alongside a running Hadoop job. We wanted to see whether adding QJump to the hosts would eliminate the noticeable effects of the Hadoop job, restoring performance close to that measured on the otherwise empty network.


Original Results


We chose this particular experiment to reproduce because, of all those presented in the paper, it seemed like a very realistic application of the system to a plausible datacenter workload. Additionally, the applications are clearly representative of distinct classes of network traffic. PTPd, which periodically synchronizes the clocks of all hosts running the daemon over the local network, is extremely latency sensitive. Memcached, a distributed key-value caching system, moves relatively small amounts of data at a time but also benefits from low latency to maintain cache consistency; it can therefore afford to be rate limited to an extent in exchange for stronger latency guarantees. Large Hadoop jobs move a lot of data around, but they generally aren’t very time-sensitive. When the three are run concurrently, we see that QJump is effective at several priority levels.


Our Results

 

Rather than conduct the experiment over a hardware network approximating a 12-node datacenter topology, as was done originally, we chose to run it over Mininet. This introduced several issues, the most obvious being scale. We used 15 Mb/s emulated links in Mininet, whereas the original experiment was conducted over 10 Gb/s links. For the TC module configuration, we set the timeq argument, which specifies the network epoch, to 19200 µs in accordance with the 2·(n·P)/R network epoch equation (the arithmetic is sketched below). We also reduced the scale of the memcached load generated by memaslap in several ways: we reduced the size of each transfer from 1024 bytes to 64 bytes, the concurrency level from 128 to 1 (since all emulated hosts were sharing the same hardware), and the number of concurrent loads from 25 to 10. Despite these changes in scale, we were unable to reproduce the latencies demonstrated in the paper. The three experiments were run for 10 minutes each, as in the paper.
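
As a sanity check on that timeq value, the calculation under our assumptions (n = 12 hosts, 1500-byte packets, 15 Mb/s links, ε neglected) works out as follows:

    # Network-epoch calculation used for the TC module's timeq argument.
    # Assumes n = 12 hosts, 1500-byte packets, and 15 Mb/s Mininet links;
    # epsilon is neglected, and the factor of 2 covers the mesochronous case.
    n = 12           # maximum fan-in (hosts)
    P = 1500 * 8     # packet size in bits
    R = 15e6         # link rate in bits per second

    epoch_us = 2 * n * P / R * 1e6
    print(epoch_us)  # -> 19200.0 microseconds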

Challenges

Making QJump work with Mininet – We encountered many of the same implementation challenges as last year’s group, who reproduced different QJump results with a similar approach. Luckily, we were able to use their descriptions of the issues to greatly simplify the search for workable solutions. Specifically, we implemented solutions and workarounds for the following problems: using the kernel clock instead of the TSC to judge timing in the qjump-tc module, installing QJump as a child class of Mininet’s TC module, configuring Mininet to use VLAN hosts to support priority tagging (a sketch of this step is shown below), supporting multiple queues in hosts, and supporting multiple queues for QoS at the switch.
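
As an example of the VLAN-host step, here is a rough sketch (not our exact script) of moving a Mininet host’s address onto an 802.1Q sub-interface whose egress map carries skb priorities through as PCP values; the VLAN id, address, and mapping are placeholders:

    def add_vlan(host, vlan_id=2, ip='10.0.0.1/8'):
        """Move a Mininet host's IP onto a VLAN sub-interface so that its
        outgoing frames carry an 802.1Q tag (and hence a priority field)."""
        intf = host.defaultIntf().name                  # e.g. 'h1-eth0'
        vintf = '%s.%d' % (intf, vlan_id)
        # Create the VLAN device and map Linux skb priorities 0-7 onto PCP 0-7.
        host.cmd('ip link add link %s name %s type vlan id %d '
                 'egress-qos-map 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7'
                 % (intf, vintf, vlan_id))
        host.cmd('ip addr flush dev %s' % intf)
        host.cmd('ip addr add %s dev %s' % (ip, vintf))
        host.cmd('ip link set %s up' % vintf)

    # Hypothetical usage, after net.start():
    #   for i, h in enumerate(net.hosts, start=1):
    #       add_vlan(h, ip='10.0.0.%d/8' % i)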

Hadoop – We found that it was intractable to run Hadoop over Mininet, given that all emulated hosts share a single filesystem for configuration. Since Hadoop’s only role in the experiment was traffic generation, and that traffic was only visible through the latency data collected from the other two applications, we abandoned running Hadoop directly over Mininet. Instead, we wrote a very simple traffic generator that simulates the MapReduce steps of file generation, distribution, shuffling, and aggregation by sending files of various sizes from one host to another via the netcat utility (a sketch follows below). This was also easier to scale than a real Hadoop job. It was hard to judge how well the substitute worked, because the interference it caused was only visible through the data from the other applications.
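
A rough sketch of what such a generator can look like, assuming Mininet host objects; the port, file sizes, round structure, and netcat flags below are illustrative rather than our exact parameters (netcat’s listen syntax also varies between variants):

    import random
    import time

    def send_file(sender, receiver, size_bytes, port=12345):
        """Push size_bytes of random data from one Mininet host to another."""
        # BSD-style netcat listen syntax; traditional netcat needs 'nc -l -p PORT'.
        receiver.cmd('nc -l %d > /dev/null &' % port)
        sender.cmd('head -c %d /dev/urandom > /tmp/payload' % size_bytes)
        sender.cmd('nc -q 1 %s %d < /tmp/payload' % (receiver.IP(), port))

    def simulate_mapreduce(hosts, rounds=10):
        """Crudely mimic the shuffle and aggregation phases of a MapReduce job."""
        for _ in range(rounds):
            # "Shuffle": every host sends a chunk to a randomly chosen peer.
            for src in hosts:
                dst = random.choice([h for h in hosts if h is not src])
                send_file(src, dst, size_bytes=5 * 1024 * 1024)
            # "Aggregate": all other hosts ship results to a single reducer.
            for src in hosts[1:]:
                send_file(src, hosts[0], size_bytes=1024 * 1024)
            time.sleep(1)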

Scaling and Memcached – Even with greatly reduced scale across all memaslap load-generation parameters, we were still unable to attain the target latencies exhibited by the original experiment. The stat dumps for memcached were also missing the expected lines for GET/SET statistics, so we had to modify the data processing to use another latency measurement.

Discussion

Clearly, we were not able to replicate the results observed in the paper. The experiment was nonetheless an attractive target for replication in a simulated environment: while it does show a clear effect of applying the QJump system, it demonstrates that effect qualitatively, relative to a baseline measurement taken on the same system. In theory, this means that even with results that look entirely dissimilar from those in the paper, we should still be able to reproduce the qualitative relationship between the three graphs. We were able to demonstrate this relationship to some extent.

PTPd does not exhibit the same flat-line behavior as it does in the original experiment. This is understandable because in our virtual topology all the hosts share hardware, whereas the hosts in the original experiment were not only separate, but each dedicated to a single running instance of only one of the three applications. On shared hardware, a highly latency-sensitive application such as PTPd is affected not only by network interference, but also by the processing load on the machine itself. Our results show this as a large, semi-regular deviation from 0 in the PTPd offsets. In the second graph, in the presence of network interference from the MapReduce simulator, the offset is significantly higher at times. This effect disappears in the third graph, indicating that QJump did mitigate the network interference to some degree. The more sustained shape of the deviations in the third graph could be due to a change in the pattern of network interference, or it could simply reflect a difference in the pattern of processing work on the machine itself caused by QJump.

In correspondence with us, the author expressed doubts about the feasibility of reproducing the experiment at such a reduced scale. Even the physical network topology used in the paper was at the small end of a conceivable datacenter topology. Tools like Hadoop, PTPd, and memcached are generally used in large distributed systems, and in some sense it isn’t obviously useful to run them on a virtual topology within a single machine, particularly one with a shared filesystem. At the same time, however, with the growth of network virtualization it has become much more conceivable that whole datacenter topologies could be represented on very few machines, and the issues we encountered in reproducing this experiment are directly relevant to that.

Reproduce Our Results

See our GitHub

 

 

3 responses to “CS244 ’16: QJUMP – Controlling Network Interference”

  1. With minor changes, we were able to make the code run to completion, but we were not able to reproduce the qjump results (the bottom third of the graph). Notes on the reproduction instructions:

    – The AMI is available in the US-West-2 (Oregon) region.

    – We had to edit run_experiment.py and run the experiments twice in order to produce all of the data: once running just experiments 1 and 2, and once running experiment 3. When trying to run all three experiments at once, the file containing the results of experiment 3 was not created.

    – There is a typo in the plotting instructions. The script is called plot_ptp_memcached_hadoop_timeline.py, not plot_ptpd_memcached_hadoop_timeline.py

    Here’s what we were able to reproduce: http://i.imgur.com/hghmv6j.png

  2. Hi all,
    Thanks for taking the time to try to reproduce our work. I have a couple of comments to think about.

    1) Using shared infrastructure does not replicate crucial parts of our experimentation environment. We are explicit in the paper that we use isolated, unshared machines to ensure that the effects of QJump alone are visible. In a shared environment you’re likely to see all sorts of other interference which we (explicitly in the paper, page 1) do not try to handle. QJump is about in-network interference only. There are other solutions you’d need to apply to run in a shared environment.

    2) It seems as if you’ve had two problems in reproducing the experiment. It might be helpful to think about them separately. The first problem is reproducing the baseline numbers, and the second is reproducing our results. Reproducing the second is contingent on reproducing the first. The first problem is also beyond our control. In the paper we mention that the performance we observe with PTP/Memcached on real hardware is in line with other reported results from Facebook and others. In effect we validate our baseline by reproducing others’ results. To do a fair comparison you would need to do the same. However, trying to get these programs to run in a simulated environment is a tough problem. If you cannot successfully do this, then there is little point in continuing with the next part of trying to replicate our results. If QJump fixes a few microseconds difference in interference, but you have a few hundred from other sources, then you will never see the effect.

    3) QJump very explicitly uses network hardware features available in commodity switches. One of the problems that you do not mention in your writeup is the fidelity of a software switch’s implementation of hardware switch features. I would be surprised if this worked very well because hardware switches are fast and deeply parallel, whereas software switches are slow and deeply pipelined. It’s very possible that some of the effects you’re seeing are coming from your software switch.

    At a high level, I think there is an overall question that you’ve not asked. Can Mininet-based simulations effectively simulate all networking research? What are the limits to this kind of simulation, and is it possible that in this experiment you’ve stepped out of them? If Mininet is not the right way to do this, then what might be?
