Eli Berg and Gus Liu
QJump was presented by Matthew P. Grosvenor et al. at NSDI 2015 as an algorithm for mitigating network congestion caused by cross-application interference in datacenters. It enforces a trade-off between high throughput and low latency by coupling the two in a software rate limiter: a latency-sensitive application, such as PTPd, should be willing to accept stricter rate limits in exchange for a higher packet priority, while a high-throughput application such as Hadoop would make the opposite trade-off. Their discussion of “maximum fan-in” concludes that the worst-case latency into a single end host in any given epoch can be bounded by considering its ingress when every other host on the network simultaneously attempts to communicate with it. This abstraction forms the basis of their rate-limiting equation.
In a single virtual-output-queued switch, the authors note that in the worst case the number of packets that arrive concurrently equals the maximum fan-in of the switch, i.e. the number of its input ports. Thus, a packet must wait behind at most (max fan-in − 1) ≈ n packets before it is serviced. Because a packet of size P takes P/R seconds to transmit at link rate R, the maximum interference delay can be bounded at (n*P/R)+ε, where ε accounts for processing overheads. To account for the mesochronous case, in which host epochs share a period but have arbitrary phase relationships, the network epoch must be doubled to 2(n*P/R)+ε. Thus, if hosts are limited so that they can only send one packet each network epoch, then no packet will take more than one network epoch to be delivered in the worst case.
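As a concrete check of these bounds, the sketch below computes the worst-case interference delay and the doubled network epoch. The parameters (n = 12 hosts, 1500-byte packets, 15 Mb/s links) are assumptions matching our scaled-down Mininet setup rather than the paper's 10 Gb/s testbed, and ε is taken as zero for simplicity.

```python
def interference_delay_us(n_hosts, packet_bytes, rate_bps, epsilon_us=0.0):
    """Worst-case time (in µs) a packet waits while n hosts' packets drain:
    (n * P / R) + epsilon."""
    return n_hosts * packet_bytes * 8 / rate_bps * 1e6 + epsilon_us

def network_epoch_us(n_hosts, packet_bytes, rate_bps, epsilon_us=0.0):
    """Network epoch, doubled to cover mesochronous (arbitrary-phase)
    host epochs: 2 * (n * P / R) + epsilon."""
    return 2 * n_hosts * packet_bytes * 8 / rate_bps * 1e6 + epsilon_us

# Assumed scaled-down parameters: 12 hosts, 1500 B MTU, 15 Mb/s links.
delay = interference_delay_us(12, 1500, 15e6)  # 9600.0 µs
epoch = network_epoch_us(12, 1500, 15e6)       # 19200.0 µs
```

The doubled epoch of 19200 µs is exactly the value we later configure as the TC module's timeq parameter.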
QJump is designed to be simple and immediately deployable, requiring no changes to hardware or application code. It works by assigning 802.1Q priorities to application traffic, which is then rate-limited at the network interface by a kernel traffic control (TC) module. The implementation of QJump comprises a Linux TC module, which handles rate limiting, and an application utility, which runs unmodified application binaries so that their outgoing traffic is tagged with a specified 802.1Q priority.
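QJump's utility intercepts an application's socket calls; as a rough illustration of the underlying Linux mechanism, the sketch below sets the SO_PRIORITY socket option, which the kernel's egress QoS map translates into the 802.1Q priority bits on VLAN interfaces. The priority value here is an illustrative assumption, not QJump's own level assignment.

```python
import socket

def open_prioritized_socket(priority: int) -> socket.socket:
    """Create a UDP socket whose traffic carries the given skb priority.

    On a VLAN interface, the kernel's egress-qos-map turns this priority
    into the 802.1Q PCP field of outgoing frames.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # SO_PRIORITY is Linux-specific; values 0-6 need no special privileges.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY, priority)
    return sock

# e.g. tag a latency-sensitive application's traffic at a high level
s = open_prioritized_socket(5)
```

QJump achieves the same effect without touching application code by interposing on socket creation at launch time.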
The authors found that QJump exhibits better interference control than existing schemes such as ECN or DCTCP. The variance in Hadoop and PTPd performance is close to that of the uncontended ideal case, and in memcached's case slightly better. Moreover, QJump provides excellent overall average and 99th-percentile flow completion times (FCTs). It performs best on short flows, where it achieves similar or better average and 99th-percentile FCTs than pFabric; on large flows, however, the results are mixed.
Subset of Results Reproduced
We aimed to reproduce the experiment shown in Figure 5, in which PTPd and memcached are run over an otherwise empty network, both with and without a concurrently running Hadoop job. We wanted to see whether adding QJump to the hosts would eliminate the noticeable effects of the running Hadoop job, restoring behavior close to that observed on the otherwise empty network.
We chose this particular experiment to reproduce because, of all those presented in the paper, it seemed like a very realistic application of the system to a plausible datacenter workload. Additionally, the applications are clearly representative of distinct classes of network traffic. PTPd, which periodically synchronizes the clocks of all hosts running the daemon over the local network, is incredibly latency-sensitive. Memcached, a distributed key-value caching system, moves relatively small amounts of data at a time but also benefits from low latency to maintain cache consistency; it can therefore afford to be rate-limited to an extent in exchange for stronger latency guarantees. Large Hadoop jobs move a lot of data around, but they generally aren't very time-sensitive. When the three are run concurrently, we see that QJump is effective at several priority levels.
Rather than conduct the experiment over a hardware network approximating a 12-node datacenter topology, as was done originally, we chose to conduct it over Mininet. This introduced several issues, the most obvious being scale: we used 15 Mb/s emulated links in Mininet, whereas the original experiment was conducted over 10 Gb/s links. For the configuration of the TC module, we set the timeq argument, which specifies the network epoch, to 19200 µs in accordance with the epoch equation. We also reduced the scale of the memcached load generated by memaslap in several ways: we reduced the size of each transfer from 1024 bytes to 64 bytes, the concurrency level from 128 to 1 (since all emulated hosts shared the same hardware), and the number of concurrent loads from 25 to 10. Despite these changes in scale, we were unable to reproduce the latencies demonstrated in the paper. The three experiments were run for 10 minutes each, as in the paper.
Making QJump work with Mininet – We encountered many of the same implementation challenges as last year's group, who reproduced different QJump results with a similar approach. Luckily, we were able to use their descriptions of the issues to greatly simplify the search for workable solutions. Specifically, we implemented solutions and workarounds for the following problems: using the kernel clock instead of the TSC to judge timing in the qjump-tc module, installing QJump as a child class of Mininet's TC module, configuring Mininet to use VLAN hosts to support priority tagging, supporting multiple queues in hosts, and supporting multiple queues for QoS at the switch.
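Mininet hosts do not carry 802.1Q tags out of the box; the VLAN-host workaround follows the pattern of Mininet's vlanhost example, moving each host's IP onto a tagged subinterface at configuration time. The sketch below builds the iproute2 commands such a host would run; the interface name and VLAN ID are illustrative assumptions.

```python
def vlan_setup_cmds(intf: str, vlan_id: int) -> list:
    """Shell commands to move a Mininet host's traffic onto an 802.1Q
    subinterface (e.g. h1-eth0.100), whose egress-qos-map lets skb
    priorities become 802.1Q PCP bits."""
    vlan_intf = "%s.%d" % (intf, vlan_id)
    return [
        # strip any IP config from the raw interface
        "ip addr flush dev %s" % intf,
        # create the tagged subinterface with a 1:1 priority -> PCP map
        "ip link add link %s name %s type vlan id %d "
        "egress-qos-map 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7" % (intf, vlan_intf, vlan_id),
        "ip link set %s up" % vlan_intf,
    ]

cmds = vlan_setup_cmds("h1-eth0", 100)
```

In a Mininet VLANHost subclass, these strings would be passed to self.cmd() from the host's config() method before assigning its IP to the new subinterface.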
Hadoop – We found it intractable to run Hadoop over Mininet, because our emulated hosts shared a single file system for configuration. Since Hadoop's only role in the experiment was traffic generation, and that traffic was only visible through the latency data collected from the other two applications, we abandoned running Hadoop directly over Mininet. Instead, we wrote a very simple traffic generator that simulates the MapReduce steps of file generation, distribution, shuffling, and aggregation by sending files of various sizes from one host to another via the netcat utility. This was also easier to scale than a real Hadoop job, though it was hard to judge how faithfully it reproduced Hadoop's traffic, since the interference was only visible through the other applications' data.
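Our actual generator drove netcat between Mininet hosts; as a self-contained illustration of one simulated shuffle step, the sketch below performs the equivalent bulk transfer with plain Python sockets over loopback. The port number and partition size are illustrative assumptions.

```python
import os
import socket
import threading

def receive_all(conn: socket.socket) -> int:
    """Drain one connection, returning bytes received (like `nc -l > part`)."""
    total = 0
    while chunk := conn.recv(65536):
        total += len(chunk)
    return total

def shuffle_transfer(payload: bytes, host: str = "127.0.0.1", port: int = 9999) -> int:
    """Send one simulated shuffle partition to a listening peer and
    return the number of bytes the peer received."""
    received = []
    server = socket.create_server((host, port))

    def peer():  # the receiving host's side of the transfer
        conn, _ = server.accept()
        with conn:
            received.append(receive_all(conn))

    t = threading.Thread(target=peer)
    t.start()
    with socket.create_connection((host, port)) as s:
        s.sendall(payload)  # like `nc host port < part`
    t.join()
    server.close()
    return received[0]

# One 64 KiB "partition", standing in for a shuffle step between two hosts.
received = shuffle_transfer(os.urandom(64 * 1024))
```

The real generator varied the file sizes per MapReduce phase and ran many such transfers concurrently between different host pairs.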
Scaling and Memcached – Even with greatly reduced scale across all memaslap load-generation parameters, we were still unable to attain the target latencies exhibited by the original experiment. The stat dumps for memcached were also missing the expected lines for GET/SET statistics, so we had to modify the data processing to use another latency measurement.
Clearly, we were not able to replicate the absolute results observed in the paper. The experiment was nonetheless an attractive target for replication in a simulated environment because, while it shows a clear effect of applying the QJump system, it demonstrates that effect qualitatively, relative to a baseline measurement from the same system. In theory, this means that even with results that look entirely dissimilar from those in the paper, we should still be able to reproduce the qualitative relationship among the three graphs. We were able to demonstrate this relationship to some extent.
PTPd does not exhibit the same flat-line behavior as in the original experiment. This is understandable because in our virtual topology all the hosts share hardware, whereas the hosts in the original experiment were not only separate machines, but each dedicated to a single running instance of one of the three applications. On shared hardware, a highly latency-sensitive application such as PTPd is affected not only by network interference but also by the processing load on the machine itself. Our results demonstrate this as a large, semi-regular deviation from 0 in the PTPd offsets. In the second graph, in the presence of network interference from the MapReduce simulator, the offset is significantly higher at times. This effect disappears in the third graph, indicating that QJump had some effect in mitigating the network interference. The more sustained shape of the deviations in the third graph could be due to a change in the pattern of network interference, or it could simply reflect a difference in the pattern of hardware load on the machine itself caused by QJump.
In correspondence with the author, he expressed doubts about the feasibility of reproducing the experiment at such a reduced scale. Even the physical network topology used in the paper was at the small end of a conceivable datacenter topology. Tools like Hadoop, PTPd, and memcached are generally deployed in large distributed systems, and in some sense it seems of little practical use to run them on a virtual topology within a single machine, particularly one with a shared file system. At the same time, however, with the growth of network virtualization it has become much more conceivable that whole datacenter topologies could be represented on very few machines, and the issues we encountered in reproducing this experiment are directly relevant to that setting.
Reproduce Our Results
See our GitHub