Team: Antonin Bas and Diego Pontoriero
- Aggarwal, A., Savage, S., Anderson, T. Understanding the Performance of TCP Pacing. INFOCOM ’00.
- Appenzeller, G., Keslassy, I., McKeown, N. Sizing Router Buffers. SIGCOMM ’04.
Network traffic is inherently bursty. As a result, all network switches have buffer queues that absorb additional packets that arrive during traffic bursts. The buffers then drain during traffic lulls, keeping average utilization of outgoing links high and causing fewer packets to be dropped.
One open question in the networking community is whether evenly spacing traffic flows, or pacing, can decrease burstiness and allow for lower delays, better throughput, and smaller required queue sizes. In Understanding the Performance of TCP Pacing (Aggarwal et al., 1999) the authors investigated the impact of modifying TCP to evenly space packets throughout the duration of the round-trip time (RTT). Using a simulated network consisting of clients and servers communicating over a single bottleneck link, the authors found that for a single flow, pacing increases throughput compared to unpaced TCP. However, when 50 clients communicated with 50 servers concurrently, the authors observed that pacing actually caused lower throughput due to a synchronizing effect on all of the flows’ congestion control algorithms: since the flows increase their congestion windows uniformly, they experience congestion events at roughly the same time and enter multiplicative decrease together. In aggregate this causes the total throughput to oscillate widely. In contrast, when flows are bursty, each flow experiences a congestion event at a random time, so synchronized drops in throughput do not occur. This surprising result implied that for real-world networks pacing was not preferable to existing, bursty implementations. Another interesting observation the authors make is that TCP Pacing leads to greater fairness among flows sharing a bottleneck link: intuitively, when flows are less bursty there is a lower probability of one lucky flow “grabbing” a larger fraction of the bottleneck capacity.
Though buffers are a necessary component of a network with bursty traffic, large buffers can increase latency dramatically when they are full. In Sizing Router Buffers (Appenzeller et al., 2004) the authors propose that the popular rule of thumb for sizing buffers on switches, known as the “bandwidth-delay product” rule, is overly conservative for links with a large number of flows. The motivation for the bandwidth-delay product rule is that the buffer should be large enough to keep the output link busy when a sender experiences a congestion event and temporarily stops sending packets as it applies multiplicative decrease to its congestion window. In the paper the authors use a statistical approach to show that when many flows are present, the buffer size can be reduced by a factor of the square root of the number of flows while still maintaining high expected utilization and throughput.
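The two sizing rules are easy to compare with a bit of arithmetic. The sketch below uses illustrative numbers of our own choosing (a 100 Mb/s link, 100 ms RTT, 1500-byte packets, 50 flows), not the setups from either paper:

```python
import math

def bdp_packets(bandwidth_bps, rtt_s, pkt_bytes=1500):
    """Bandwidth-delay product expressed in full-size packets."""
    return bandwidth_bps * rtt_s / (8 * pkt_bytes)

def revised_buffer(bdp_pkts, n_flows):
    """Appenzeller et al.'s revised rule: buffer = BDP / sqrt(n)."""
    return bdp_pkts / math.sqrt(n_flows)

bdp = bdp_packets(100e6, 0.1)            # 100 Mb/s, 100 ms RTT (assumed)
print(round(bdp))                        # 833 packets under the classic rule
print(round(revised_buffer(bdp, 50)))    # 118 packets with 50 flows
```

With many flows the required buffering shrinks quickly: here a factor of sqrt(50) ≈ 7 separates the two rules.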
In this blog post we explore the above concepts in two ways. First, we attempt to reproduce the results seen by Aggarwal et al. for a single flow and multiple flows, investigating both cumulative throughput and fairness. Since our kernel patch can only be completely enabled or disabled, we were unable to reproduce the section examining the interaction of paced and non-paced flows. Second, we investigate whether pacing has any impact on Appenzeller et al.’s “revised rule of thumb” for switch buffer sizing. We provide all of the code and details describing our experimental setup for anyone to verify and expand upon.
Our simulation is conducted using Mininet, an open-source network emulation package. We ran all of our simulations on a c1.xlarge Amazon EC2 instance. While the authors used a different platform, ns2, we chose Mininet for two reasons. First, it was the only platform with which we had any experience, and it seemed well-adapted to the paper’s experiments, which did not require high cumulative bandwidth. Second, we thought it would be interesting to compare the two platforms and see if we observed the same behavior with both. In particular, ns2 is a virtual-time simulator, whereas Mininet is a real-time emulator, and clearly time is the key differentiator between paced and unpaced flows.
You can view and download the code used for the simulation here. The README contains instructions for setting up the correct simulation environment.
As stated above, the results in Aggarwal et al. were generated using an ns2 simulation. We tried to reproduce the ns2 topology as exactly as possible in Mininet, but we had to make some tradeoffs. First, because of Mininet’s scalability issues, we found it safer to run experiments in a topology with 5 sender-receiver pairs each running 10 flows, rather than 50 sender-receiver pairs each running one flow. Similarly, we had to reduce all of the bandwidth values, since Mininet performs poorly when the cumulative bandwidth exceeds 3Gbps. Second, because Mininet relies on netem to emulate the network, a packet is only removed from a switch buffer once it has left the “link” (or rather the succession of downstream buffers, which in Mininet, and probably also in the original ns2 topology, are replaced by a single high-delay link). Accordingly, we had to increase the bottleneck router’s buffer capacity to obtain graphs comparable to the paper’s, using the formula `B_new = B_old + link_delay * link_rate`.
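The adjustment can be sketched numerically. The helper and the values below (a 50-packet base buffer, 50 ms link delay, 10 Mb/s link rate, 1500-byte packets) are our own illustrative assumptions, not the exact parameters in our scripts:

```python
def adjusted_buffer_pkts(b_old_pkts, link_delay_s, link_rate_bps, pkt_bytes=1500):
    """B_new = B_old + link_delay * link_rate, converted to packets.

    Packets sitting inside netem's delay emulation still occupy the queue,
    so we grow the buffer by one link_delay * link_rate worth of packets.
    """
    in_flight = link_delay_s * link_rate_bps / (8 * pkt_bytes)
    return b_old_pkts + in_flight

print(round(adjusted_buffer_pkts(50, 0.05, 10e6), 1))   # 91.7 packets
```

In other words, the buffer gains roughly one link's worth of in-flight packets on top of its nominal size.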
Modified Linux Kernel
The standard Linux kernel does not contain support for TCP pacing. We adapted and updated an old patch (see Acknowledgments below) to work with the 3.5 kernel, and solved some serious bugs with the original implementation. Instructions for installing the patch are included in the simulation code’s README.
When enabled, the new code paces TCP flows by constraining the TCP output engine so that the packets in the congestion window are spread as evenly as possible across the estimated round-trip time (RTT). Compared to the original paper, our implementation uses the same RTT estimate as the main output engine rather than a separate timer. We also enable pacing on both the sender and receiver sides rather than just the sender. We do not believe that either difference results in a significant deviation in behavior.
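The core idea reduces to an inter-packet gap of RTT / cwnd. A minimal user-space sketch of that idea follows; all names here are ours (hypothetical), and the real patch hooks the kernel's TCP output engine with a timer rather than sleeping:

```python
import time

def pacing_interval(srtt_s, cwnd_pkts):
    """Inter-packet gap so that cwnd_pkts transmissions span one RTT."""
    return srtt_s / max(cwnd_pkts, 1)

def send_window(packets, srtt_s, transmit, sleep=time.sleep):
    """Transmit one window's packets, pausing pacing_interval between them."""
    gap = pacing_interval(srtt_s, len(packets))
    for pkt in packets:
        transmit(pkt)   # stand-in for the kernel's packet-output routine
        sleep(gap)      # the kernel patch arms a timer instead of sleeping
```

With a 300 ms smoothed RTT and a window of 10 packets, this sends one packet every 30 ms instead of all 10 back to back.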
Running the Simulation
You can run the simulation using the command `sudo ./run.sh` for a single flow and `sudo ./run_mflows.sh` for multiple flows. Parameters such as the number of sender-receiver pairs and the number of flows per pair can be changed directly in the appropriate shell script. A folder will be created for each simulation run (named `tcppacing-timestamp` for one flow and `tcppacing-manyflows-timestamp` for multiple flows), and will include the following generated graphs:
- cthroughput.png: cumulative throughput for Pacing and NewReno.
- cwnd.png: congestion windows for Pacing and NewReno.
- fairness.png: fairness with respect to cumulative throughput for Pacing and NewReno.
- rtt.png: RTTs as measured by ping between source and end hosts for Pacing and NewReno.
Below we show Figures 3 and 4 in the TCP Pacing paper, followed by the corresponding results produced by our simulation. They represent a single flow between one sender-receiver pair.
As expected, with a single flow the cumulative throughput, as measured at the bottleneck router, increases more quickly for TCP Pacing than for NewReno. As observed in the paper, pacing achieves better performance in the initial period: during the slow-start phase the first loss occurs later for the paced flow, which allows it to get ahead. However, our simulation results differ from the paper in two ways. First, we did not observe as large a difference between the two cumulative throughputs as shown in the paper: in our simulation, NewReno’s slow-start incurs its first packet loss later than in the paper. This is probably because the bottleneck router in our topology can absorb more burstiness; as explained above, netem’s queue management forced us to use a larger queue to obtain similar results. Second, our NewReno congestion-window sawtooth exhibits decreasing slopes as RTTs increase while the bottleneck queue fills, as one would expect, whereas the paced flow’s slope decreases much less. Interestingly, the paper shows no clear difference between NewReno and Pacing in this regard.
One additional observation is that in the paper, both flows time out after the initial slow-start and collapse their congestion window to its minimum before entering slow-start again, whereas in our experiment both flows enter fast recovery. One potential explanation is that the paper (written in 2000 and using ns2) used a slightly different congestion-recovery algorithm, which may also explain the larger gap in cumulative bandwidths in their results compared to ours.
Below are Figures 6 and 7 from the TCP Pacing paper, followed by corresponding results from our simulation. They represent multiple flows (50) spread between 5 sender-receiver pairs.
For both graphs, our results differ substantially from the paper’s: our cumulative throughput plot is much smoother, and our fairness plot does not show any improvement when pacing is on. We are confident in our pacing implementation, so the explanation must lie elsewhere. Our intuition is that this is due to desynchronization of the flows in Mininet. Because ns2 is a simulator and uses virtual time, if all the flows start at the same time and are paced they will be highly synchronized, at least during the slow-start phase. In Mininet, flows are started sequentially, and the nondeterministic nature of the emulator (it is subject to kernel scheduling and competing processes on the host machine) certainly does not favor synchronization. The method we use to evaluate each flow’s cumulative throughput may also be a factor: because we initiate several flows on the same host, we cannot measure the total number of bytes sent at an interface (as we do for total throughput) and instead use the tcpprobe log, which may be slightly inaccurate.
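For reference, the kind of tcpprobe post-processing we mean can be sketched as below. The column layout (timestamp, src, dst, length, snd_nxt, snd_una, ...) matches the classic tcp_probe kernel module's output, but treat the exact fields, and the sample log lines, as assumptions for illustration:

```python
def cumulative_bytes(lines):
    """Estimate per-flow cumulative bytes sent from snd_una progression."""
    first_una = {}
    last_una = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        flow = (fields[1], fields[2])    # (src:port, dst:port)
        una = int(fields[5], 16)         # snd_una is hex in tcp_probe output
        first_una.setdefault(flow, una)
        last_una[flow] = una
    return {f: last_una[f] - first_una[f] for f in last_una}

log = [
    "0.01 10.0.0.1:5001 10.0.0.2:5001 1472 0x10 0x0 10 2147483647",
    "0.50 10.0.0.1:5001 10.0.0.2:5001 1472 0x4000 0x3a00 20 2147483647",
]
print(cumulative_bytes(log))   # {('10.0.0.1:5001', '10.0.0.2:5001'): 14848}
```

Because snd_una is only sampled when probe events fire, the estimate can lag the true byte count slightly, which is the inaccuracy mentioned above.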
In most cases, pacing does seem to improve fairness a little during the slow-start phase. Still, there is much variability from one simulation to the next and it is hard to deduce anything significant from the graphs.
To evaluate the impact of pacing on minimum buffer size we used a previous experiment that we had developed for this purpose. Running the experiment with both pacing enabled and disabled resulted in the following plot:
You can reproduce a similar graph by typing the command `sudo ./run_buffersizing.py`. The graph `result.png` will be placed in a folder called `buffersizing-timestamp`.
According to this graph, pacing increases the minimum required buffer size. While this size is still below the “revised rule of thumb”, it can be significantly above the minimum size required for regular TCP NewReno (e.g. for 100 flows). This could be explained by increased synchronization of the flows, making statistical multiplexing less efficient (higher variability in the sum of all congestion windows). Perhaps 100 flows is too many for synchronization to arise between TCP NewReno flows, even with identical RTTs, but not enough to avoid synchronization when flows are paced. However, this would contradict the paper, which claims that in steady state, pacing is more likely to have a de-synchronization effect.
Indeed, the authors claim that in steady state, because of this de-synchronization, throughput is higher with TCP Pacing (section V.B.2). In their experimental setup, the bandwidth-delay product is 1250 packets, whereas the buffer size is 312 packets. According to the revised rule of thumb, this amount of buffering should be sufficient since there are 50 flows (requiring about 180 packets), meaning that the bottleneck link should never be under-utilized. Our experimental results (PA2) suggest that with regular TCP (CUBIC or Reno) the amount of buffering required is actually even smaller than that. For a similar setup (long iperf flows and seemingly similar RTT and bottleneck bandwidths, with a bandwidth-delay product of 1180 packets instead of 1250), less than 100 packets of buffering was required for regular TCP, as shown in the above graph. This number was even smaller for the real hardware topology, even for flows with the same RTT. Since it is impossible to achieve greater than 100% utilization, how can throughput be better with paced TCP than with regular TCP Reno? There appears to be some contradiction between our buffer-sizing experiment’s conclusions and the article’s claims, which we believe warrants further exploration in future work.
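The 180-packet figure quoted for the paper's setup is just the revised rule applied to its parameters, which can be checked directly:

```python
import math

# Sanity check of the numbers discussed above: a bandwidth-delay product of
# 1250 packets shared by 50 flows, under the revised rule BDP / sqrt(n).
required = 1250 / math.sqrt(50)
print(round(required))   # 177 -- i.e. the ~180 packets quoted, well under 312
```

So the 312-packet buffer in the paper's setup exceeds the revised rule's requirement by a comfortable margin.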
In this blog post we have presented two papers related to TCP flow performance and recreated several experiments from those papers. We provide the source code and instructions necessary to reproduce these results on a commodity Amazon EC2 virtual machine.
Though we were able to closely match the TCP Pacing paper’s results for a single flow, our results for multiple flows sharing a bottleneck link did not match. We believe this is due to the random variability in the Mininet environment as compared to ns2. However, we also argue that this more realistically reflects real-world environments, potentially diminishing the authors’ argument that Pacing is not beneficial for widespread use.
To further test this, we investigated the impact of pacing on the “revised rule of thumb” for router buffer sizing. Surprisingly, we found that pacing requires larger buffers than regular TCP Reno for certain numbers of flows. We believe this is due to slight synchronization between paced flows at these levels. For large numbers of flows, however, the difference between pacing and Reno becomes negligible, as one would expect.
The Linux kernel modification for TCP pacing is based on a freely distributed patch created by Daniele Lacamera and released under the GPL. The original patch and more information can be found at the project website.
The buffer sizing experiment code is based on starter code developed for Stanford CS244 (Winter 2013) by the Professors and TAs of that class.