Post by Luke Hsiao and Jervis Muindi
The BBR congestion-based congestion control algorithm was designed to operate near the optimum operating point, where round-trip time is minimized while delivery rate is maximized. In addition, since BBR is congestion-based, and not loss-based like traditional TCP algorithms, Cardwell et al. show that BBR is able to achieve much better utilization of throughput at higher loss rates than TCP CUBIC. In this report, we seek to verify the performance of the BBR congestion control algorithm in lossy networks. We find that BBR does indeed achieve higher throughputs than CUBIC in lossy networks and show that this behavior holds true across orders of magnitude of bottleneck bandwidths, a variety of round trip times, and even on an LTE cellular link trace. We also observe that BBR performs better than several other TCP congestion control algorithms also included in the Linux kernel.
In this report, we attempt to validate the experimental results of “BBR: Congestion-based Congestion Control”. We start by looking at the goals, motivations and results from the BBR paper.
Goals: What was the original paper trying to solve?
The original paper was trying to find a congestion control approach that would stay as close as possible to optimal network operating point in various network conditions. The network link is operating at the optimal point when bandwidth utilization is maximized and latency and loss are minimized, as shown by Leonard Kleinrock in 1979.
Loss-based congestion control operates by delivering full bottleneck bandwidth at the cost of high delay and frequent packet loss. Furthermore, Cardwell et. al. observed that because round trip propagation delay (RTprop) and bottleneck bandwidth (BtlBw) could be inferred from traces, measurements over time could result in an algorithm which converged with high probability to the optimal operating point. BBR is an algorithm that seeks to converge to this optimal point.
Motivation: Why is the problem important/interesting?
At a high level, the Internet could be moving data much more efficiently. For example, cellular users can experience delays of seconds to minutes and even multi-Gbps infrastructure may deliver data at a few Mbps when transmitting data over intercontinental distances. In addition, traditional congestion control algorithms such as CUBIC have a tough time operating efficiently when there is non-negligible packet loss in the network. This limitation comes from CUBIC’s use of packet loss as a signal for congestion, which can unnecessarily hinder throughput and lead to poor performance.
Importantly, these limitations are not fundamental, but rather are a result of particular implementation choices made in designing TCP congestion control. In the original paper, the BBR authors seek to find an alternative to loss-based congestion control, in an effort to address many of these issues.
The problem is particularly interesting because, if successful, we will be able to utilize the network infrastructure we have much more effectively, while reducing latency experienced by users, without needing to upgrade or change the underlying network itself.
Results: What did the original authors show?
The BBR paper describes a new form of congestion control that is based on actual congestion in the network. The insight from the authors is that their approach estimates bottleneck bandwidth and the minimum propagation delay so as to be able to get as close as possible to the ideal operating point of maximum bandwidth and minimal latency. We highlight a few of the key results below. Please refer to the original paper for more details.
The first result is that BBR can quickly adapt to changes in bottleneck link bandwidth. During operation, BBR spends most of its time continuously probing if more bandwidth has become available. Figure 3 on the left (taken from the original paper) shows BBR reacting to a 2× increase and a 2× decrease to bottleneck bandwidth within about 2 seconds.
The second result is that BBR is able to avoid bloating router buffers unnecessarily and thus is able to keep round trip time (RTT) at a minimum. This is in contrast to CUBIC which continues to bloat the bottleneck buffer until it observes a packet loss event arising from overflowing that buffer. Figure 5 (shown in the top right of the gallery) of the original paper illustrates how CUBIC and BBR behaves in this regard.
Third, and one of the most interesting results, is that BBR outperforms CUBIC for non-negligible loss rates. CUBIC is the default TCP congestion control algorithm used in Linux. Figure 8 of the original paper shows the performance comparison between BBR and CUBIC for varying loss rates.
The authors report that Google has had a very good experience deploying BBR on B4—Google’s datacenter-to-datacenter, high-speed Wide Area Network built primarily from commodity switches that have shallow/small memory buffers. The limited amount of buffering means that there is the occasional packet loss when there are bursts of incoming packets that arrive at the same time. After switching to BBR from CUBIC, Google saw throughput improvements ranging from 2× to as high as 25×.
Finally, the authors acknowledge that there is still work to be done in terms of how BBR interoperates with existing loss-based congestion controls algorithms such as CUBIC. In cases where routers have large buffers, CUBIC senders can drown out BBR. Gracefully mitigating this is an ongoing area of research. For more BBR details and other results, we refer you to the original paper.
Reproducing BBR in Lossy Networks
Our goal for this report was to reproduce the comparison between CUBIC and BBR and how their performance varies across different loss rates (as shown as Figure 8 of the original paper).
In addition to verifying the behavior of BBR compared to CUBIC on a bottleneck link with various loss rates, we also explore the effects of round trip time and bottleneck bandwidth on their performance. We further augment the results of Figure 8 by including comparisons with several other TCP congestion control algorithms, and finally, show that BBR continues to perform better than CUBIC when running on a cellular link trace from a Verizon LTE network.
We chose to reproduce Figure 8 because it is one of the primary results of the paper, and is a large contributing factor of BBR’s ability to better utilize network infrastructure (as seen in Google’s B4 deployment). It is also one of the stronger claims of the paper that clearly shows how BBR differs significantly from the behavior of CUBIC, a loss-based congestion control algorithm and the current default in Linux.
In this section, we describe our experimental setup and the results of our work to reproduce the original figure, and the results from our additional exploration.
Experimental Settings and Platform
For our experiments, we used a virtual machine with Ubuntu 16.04 LTS and v4.11.1 of the Linux kernel, which includes the BBR pluggable kernel module. We then used Mahimahi (version 0.96 at time of publication) as a network emulator to allow us to control bottleneck bandwidths, round trip times, and loss rates. We use a custom Python TCP server and client to create network traffic. The source code for these experiments can be found online at https://github.com/jervisfm/rebbr. In order to make our results also reproducible, we ran on a standard Google Cloud image of Ubuntu 16.04, though the same experiment can be performed on Amazon AWS or a local VM.
For all of the following experiments, we use Mahimahi to emulate a bottleneck link. The original paper didn’t explicitly specify the network buffer sizes used in their experiments. We tried both an infinite router buffer and a buffer sized according to RTT * Bottleneck Link Bandwidth with a 15KB floor and found no differences. So, for simplicity our setup uses an infinite sized buffer on the link. In addition, we increase the maximum receive and send window sizes of TCP up to 6.25MB, to support experiments with high bandwidth delay products (in our case, 100Mbps with an RTT of 500ms).
The duration of the flows we evaluate varies between 30 seconds to 120 seconds and this will be explicitly mentioned in each of the sections below.
Replicating Figure 8
We first replicate the original results of the paper. Following the specifications in the paper, we emulated a 100Mbps bottleneck link with a 100ms RTT and run 60 second flows with 0.001 to 50 percent random packet loss. Our results are shown in the figure below. Like the original paper, we also include the line representing the ideal throughput as the link rate multiplied by the fraction delivered (= 1 – LossRate).
We found that our results agreed with the published figure, with some minor differences in BBR’s performance. For example, note that like the published figure, we find that CUBIC achieves slightly better throughput than BBR for extremely low loss rates (0.001%). We asked Neal Cardwell about this and he mentioned that initial BBR implementation was kept simple and has not been fully optimized yet. Specifically, there are two possible factors: (1) the size of the initial congestion window can result in the current BBR code pacing packets one RTT later than CUBIC would (discussed in this developer thread), and (2) the current implementation of the ProbeRTT mechanism prioritized simplicity over performance, which can result in ~2% penalty in throughput because BBR spends those portions of time with a minimal number of packets in flight.
We also notice that the BBR throughput doesn’t drop off until about 45% loss, unlike the original paper where throughput dropped at 20%. Where CUBIC’s loss tolerance is a property of the algorithm, BBR’s loss tolerance is a configuration parameter. As BBR’s loss rate approaches the peak gain of ProbeBW, the probability of measuring the delivery rate of the true bottleneck bandwidth drops sharply and causes BBR to underestimate the available bandwidth. Initially, we believed that this suggested the BBR configuration parameters as seen in v4.11.1 of the Linux kernel were different than those used in the paper. However, after talking with the authors, we found that the parameters were unchanged. This discrepancy with the results is more likely caused by differences in the implementation of the loss process between mahimahi and the netem-based approach used by the authors.
As a note, we found that enlarging the default maximum values of the receive and send windows in Linux was key to more closely replicating this figure. The default limits in Linux would result in BBR only utilizing about 80% of the available bandwidth at low loss rates.
Comparing BBR to Other TCP Congestion Control Algorithms
Next, we augmented the original Figure 8 by also evaluating several other TCP congestion control algorithms included in Linux. The congestion control algorithms we looked at were BIC, CUBIC, RENO, VEGAS and WESTWOOD. We kept the experimental setup the same—a 100Mbps link with 100ms RTT—and changed the congestion control algorithm used for the socket. For this experiment, we measured 30 second flows. The results of this comparison are available in the figure below:
In all instances, for non-negligible loss rates, BBR outperforms the other available congestion control algorithms in our experiment. The runner ups are BIC and WESTWOOD. BIC is designed for high latency, high bandwidth network and similarly WESTWOOD is designed to better handle networks with high bandwidth delay products of which our experimental setup somewhat resembles. Even so, none perform as well as BBR in utilizing available throughput for non-negligible loss rates.
Evaluating the Effect of Bottleneck Bandwidth and Round Trip Time
We then continued to explore the effect of bottleneck bandwidth and round trip time on the behavior of BBR and CUBIC over a lossy link. As before, we keep the same experimental setup and we only vary the bottleneck bandwidth across several orders of magnitude for both BBR and CUBIC. The timing duration for traces used was 30 seconds. We plot the normalized goodput (the average throughput achieved with respect to the bottleneck bandwidth) and show the results below.
The main trend we notice is that the difference in performance of BBR compared to CUBIC becomes smaller as the bottleneck bandwidth decreases. This trend is expected and can be explained by the fact that in order to maximize throughput, loss-based congestion control algorithms require the loss rate to be less than the inverse square of the bandwidth delay product. Thus, as the bandwidth is reduced, the amount of loss that loss-based algorithms can tolerate increases. We observe that BBR still outperforms CUBIC on most bandwidth settings. At the lowest speed of 0.01Mbps (10Kbps), we find CUBIC and BBR have very similar performance.
Next, we hold bottleneck bandwidth constant at 100Mbps and vary RTT. For this experiment, we measured 120 second flows. Note that the larger RTTs experience spend longer in start-up phases, which lowered their average bandwidth compared to if we had run longer flows. 120 second flows were selected so that reproducing the figure could be done in a reasonable amount of time. We would expect that BBR for all RTTs would converge to a similar curve if we measured flows that were several minutes long and took averages of the results.
Once again, we see that as the Bandwidth Delay Product decreased, CUBIC could tolerate higher loss rates when comparing it to itself. For example, at an RTT of 2ms and 1% loss rate CUBIC achieved a goodput of about 25 Mbps whereas at RTT of 100ms and same 1% loss rate, it had ~1 Mbps.
Overall, we observe that BBR performed well when compared to CUBIC. For instance, at RTT of 2ms and 1% loss rate CUBIC only managed to get 25Mbps of goodput whereas BBR attained an impressive 98 Mbps, which is very close to the link capacity.
BBR vs. CUBIC on a Cellular Link
While all the previous experiments hold RTT, bottleneck bandwidth, or both constant, for this experiment we utilized a 140 second Verizon LTE cellular link trace in which both per-packet delay and throughput vary over time. The source of the trace came from the Mahimahi network simulation tool. In the gallery below, the variation of throughput for the captured trace is shown in a purple background in the left diagram. The per-packet delay over time can be seen on the right.
The performance of CUBIC and BBR over this cellular trace is shown below.
Once again, we see that BBR consistently outperforms CUBIC when the loss rate increases. More research has been performed by Chung et. al. in Netdev comparing CUBIC and BBR over an LTE network, where they find that BBR yields higher throughputs as well.
We highlight some of the most significant challenges we faced in reproducing the results of the BBR paper in this section.
One challenge we had when reproducing these results, stemmed from the default send and receive buffer maximums set in Ubuntu. For example, in our initial RTT experiment, our figure looked like this:
Note that as RTT increased, throughput was significantly decreased, even at low loss rates. We found that this stemmed from the fact that by default, Linux sets hard limits on the maximum receive window and send window size. Thus, even though the bandwidth delay product increased when we increased RTT, the sender and receiver could not further expand their windows to keep enough datagrams in flight to fill the throughput. Adjusting these parameters—or TCP tuning—is often needed for high-bandwidth, high-latency networks. We thank Neal Cardwell for suggesting this receive window limitation as being a potential issue to look at and resolving it made our figures much better.
Another challenge was that we were seeing zero throughput for low bandwidth emulated links (e.g. 100Kbps) while initially running various link speeds experiment. We first thought that this might be due to an issue with our client/server setup however the zero throughput problem also happened in other tools such as trying to wget a file from the internet through the setup Mahimahi environment. This pointed to a configuration issue within Mahimahi. After tweaking options to mm-link program, we found that we were naively sizing the buffer queue size according to RTT * Bottleneck Bandwidth and this turned out to be too small a number when Bottleneck bandwidth value was also small, since we were sending fixed size packets. The solution was to switch to using larger buffers on the bottleneck link. For simplicity, we used the default infinite buffers after testing several larger buffer sizes and noticing no significant differences in the results.
Yet another challenge we faced was dealing with deadlock situations in our client/server setup. On some instances, we had observed that our experiment would intermittently hang and not make any progress. We are able to trace it down to a race between the client and server startup interaction. This race meant that it was possible for the server to wait for a connection from a client that had already tried to connect and exited. The fix was to verify that the server process had started and was listening before starting the client process.
Reproducing Our Results
The steps to reproduce our results are available in the Github repository. The entire experiment process takes approximately 8.5 hours to run. Should you encounter issues, please feel free to contact us by opening a Github Issue on the repository.
This is the direct URL to the README instructions: https://github.com/jervisfm/rebbr#step-by-step-instructions
Our experiments support the claim of the original BBR paper that BBR does a much better job than CUBIC of dealing with higher loss rates across network links. We also found that this performance advantage holds true across a variation of bottleneck bandwidth (100Mbps to 10Kbps), various RTTs (2ms to 500ms) and even when tested on a link trace captured from the Verizon cellular network. Thus for managed networks with non-negligible loss rates, there is some significant efficiency gains to be had from switching to using BBR as the congestion control algorithm.
One area for future work is examining how much of this performance advantage holds in situations where there are multiple non-BBR flows sharing a bottleneck link.
We would like to sincerely thank the BBR authors (Neal Cadwell et al.) for spending time giving us feedback on early results and giving suggestions for issues that might explain some of the behavior we were seeing. In particular, the suggestion to ensure we’re not receive window limited was useful in resolving a number of issues as described in the challenges section. We would also like to thank the CS244 course staff for a great quarter, facilitating interactions with experts from the industry and allowing us to replicate a completely new research result. We also thank Google for generously donating Google Cloud Credits that enabled us to easily run multi-hour experiments.