# CS244 ’13: DCTCP Queue Sizing

Team: Daniel Perry, Sandeep Chinchali

Goal:

Data Center TCP (DCTCP) [1] is a congestion control mechanism that prevents bandwidth hungry background flows from unnecessarily delaying latency sensitive, short foreground traffic in data-centers by consistently keeping queues small. DCTCP achieves this goal by effectively curbing senders’ rates during times of congestion by means of Explicit Congestion Notification (ECN) feedback. When the queue size on a bottleneck link becomes larger than a threshold $K$, DCTCP marks packets with the ECN flag, signaling that the senders must slow down to keep queue sizes small, avoid congestion, and not delay short flows. The frequency at which multiple single-bit ECNs are received allows DCTCP to leverage multi-bit feedback about network congestion.

Motivation:

DCTCP addresses a fundamental problem inherent to data-centers, where latency sensitive short queries must be resolved within strict time deadlines despite the presence of large flows that might dominate scarce buffer space. The DCTCP paper systematically finds experimental evidence for this phenomenon in real data-center workloads and offers a congestion control algorithm to consistently keep queue sizes small.

Results:

The original authors of DCTCP devised a simple algorithm where a receiver marks packets with the ECN flag if the bottleneck link buffer size (in packets) is above a threshold $K$. The senders keep a running estimate of the percentage of marked packets to estimate the extent of congestion. They react by reducing the window size in proportion to the extent of congestion.

The authors mathematically derived the following guideline for choosing the marking threshold $K$, which we experimentally validate:

For an idealized setting of $N$ infinitely long-lived, synchronized flows competing for a bottleneck link of $C$ packets/sec with a roundtrip time of $RTT$ sec, the marking threshold $K$ should satisfy $K > \frac{RTT \times C}{7}$ packets.

Subset Goal and Motivation:

Since the choice of marking threshold $K$ is a critical parameter of DCTCP, we want to determine how robust the congestion control algorithm is to variation in threshold $K$ within an experimental setting. As such, our project proposal is to experimentally validate Equation 13 in [1] which provides a minimum bound on marking threshold $K$ to never underflow the receiver’s buffer ($K > \frac{RTT \times C}{7}$ packets).

Methodology:

In the network emulator Mininet, we created a simple star topology of two senders connected via an ECN-enabled switch to a single receiver via a bottleneck link of capacity $C$. Since our goal was not to test the robustness of DCTCP to various topologies but rather experiment with marking thresholds, such a simple star topology is justified.

Each experiment is characterized by a specific value of $RTT$, bottleneck capacity $C$, and marking threshold $K$. In each experiment, both senders maintain a sustained iperf connection to flood packets through the network to the receiver. We measure the queue size on the bottleneck link after the iperf connection has reached steady state and plot the percentage of time the queue under-runs, which occurs if the congestion control is too cautious and marks packets too early due to a small threshold $K$.

The marking threshold $K$ is implemented using Linux’s RED (Random Early Detection) queueing discipline. In RED, a real-time estimate of the instantaneous queue size is kept and if the size is below a minimum threshold ($minq$), a new packet is always enqueued while if the size is above an upper bound ($maxq$), a new packet is always dropped. If the queue size lies between the bounds of $minq$ and $maxq$, a new packet is enqueued with a probability $p$.

We wish to emulate DCTCP’s behavior, specifically the deterministic dropping of packets if the queue size is above threshold $K$, using the RED discipline. As such, we set the RED parameters as follows, adhering to inherent limitations in RED. We denote the average packet size as $avpkt=1500$ bytes. Since $minq$ and $maxq$ are measured in bytes and taking into account RED’s inability to have $minq = maxq$, we set $mina = K \times avpkt$ and $maxq = (K+1) \times avpkt$. To account for RED’s criterion that the burstiness must be at least $\frac{minq}{avpkt}$, we set $burst = K+1$ and further set probability $p=1$ to deterministically drop packets for DCTCP emulation.

Once the value of threshold $K$ is configured, the bulk of testing consisted of measuring the queue occupancy versus time for various values of $RTT$, $C$, and $K$. We denote $K_{optimal}$ = $\frac{RTT \times C}{7}$ to be the value predicted by the DCTCP authors. By changing $K$ relative to $K_{optimal}$, we can verify if indeed the queue will under-run, and to what extent. To measure the degree at which we deviate $K$ from the cited value of $K_{optimal}$, we plot results against the fraction $\frac{K}{K_{optimal}}$. Based on the DCTCP paper, we should expect the queue to under-run for $\frac{K}{K_{optimal}} < 1$ and the buffer to not overrun when $\frac{K}{K_{optimal}} > 1$.

Results:

In the first set of plots (Figure 1), we plot queue capacity vs. time for various fractions of $\frac{K}{K_{optimal}}$. As expected, the buffer under-runs when $\frac{K}{K_{optimal}} < 1$ and the extent of time the buffer is empty is longer when the fraction is reduced. Such behavior occurs because senders are prematurely warned of congestion due to the small value of $K$ and hence excessively curtail their transmission rates, leading to a loss of throughput.

We see that the buffer is relatively stable and does not under-run when $\frac{K}{K_{optimal}} \ge 2$. Though the paper derives that the buffer should not under-run when $\frac{K}{K_{optimal}} \ge 1$, the authors acknowledge that in practice the threshold may need to be a small factor times $K_{optimal}$ to avoid under-running the buffer as we see experimentally. However, our experimental results do show that the derived threshold of $K_{optimal}$ leads to the buffer rarely under-running, which corroborates the result.

Figure 1: Queue size vs. time for various thresholds $K$. For a link of $RTT = 8~ms$ and $C = 100~Mbps$, $K_{optimal} = 9.52$. We changed threshold $K$ and plot the queue occupancy over time for each $Fraction = \frac{K}{K_{optimal}}$Clearly, the queue starts to become empty for significant time periods when $K$ is much lower than the prescribed threshold of $K_{optimal}$, corresponding to $Fraction < 2$.

The second series of plots (Figure 2) show the robustness of the threshold sizing result for various values of $RTT$ and $C$. For each $RTT$, $C$ combination, denoted by a distinct color, we sweep $K$ over a range such that $\frac{K}{K_{optimal}}$ ranges between ¼ and 10. We then plot the the percentage of the total experiment duration that the buffer was empty versus the marking fraction $\frac{K}{K_{optimal}}$ for various $RTT$, $C$ combinations.

Figure 2: Robustness of Threshold Sizing Result to Various $RTT$ and $C$.

As in Figure 1, the threshold $K$ is changed relative to the prescribed result and we clearly see the occupancy tends to zero for increasing durations as $K$ becomes too small compared to $K_{optimal}$. However, for small thresholds compared to the prescribed result, higher RTTs tend to keep the buffer empty for a longer duration. Though the DCTCP authors do not show this phenomenon, it can be explained by the fact that higher RTTs lead to more latency in the DCTCP feedback mechanism. One should note that the authors do make it very clear that their results hold strictly for the data-center environment (high capacity, low RTT), higher RTTs begin to violate this assumption.

As expected, we clearly see the general trend that the buffer becomes increasingly un-occupied for longer durations as $\frac{K}{K_{optimal}}$ approaches zero. We quantify the percentage of the total experiment duration that the buffer was not empty on the y axis, calculated as $\frac{duration ~ q > 0}{total~experiment~duration}$.The previous results indicate that there exists a transition value for $\frac{K}{K_{optimal}}$ above which the queue will never under-run and below which the queue will indeed under-run. Rather than sweep $K$ over a predefined range, we implement a fine search over thresholds $K$ to determine the minimum value of $K$ such that the queue never under-runs.

Also note in the above plot that we observe constant queue occupancy for a scaling factor >= 2. To further strengthen this result we implement a procedure with strong analogies to a binary search for buffer sizing in [2]. Starting from an initial value of $K$, we measure the percentage of time the queue was nonzero and binary search from there onwards. Rather than enforce a strict criterion that the queue ‘never under-runs’, we implement a soft constraint that ‘the queue never falls to zero more than 5% of the time’ in order to tolerate measurement noise. We find the buffer never under-runs at around $\frac{K}{K_{optimal}} = 2$. Our results are representative since we used a variety of $RTT$, $C$ combinations and swept through several values of $K$ per link setting. We have omitted this procedure from the code on the image as it took hours to run. The above figure should be sufficient in reinforcing this result.

Challenges:

Conceptual challenges were minimal in this project as the authors were extremely clear about the assumptions for deriving threshold $K$. Further, the authors are clear that they assumed that flows were synchronized in the analysis. Since we clearly saw unsynchronized flows, we were able to see less variation in the buffer size and thus tolerate lower values of $K$.

Technical challenges were minimal since the installation of the DCTCP patch on Mininet has been tested before and is well-documented. The most significant technical problem was Mininet can not emulate high-bandwidth links such as 1GBps with high-fidelity, so we cannot validate representative speeds used in the paper and actual data-centers. Additional technical challenges surprisingly mainly came from plotting in python which was something we had not done before.

Critique:

Overall, our experimental validation in Mininet seems to confirm the paper’s threshold sizing result. As the threshold $K$ was brought below the optimal value predicted by the paper, the queue did under-run, resulting in losses of throughput as expected. Further, we needed a safety margin where $K$ had to be at least about 5 times the predicted threshold in order for the buffer occupancy to be stable with minimal fluctuations. This matches the claim made in the paper that $K$ should be larger than $K_{optimal}$ in practice. Overall, the results show the robustness of the threshold sizing result across a variety of $RTT$, $C$ scenarios for a representative topology.

Simulation Platform:

The Mininet network emulator was a natural choice as it allows easy configuring of RED parameters in Linux and can emulate ECN-enabled switches, both of which are integral components to DCTCP. Further, there are a suite of useful utilities for monitoring throughput and queue size which allowed us to focus on the details of validating the robustness of the threshold sizing result.

Our Mininet setup is extremely reproducible as detailed in the README. The parameters that affect reproducibility the most are the choice of $RTT$ and $C$, since these directly affect the predicted value of $K_{optimal}$. Since Mininet cannot emulate high-bandwidth links with great fidelity, we were careful in setting $C$ to a maximum of 100 Mbps.

Instructions to Reproduce:

1. Visit to create an image in the us-west-2 region. Click Launch an Instance.

2. Select Launch Instance.

3. Select Instances -> Instances in the left pane.

4. In the Create a New Instance box, use the Quick Launch Wizard. Name the instance and select a key pair.

5. In Choose a Launch Configuration, select More Amazon Machine Images.

6. Search for our custom Amazon Machine Image (AMI): CS244-Win13-DCTCP_Threshold

7. Continue and use a c1.medium or c1.large instance.

Running the experiments:

1. Run the experiment.sh script (./experiment.sh) within the pa3/ directory. This script runs the experiments and plots all figures to the plots/ directory. An explanation of all code is in the README.

References:

[1] Alizadeh, Mohammad, et al. “Data center tcp (dctcp).” ACM SIGCOMM Computer Communication Review 40.4 (2010): 63-74.

[2] Appenzeller, Guido, Isaac Keslassy, and Nick McKeown. Sizing router buffers. Vol. 34. No. 4. ACM, 2004.