CS244 ’13: Improving Datacenter Performance and Robustness with Multipath TCP


Team: Ryan Wilson (rtwilson@stanford.edu), Brian O’Connor (bocon@stanford.edu)

Sources:

  1. C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and M. Handley. Improving Datacenter Performance and Robustness with Multipath TCP. In Proceedings of SIGCOMM 2011.
  2. D. Wischik, C. Raiciu, A. Greenhalgh, and M. Handley. Design, Implementation and Evaluation of Congestion Control for Multipath TCP. In Proceedings of USENIX NSDI 2011.

Goals

In a densely connected network, like a data center, there are often multiple paths that a packet can take to reach its destination. Furthermore, in data centers it is fairly common to oversubscribe the links of aggregation switches and routers in excess of 10:1 to save money. This leads to scenarios where the links high up in the topology hierarchy become bottlenecks, resulting in hot spots. Even in topologies that are not oversubscribed, hot spots can still occur because TCP flows are generally not perfectly distributed among the available paths.

The authors cite several TCP variants, including Uncoupled TCP (which is not fair to regular TCP flows), Equally-Weighted TCP, and PacketScatter (both of which assume that all links are equal). The authors of [1] present Multipath TCP (MPTCP) as an alternative that treats paths differently and distinctly while maintaining TCP fairness. This ensures that load is evenly distributed regardless of the traffic pattern.

Motivation

Multipath TCP (MPTCP) presents the abstraction of a single TCP connection while splitting traffic across multiple subflows. When combined with ECMP routing, these subflows can be distributed over multiple network paths. MPTCP should increase throughput and decrease congestion by shifting traffic away from hot spots, adjusting the rate used for each subflow. It should also remain fair to regular TCP flows by maintaining an aggregate view of its subflows’ congestion windows.
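To make that last point concrete, the coupled ("linked increases") congestion controller described in [2] (and later specified in RFC 6356) caps the aggregate window growth of all subflows so the whole connection is no more aggressive than a single TCP flow on its best path. The sketch below is our own illustration of that window-update rule, not the authors' code; the variable names and per-ACK structure are assumptions.

# Illustration of MPTCP's coupled (linked-increases) congestion avoidance.
# subflows is a list of (cwnd, rtt) pairs, with cwnd in packets and rtt in seconds.

def alpha(subflows):
    # Aggressiveness parameter from [2] / RFC 6356.
    cwnd_total = sum(cwnd for cwnd, _ in subflows)
    best = max(cwnd / (rtt ** 2) for cwnd, rtt in subflows)
    denom = sum(cwnd / rtt for cwnd, rtt in subflows) ** 2
    return cwnd_total * best / denom

def on_ack(subflows, r):
    # Congestion-avoidance increase for subflow r on receipt of one ACK.
    cwnd_r, rtt_r = subflows[r]
    cwnd_total = sum(cwnd for cwnd, _ in subflows)
    # The increase is coupled across subflows via alpha/cwnd_total, and is
    # capped at what an uncoupled TCP flow would do (1/cwnd_r).
    increase = min(alpha(subflows) / cwnd_total, 1.0 / cwnd_r)
    subflows[r] = (cwnd_r + increase, rtt_r)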

This problem is particularly interesting for datacenters where worst-case latency is critical to application performance. Hot spots result in dropped packets or queueing delays that translate to long request/response latencies. This is an important problem because eliminating hot spots allows data center operators to run the links with higher utilization while improving overall throughput and reducing latency. For a business, this can directly translate to lower costs and higher revenues.

Original Results

By using multiple subflows that explore more paths than a single TCP flow would, MPTCP allows data centers to obtain better utilization and higher throughput under certain network loads. The authors of [1] found that on a FatTree network with a workload that should perfectly utilize the topology (one-to-one), MPTCP throughput approaches 100%, while throughput using TCP is less than 50%.

Figure 1

The authors use packet- and flow-level simulators to test TCP versus MPTCP on a k = 8 FatTree topology. The packet-level simulator is indicated by the gray bars and the corresponding flow-level simulator by the black bars. A one-to-one workload is tested over the topology, meaning each host sends one flow and receives one flow (the host each sends to is not necessarily the host it receives from). The throughput for each client-server connection increases roughly logarithmically with the number of subflows, leveling off around 95% at 8 subflows. Furthermore, MPTCP also improves fairness among flows: all flows achieve at least 70% throughput, and more than half achieve over 90%. With TCP, on the other hand, some flows achieve only 10% throughput while very few achieve near 100%. The authors report similar results for a k = 32 FatTree with 8192 hosts (using the flow-level simulator), and demonstrate improvements for BCube and VL2 topologies as well.

Our Goals

Our experiment aims to verify MPTCP performance on several datacenter topologies and different workloads as claimed in [1], specifically for the FatTree topology on a one-to-one workload. Of particular interest is comparing percentage throughput with the number of MPTCP subflows, as well as evaluating fairness of flows by ranking throughput measurements. To expand on the results, we hope to measure other types of workloads such as one-to-several and all-to-all to fully evaluate the range of MPTCP’s effectiveness.

Our Motivation

The FatTree topology is the first one that the authors explore, and it seems like a good starting point for a number of reasons. First, FatTree uses commodity switches and links, which are well emulated in our testbed. It is well studied and used in production networks, so it is a relevant topology to experiment on. It is not oversubscribed in the higher layers of the topology, so we should be able to achieve maximum throughput with a one-to-one workload; however, FatTree uses many links instead of fatter links to avoid oversubscription, which makes this experiment a good test of MPTCP's ability to pool links and move congestion away from hot spots. Also, the FatTree topology has been tested in Mininet before, and we had access to a controller that supported ECMP routing on it.

Our Results

We run TCP and MPTCP with a varying number of subflows over FatTree topologies (k = 6, 54 hosts and k = 8, 128 hosts) with a randomized one-to-one workload and generate the following results:

Figure 2

Figure 3

We are using the network emulator Mininet along with a custom POX controller for ECMP hashed routing. The maximum throughput is benchmarked using a single TCP flow running over an uncongested network. We should note that all of our host “throughput” measurements are technically goodput, since goodput is easier to measure with iperf. These experiments were run for 40 seconds with a customized iperf that delays sending to avoid congestion during TCP startup, so that all handshakes can complete successfully. Further, we set the link bandwidth to 1 Mbps and switch buffer sizes to 100 packets.
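For reference, the snippet below is a minimal sketch of how these parameters can be applied when building a Mininet network over the example FatTree topology. It assumes Mininet 2.0's TCLink and the FatTreeTopo class from the ported dctopo.py mentioned later; the exact constructor arguments and class names in our repository may differ.

# Minimal sketch (not our full test harness): build a k=4 FatTree in Mininet
# with 1 Mbps links and 100-packet queues, controlled by RipL-POX over OpenFlow.
from mininet.net import Mininet
from mininet.node import RemoteController
from mininet.link import TCLink
from dctopo import FatTreeTopo   # ported from the Mininet 1.0 examples

topo = FatTreeTopo(k=4)
net = Mininet(topo=topo,
              link=TCLink,                  # enables bw/queue shaping via tc
              controller=RemoteController)  # RipL-POX provides ECMP routing

# Apply the experiment's link parameters: 1 Mbps links, 100-packet buffers.
# (In practice these can also be passed per-link through the topology's addLink() calls.)
for link in net.links:
    for intf in (link.intf1, link.intf2):
        intf.config(bw=1, max_queue_size=100)

net.start()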

The average throughput percentages we measured are consistent with the authors’ results; they follow the same trend and differ by only a few percentage points. The small difference may be due to the fact that we measure goodput at the application level instead of throughput at the network level. Our rank-of-flows curves also coincide nicely with the authors’ results.

Due to the challenges we initially encountered reproducing these results, we measured several conditions in our Mininet network, such as link utilization, average RTT, switch buffer size, and CPU utilization, to help us understand the dynamics of the experiment. We measured the link throughput for every switch interface in our topology, and present the rank of link utilizations (where 100% is the link capacity).

Figure 4

We can see that as we increase the number of subflows, the utilization generally increases across all links. While this isn’t a result that the authors explicitly show, we think that this result demonstrates that MPTCP actively tries to move congestion from more congested links to less congested ones. In the 8 subflow case, we can see that more than 95% of links have utilization above 50%, while in the TCP case, about a third of the links are not used at all. This result supports the paper’s hypothesis that MPTCP increases throughput and improves fairness.
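As an illustration of how such a measurement can be taken, the sketch below polls per-interface transmit byte counters from /sys/class/net and converts the delta into a utilization percentage. This is a simplified stand-in for our monitoring scripts; the interface name and the 1 Mbps capacity are assumptions matching our setup.

# Sketch: estimate link utilization by sampling an interface's TX byte counter.
import time

LINK_CAPACITY_BPS = 1000000   # 1 Mbps links in our experiments

def tx_bytes(intf):
    with open('/sys/class/net/%s/statistics/tx_bytes' % intf) as f:
        return int(f.read())

def utilization(intf, interval=1.0):
    start = tx_bytes(intf)
    time.sleep(interval)
    end = tx_bytes(intf)
    bits_per_sec = (end - start) * 8 / interval
    return 100.0 * bits_per_sec / LINK_CAPACITY_BPS

# Example: print the utilization of a (hypothetical) switch port.
print(utilization('s1-eth1'))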

Challenges

For their experiments, the authors specified the link bandwidth (100 Mbps) and the number of hosts in their FatTree topology, but did not specify the test duration, switch buffer sizes, or host parameters such as CPU, memory, and host packet buffers. Due to both the lack of detailed information about the software simulators used in their experiments and the limitations of Mininet for ours, we had to tune these parameters ourselves for reproducibility. We set a test time of 40 seconds, decreased the link bandwidth from 100 Mbps to 1 Mbps, and were only able to test up to a k = 12 FatTree as opposed to k = 32. (Note: the authors only present packet-level simulation results for a k = 8 FatTree; their k = 32 results come from the flow-level simulator.)

When we began running our experiments, we used a large FatTree, large queue buffers, and link capacities comparable to a modern data center, and we experienced unexpected and erratic results. We realized we needed to scale down the topology and measure various conditions in our network to pinpoint why we could not reproduce the results. We analyzed the effects of CPU utilization, RTT, link bandwidth, and queue size.

CPU Utilization

The following graph shows CPU utilization for various FatTree topologies running with a varying number of MPTCP subflows on an Amazon EC2 c1.medium instance. Note that the CPU is fully utilized for the k = 10 and k = 12 FatTree topologies, and also for a k = 8 FatTree with 7 MPTCP subflows.

Figure 5

In the corresponding throughput graph (Figure 6), generated for the k = 8 FatTree during the same test, throughput unexpectedly dips in the 7-subflow case, supporting the notion that maxing out the CPU lowers flow throughput.

Figure 6

As for fairness among flows, the 8-subflow case is still fairly even, but a few flows suffer as a result of the high CPU load. Figure 5 also shows that increasing the number of subflows increases CPU utilization, which mirrors the authors’ finding in [1].

We also include the corresponding chart for the k = 10 FatTree (Figure 7). It shows an unexpected reduction in throughput as the number of MPTCP subflows increases; both fairness and utilization are affected by the CPU bottleneck.

Figure 7

LESSON LEARNED: The topology must be scaled to the limitations of Mininet and the host system.

Bandwidth

The next major obstacle we encountered was scaling the link bandwidth appropriately. A known limitation of Mininet is that the aggregate bandwidth of emulated traffic should not exceed roughly the clock rate of the processor running Mininet. In our case, the EC2 instance uses a 2 GHz processor, so the aggregate throughput for our experiment should not exceed about 2 Gbps. A k-ary FatTree has k^3/4 hosts, so the one-to-one workload generates k^3/4 flows, each of which crosses a sending and a receiving host link, placing k^3/2 link-loads of traffic on the network in aggregate. For example, with a k = 4 FatTree and 1 Gbps links, the one-to-one workload can push 32 Gbps in aggregate (16 flows, each saturating a 1 Gbps host link at the sender and another at the receiver).

In general, for the one-to-one workload we want the link rate (in bits per second) to be less than twice the processor speed (in cycles per second) divided by k^3. By scaling the link rate down to 1 Mbps, we should be able to run up to a k = 16 FatTree. (Note: even a k = 4 FatTree would require a link speed of less than 100 Mbps for this experiment.)
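This back-of-the-envelope rule can be written down directly; the sketch below is just that arithmetic (the 2 GHz figure is the assumed clock rate of our c1.medium instance, and the factor of k^3/2 counts each flow's sending and receiving host link).

# Rule of thumb: aggregate emulated traffic <= CPU clock rate (in bps).
# One-to-one workload on a k-ary FatTree: k^3/4 flows, k^3/2 loaded host links.
CPU_HZ = 2e9   # ~2 GHz c1.medium instance

def max_link_rate_bps(k, cpu_hz=CPU_HZ):
    loaded_links = k ** 3 / 2.0
    return cpu_hz / loaded_links        # equivalent to 2 * cpu_hz / k^3

for k in (4, 8, 16):
    print('k=%2d: max link rate ~ %.1f Mbps' % (k, max_link_rate_bps(k) / 1e6))
# k= 4: ~62.5 Mbps, k= 8: ~7.8 Mbps, k=16: ~1.0 Mbps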

For completeness, we tested a k = 4 FatTree with 1 Gbps links and were unable to achieve the theoretical throughput. Here are our results:

Figure 8

The aggregate capacity required to achieve 100% throughput in this test is 32 Gbps. Interestingly, TCP empirically achieves higher throughput than MPTCP in this scenario. Two factors are at work here. First, the experiment is CPU constrained: as we have already seen, and as the authors show in [1], MPTCP increases the CPU load on the hosts, and because CPU resources are shared between switching and host processes, few cycles are left for packet forwarding with MPTCP enabled. Second, with fewer subflows the aggregate bandwidth used is smaller, which keeps the experiment closer to Mininet’s limits and leaves more CPU cycles for switching.

LESSON LEARNED: Links must be scaled such that the aggregate capacity stays within Mininet’s limits on the host system.

Queue Size

We then had to set switch queue sizes to appropriate values based on our topology, workload, and link bandwidth. By default, switch buffers are unbounded in Mininet. TCP tries to fill queue buffers and uses dropped packets as its congestion signal. Because the emulated links are highly reliable and no packets are dropped from an unbounded queue, TCP receives no congestion notification; it will keep filling the buffers until the RTT exceeds the retransmission timeout, which leads to terrible performance. MPTCP suffers similarly, since it also relies on packet drops to detect congestion.

In our case, this effect is exacerbated by our slow link speeds. We found that the propagation delay in the emulated network is around 80 microseconds, while at 1 Mbps the queueing delay is 12 milliseconds per packet in the queue (with 1 Gbps links it would be only 12 microseconds per packet). The per-packet queueing delay is therefore about two orders of magnitude higher than the propagation delay, so total delay is completely dominated by queueing delay. This property is specific to our setup and not generally the case in modern data centers. Standard buffer-sizing rules of thumb would suggest queues of only a few packets, which results in high loss rates as TCP struggles to grow its window. We empirically determined that a queue size of roughly 50 to 100 packets limits RTT while still allowing congestion windows to grow.
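The numbers above follow directly from the link parameters; the short calculation below reproduces them (the 1500-byte packet size is an assumption about the MTU).

# Per-packet transmission (queueing) delay and its contribution to latency.
PKT_BITS = 1500 * 8          # assume 1500-byte packets

def per_packet_delay_s(link_bps):
    return PKT_BITS / float(link_bps)

print(per_packet_delay_s(1e6))    # 1 Mbps -> 0.012 s   (12 ms per queued packet)
print(per_packet_delay_s(1e9))    # 1 Gbps -> 1.2e-05 s (12 us per queued packet)

# A full 100-packet queue at 1 Mbps adds ~1.2 s of one-way queueing delay,
# versus ~80 us of propagation delay in our emulated network.
print(100 * per_packet_delay_s(1e6))   # -> 1.2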

Here is a test using unbounded queues and 1 Mbps links:

Figure 9

We also monitored the queue sizes for each interface over the course of the experiment. The results for TCP are plotted below:

Figure 10

Each line in Figure 10 shows the queue size for one of the interfaces. As a general trend, we can see that the number of packets in the queues increases over the course of the experiment. The experiment is 40 seconds long, and we don’t see some queue sizes begin to decrease until after iperf stops sending packets. In the worst case, a queue grows to around 450 packets, making the one-way latency over 5 seconds; for a data center, this is terrible.
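For reference, per-interface queue occupancy like that shown in Figure 10 can be sampled from the tc queueing-discipline statistics; the sketch below is a simplified version of this kind of monitor, not our exact script (the interface name and polling interval are placeholders).

# Sketch: poll the backlog of an interface's tc qdisc to track queue occupancy.
import re, subprocess, time

def queue_backlog_pkts(intf):
    out = subprocess.check_output(['tc', '-s', 'qdisc', 'show', 'dev', intf])
    # tc prints a line like: "backlog 151500b 101p requeues 0"
    match = re.search(r'backlog\s+\S+\s+(\d+)p', out.decode())
    return int(match.group(1)) if match else 0

# Sample every half second for the 40-second experiment.
for _ in range(80):
    print(time.time(), queue_backlog_pkts('s1-eth1'))
    time.sleep(0.5)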

LESSON LEARNED: Queue sizes need to be tuned based on link speed and delay.

RTT

In conjunction with determining how to size our queues, we measured RTT on the FatTree topology for a varying number of MPTCP subflows. However, the RTT measurements proved to be erratic, most likely due to fluctuating queue occupancy amplified by the slow links. As stated in the previous section, queueing delay dominates in our experiment, so this is an expected result. Still, we were surprised by the variance of RTT within each test, given that the workload had reached equilibrium during the test.

Here is a plot of the average RTT between each flow pair in the workload (error bars depict one standard deviation):

Figure 11

We also examined the length of each queue over the course of each subflow experiment.

Figure 12

In this case, the queue size limit was 100 packets and MPTCP was enabled with 8 subflows. Queue length varies quite dramatically over the course of the experiment (each line corresponds to a different queue), which helps explain the RTT variance. There are other dynamics at work here that we did not have time to fully diagnose and explain.

Other Workloads

In addition to the one-to-one workload, we also wrote one-to-several and all-to-all workload simulations. The one-to-several workload was run on a k = 4 FatTree topology where each host runs an iperf server that is connected to 4 clients (4 flows per server). Here are the results:

Figure 13

Since each host link acts as a bottleneck and 4 flows compete for the same link, each flow can achieve a maximum throughput of 25%. To fully benefit from multiple flows to a host, we would need a topology like the dual-homed FatTree, where each host has 2 network interfaces and can therefore serve 2 flows at full rate. TCP achieves suboptimal performance at around 19% throughput, while MPTCP with 6 subflows achieves moderately better throughput despite the topology being ill-suited to the workload. To fully explore the benefits of MPTCP, we would like to see future experiments test different workloads on topologies suited to their specific needs.
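To give a flavor of how such a workload is generated, the sketch below launches iperf servers and clients between Mininet hosts in a one-to-several pattern. It is a simplified stand-in for our workloads.py; the host selection, port number, and 40-second duration mirror our setup, but the helper itself is hypothetical.

# Sketch: one-to-several workload -- each host runs an iperf server
# and receives flows from several (here 4) randomly chosen clients.
import random

DURATION = 40          # seconds, matching our experiments
FLOWS_PER_SERVER = 4

def one_to_several(net, port=5001):
    hosts = net.hosts
    procs = []
    for server in hosts:
        server.popen('iperf -s -p %d' % port)
        clients = random.sample([h for h in hosts if h != server],
                                FLOWS_PER_SERVER)
        for client in clients:
            procs.append(client.popen(
                'iperf -c %s -p %d -t %d' % (server.IP(), port, DURATION)))
    # Wait for all client flows to finish.
    for p in procs:
        p.wait()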

Implementation Challenges

The first challenge we faced was creating a Linux image with Mininet 2.0 and a patched MPTCP-enabled kernel. We tried starting with the base image (Ubuntu 11.10) provided for CS244 Spring 2012 students, but ran into compatibility problems running Mininet 2.0 and newer versions of Open vSwitch. To solve this dilemma, we started fresh with an Ubuntu 12.10 image with Mininet 2.0, and we patched the kernel to enable MPTCP. We have scripted the process and made instructions available here: https://github.com/bocon13/mptcp_setup

Because we decided to use Mininet 2.0, we had the small task of porting the example FatTree topology from Mininet 1.0, which can be found in dctopo.py.

Our next task was to run an ECMP controller that supports the FatTree. We used RipL-POX, a controller graciously provided by Brandon Heller. However, we ran into a few problems, which we worked with Brandon to debug. These issues related to hashing packets in the controller, as well as version mismatches (we were using the betta branch of POX, which was not compatible with RipL-POX; the master branch works just fine).

After writing the testing framework, we discovered that CPULimitedHost objects cannot run popen() on nodes with non-alphanumeric names due to Mininet’s name validation during process group creation. The example FatTree topology from the Mininet code produces node names of the form x_x_x, even though Ubuntu allows underscores and other non-alphanumeric characters in process group names. This is an issue that should be fixed in Mininet; we avoid it by using Host objects instead, as sketched below.
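A minimal sketch of the workaround (assuming the same FatTreeTopo as above):

# Workaround sketch: use Host instead of CPULimitedHost so that popen()
# works with the underscore-named nodes produced by FatTreeTopo.
from mininet.net import Mininet
from mininet.node import Host, RemoteController
from mininet.link import TCLink
from dctopo import FatTreeTopo

net = Mininet(topo=FatTreeTopo(k=4),
              host=Host,               # plain hosts: no CPU isolation, no naming issue
              link=TCLink,
              controller=RemoteController)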

After measuring RTT using ping, we discovered that the RTT was wildly inconsistent for all topologies and numbers of subflows, as displayed earlier. One possible issue is that the reactive RipL-POX controller hashes ICMP packets to 0, while TCP and MPTCP packets hash to other values. This means ping packets can take different routes than their MPTCP flow counterparts, which may lead to unrepresentative measurements.

We also ran into a problem where we could not kill our POX controller between runs; it was started using Popen in Python. We discovered that although kill() was killing the shell process used to spawn the POX controller, it was not killing the controller itself. Running the controller with “exec” causes it to replace the shell process, so it can then be killed using kill().
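A minimal sketch of this fix (the controller path and arguments are placeholders for the ones in our scripts):

# Sketch: start POX through a shell, but have it *replace* the shell via exec,
# so that proc.kill() terminates the controller itself.
import subprocess

proc = subprocess.Popen(
    'exec ./pox.py riplpox.riplpox --topo=ft,4 --routing=hashed',
    shell=True)

# ... run the experiment ...

proc.kill()    # now actually kills POX, not just the intermediate shell
proc.wait()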

Critique

The main thesis, that MPTCP increases throughput on a FatTree topology, held up well, though our results were a few percentage points lower than those stated in the paper. This could be because we measured goodput from iperf instead of network throughput, or it could stem from the nature of the authors’ packet and flow simulators and the parameters set in their experiments. We also note that the authors may have decreased the RTO for TCP and MPTCP to allow faster retransmission in a data center setting, which would allow switch queues to be smaller, providing lower delay and higher throughput.

Extensions

As mentioned above, we measured several conditions in our Mininet network such as link utilization, average RTT, switch buffer size and CPU utilization. We tested one-to-several and all-to-all workloads, but needed more time to do a thorough analysis and visualization of our findings.

Platform

We ran our experiments on Amazon EC2 with an Ubuntu 12.10 image and compiled MPTCP into the stock 3.5 kernel. We used Mininet 2.0 to emulate our topology and hosts, and RipL-POX as our OpenFlow controller to perform ECMP-hashed routing on the FatTree. Instead of the stock iperf for TCP bandwidth testing, we used a patched version of iperf that increases the maximum number of concurrent TCP connections and waits 5 seconds before sending data, giving all iperf processes the opportunity to complete their three-way handshakes.

The authors used a packet-level simulator for their smaller FatTree (k = 8), and we have found Mininet to be a good packet-level stand-in for it. Furthermore, EC2 provides a uniform testbed for others to reproduce our results.

Reproducing our Results

We have made our code and complete instructions available here: https://github.com/bocon13/datacenter_mptcp

We recommend using our MPTCP-enabled AMI because it will save some time that you would spend patching the kernel. You’ll just need to clone our repository, pull the submodules and run our start script.

Please note: Reproducing all of the graphs, specifically graphs for large FatTree topologies, will take a long time (at least 8 hours). You will find that there is little value to the results outside of the reproducible envelope described in our results and challenges sections, and these graphs are fairly inconsistent from run to run.

On the other hand, you can run a subset of our tests, which should produce some reasonable results.

If you are running a c1.medium instance, try a FatTree with k = 4 or 6 and a one-to-one workload. If you are running a c1.xlarge instance, you can experiment with a FatTree with k = 4, 6, or 8 and a one-to-one workload. Either run should take less than 15 minutes, and you should find that your resulting plot (e.g. ft8-one_to_one-throughtput.png) resembles Figures 2 and 3 above.

You might find the following commands useful:
# runs the experiment with a k=6 FatTree
sudo ./run_mptcp.sh 6

# runs the experiment with a k=8 FatTree
sudo ./run_mptcp.sh 8

IMPORTANT: If you plan to run the same test twice (the same workload on the same topology), make sure you run this command between tests:
sudo rm -rf results/ft<size of FatTree>/<workload>

Feedback

It should be noted that the example FatTree topology does not allow a CPULimitedHost to run popen() in Mininet. This is because Mininet uses the host’s name as the name of a process group for its subprocesses, and it validates that the process group name contains only alphanumeric characters, disallowing the underscores used to name hosts in the example FatTree. Mininet’s process group validation code should be relaxed to allow more diverse process group names.


5 responses to “CS244 ’13: Improving Datacenter Performance and Robustness with Multipath TCP”

  1. The setup instructions and run scripts were clearly described and easy to use. We were able to reproduce the main results from the blog post very easily. Unfortunately we were unable to reproduce the one_to_several results (which, we add, were not part of the original paper) due to apparent race conditions in iperf. (4/5)

  2. I have set up your experiment, and while most tests work flawlessly, sometimes the simulation randomly hangs. I nailed down the problem to the call to proc.communicate() at line 82 in workloads.py.
    The fix is easy: replace proc.communicate() with proc.terminate(), with no obvious side effects as far as I can see. However, as I have not written your code, I am unsure whether it creates any subtle side effects.

  3. Can someone help me? I’m getting the following error:

    AssertionError: incompatible sizes: argument ‘height’ must be length 8 or scalar.

    I get this error when I run ./run_mptcp.sh with the defaults.

    Any suggestions? Please..
