CS244 ’13: pFabric: Datacenter Packet Switching


Background

pFabric is a minimalistic switch design targeting datacenter networks. Proposed in “Deconstructing Datacenter Packet Transport” (Alizadeh et al.) and “pFabric: Minimal Near-Optimal Datacenter Transport” (Alizadeh et al.), it aims to provide near-optimal performance with regard to flow completion times and flow priorities.
pFabric’s goal is to achieve very low latency and to prevent the network from becoming a bottleneck in a datacenter interconnect. Datacenter workloads often generate large numbers of short flows, while very long flows also occur at the same time, for example during data backup and recovery operations or database migration.
As part of a longer-term ongoing effort to implement the switch in hardware, we are reproducing some of the results presented in the paper using the Mininet framework. Until now pFabric has only been evaluated in simulation (using NS-2), and we see its evaluation as part of an actual network stack as an important milestone. In addition, since our software design does not depend on Mininet, it is quite simple to introduce this software switch into an actual datacenter environment for gradual integration and testing. Let us illustrate one practical use case: a software developer has to implement and test a transport protocol for a datacenter architecture that uses pFabric as its switching fabric. Up to the stage of performance tests, it is much easier to integrate with a software switch that provides convenient statistics and logging, as well as the flexibility to change the inspected packet fields that determine flow priority.

Switch Design and Implementation

We considered two main alternatives for the switch implementation:
1. User-mode packet switching emulation – with this approach we would receive tunneled IP packets from an ingress socket (or raw frames using a raw socket) and send them out again through egress sockets.
2. Packet inspection and scheduling in the kernel stack – with this approach we would introduce packet inspection and scheduling into the networking stack by loading a kernel module. This approach seemed better because it promises lower latency.

We implement the pFabric switch as a new queuing discipline (“qdisc” for short) in the Linux kernel. A good introduction to packet scheduling and queuing disciplines in Linux is provided in http://tldp.org/HOWTO/Traffic-Control-HOWTO/, and a more specific description of the components is provided in http://tldp.org/HOWTO/Traffic-Control-HOWTO/components.html. An in-depth tutorial about Linux routing and traffic control is available in Linux Advanced Routing and Traffic Control.

Since we handle all flows according to the same switching logic and only need to implement packet dropping and scheduling, pFabric is a classless qdisc with priority queues. Although flow priorities are logically separate from the network layer (IP), we piggy-back on IP’s ToS (Type of Service) field to encode the flow priority in a packet, in order to simplify implementation and integration with existing tools. The qdisc inspects this field and enqueues the packet in the queue that matches the assigned priority. To optimize performance, we avoided implementing the priority field inspection with the generic u32 filter.
It is also interesting to note that we could have stitched together a pFabric switch almost entirely from existing components, such as the u32 filter (cls_u32) for classification and the priority scheduler (sch_prio). We would then still have had to implement a policer for dropping arriving packets and evicting already enqueued low-priority packets. We chose to implement all the functionality in one module to simplify installation and avoid the need to control and configure multiple components.
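The drop-and-evict behavior described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel module's code: the class name BandedPriorityQueue, the shared packet-count buffer limit, and the choice to evict the most recently queued packet of the least important band are all hypothetical.

```python
from collections import deque

NUM_BANDS = 8  # band 0 = highest priority, band 7 = lowest (as in the post)


class BandedPriorityQueue:
    """Toy model of a classless qdisc with strict-priority bands and one shared buffer."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed limit: total number of buffered packets
        self.bands = [deque() for _ in range(NUM_BANDS)]
        self.size = 0

    def _lowest_occupied_band(self):
        # Scan from the least important band towards the most important one.
        for band in range(NUM_BANDS - 1, -1, -1):
            if self.bands[band]:
                return band
        return None

    def enqueue(self, packet, band):
        """Queue a packet; return whichever packet was dropped, or None."""
        if self.size < self.capacity:
            self.bands[band].append(packet)
            self.size += 1
            return None
        worst = self._lowest_occupied_band()
        if worst is None or band >= worst:
            # The arrival is no more important than anything buffered: drop it.
            return packet
        # Evict a buffered packet from the least important band instead.
        victim = self.bands[worst].pop()
        self.bands[band].append(packet)
        return victim

    def dequeue(self):
        """Send the oldest packet of the most important non-empty band."""
        for band in range(NUM_BANDS):
            if self.bands[band]:
                self.size -= 1
                return self.bands[band].popleft()
        return None


q = BandedPriorityQueue(capacity=2)
q.enqueue("elephant-1", 7)
q.enqueue("elephant-2", 7)
evicted = q.enqueue("mouse-1", 0)  # buffer full: a band-7 packet is evicted
first_out = q.dequeue()            # the band-0 packet leaves first
```

In this model a full buffer never costs a high-priority arrival its place as long as something less important is queued, which is the role the policer would have played in the component-based alternative.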

The current implementation is a simplified version of pFabric. First and most important, it assumes that all packets of a flow carry the same priority, whereas the original design raises a flow’s priority as its remaining size decreases. This assumption was made for simplicity, and a full evaluation of pFabric would require a change in our design: once we have found the highest-priority flow, we would have to search the buffered packets of that flow for the one that was enqueued earliest and send it, to prevent priority inversion. Second, we have a much lower priority resolution. Since we use the ToS field in the IPv4 header as the priority indicator, we are immediately limited to 256 priorities. Moreover, we use iPerf’s “-S” switch to set the flow priority, which sets the ToS through setsockopt(); in our setup this only allowed values that are multiples of 4, most likely because the two low-order bits of that byte are reserved for ECN. Since at this point we were mainly interested in demonstrating the principle, i.e. achieving better flow completion times using pFabric, we limited ourselves to 8 priority bands. A higher band number indicates a larger flow size and therefore a lower priority.
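The dequeue rule that the full design would need (find the highest-priority packet, then send the earliest buffered packet of that packet's flow) can be sketched as follows. This is a minimal illustration, not part of our module: the Packet fields and the use of remaining flow size as the priority value are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    flow_id: str
    remaining: int  # assumed priority: remaining flow size, smaller = more urgent
    arrival: int    # order in which the packet reached the switch


def dequeue(buffer: List[Packet]) -> Optional[Packet]:
    """Pick the most urgent flow, then send that flow's earliest buffered packet."""
    if not buffer:
        return None
    # Step 1: the packet with the smallest remaining size identifies the flow.
    best = min(buffer, key=lambda p: p.remaining)
    # Step 2: within that flow, send the packet that was enqueued first.
    earliest = min(
        (p for p in buffer if p.flow_id == best.flow_id),
        key=lambda p: p.arrival,
    )
    buffer.remove(earliest)
    return earliest


buffer = [
    Packet("A", remaining=3000, arrival=0),
    Packet("B", remaining=2000, arrival=1),
    Packet("A", remaining=1500, arrival=2),
]
order = [(p.flow_id, p.arrival) for p in iter(lambda: dequeue(buffer), None)]
# order == [("A", 0), ("A", 2), ("B", 1)]
```

Flow A's older packet carries a lower priority (3000) than flow B's packet (2000), yet it is sent first because flow A owns the most urgent packet in the buffer. Without the second step that older packet would wait behind flow B.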

Our queueing discipline identifies IPv4 packets and schedules them according to the pFabric scheduling scheme. Other packets are automatically assigned to band 0 in order to not interfere with other traffic types such as ICMP (ping) that’s used for Mininet connectivity assessment.
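The classification step can be modeled in user space as shown below. This is a sketch under assumptions rather than the module's actual logic: it works on a raw Ethernet frame instead of a kernel socket buffer, and the mapping from ToS byte to band (shift out the two low-order bits, then clamp to the last band) is a hypothetical choice made only to illustrate the idea.

```python
import struct

ETH_HEADER_LEN = 14      # destination MAC, source MAC, EtherType
ETHERTYPE_IPV4 = 0x0800
NUM_BANDS = 8


def classify(frame: bytes) -> int:
    """Return the priority band for an Ethernet frame; non-IPv4 traffic gets band 0."""
    if len(frame) < ETH_HEADER_LEN + 2:
        return 0
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4:
        return 0  # e.g. ARP: keep it out of the way in band 0
    if frame[ETH_HEADER_LEN] >> 4 != 4:
        return 0  # the IP version nibble is not 4
    tos = frame[ETH_HEADER_LEN + 1]  # second byte of the IPv4 header
    # Hypothetical mapping: ToS values are multiples of 4, so drop the low bits.
    return min(tos >> 2, NUM_BANDS - 1)


def make_frame(ethertype: int, first_two_payload_bytes: bytes) -> bytes:
    """Build a minimal test frame with zeroed MAC addresses."""
    return bytes(12) + struct.pack("!H", ethertype) + first_two_payload_bytes + bytes(18)


ipv4_band = classify(make_frame(ETHERTYPE_IPV4, bytes([0x45, 12])))  # ToS 12 -> band 3
arp_band = classify(make_frame(0x0806, bytes([0x00, 0x01])))         # non-IPv4 -> band 0
```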

Development

One of the main challenges in setting up this experiment was implementing and testing the Linux kernel module that provides the queuing discipline. Our kernel module was tested on a virtual machine running Ubuntu 12.04 with Linux kernel version 3.2.0-24. Basic unit tests were implemented to ensure that the qdisc functions properly. The unit-test version can be compiled by setting the TESTS flag in the Makefile and executed by running the test.sh script in the src/kernel directory. The kernel module was also tested for stability by running iperf, pinging and browsing. In addition to the kernel module, we provide a modified version of the iproute2 tools, specifically tc, which is used to attach and control queuing disciplines.

The project source can be cloned from https://bitbucket.org/ymcrcat/cs244-pa3. The modified iproute2 can be cloned from https://bitbucket.org/ymcrcat/iproute2, but there is no need to clone it separately since it is a submodule of the main project.

Evaluation Framework

We aimed to reproduce the principal result of the first pFabric paper (HotNets), which demonstrates the advantage of using pFabric in a high-bandwidth, low-latency, datacenter-like environment. We wanted to show that, in terms of flow completion times, pFabric yields results closer to optimal than other schemes. This seemed to be the main achievement emphasized by the authors of pFabric, and the objective is well defined and measurable. The following plot from the original paper shows pFabric performing better than the other schemes. This is the plot we aim to reproduce.

[Figure: plot from the original pFabric paper comparing flow completion times of pFabric and other schemes]

We compared TCP over pFabric to standard TCP with DropTail and to DCTCP (“Data Center TCP (DCTCP)”, Alizadeh et al.). To perform the comparison against DCTCP we used a DCTCP-enabled EC2 instance. A simple star topology with a single server and multiple hosts is created using Mininet, and flows of various sizes are generated using iPerf. At the end of the test run, flow completion times are extracted from the output of the iPerf sessions.
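The extraction and normalization step can be sketched roughly as follows. This is not the code used by our scripts: the regular expression assumes the usual iPerf 2 summary line format, the function names are hypothetical, and the ideal completion time is taken to be flow size divided by link rate, ignoring propagation delay and header overhead.

```python
import re

# Assumed iPerf 2 report line, e.g. "[  3]  0.0- 2.5 sec  1.25 MBytes  4.19 Mbits/sec".
# Only lines whose interval starts at 0.0 are matched; the last one is the summary.
SUMMARY_RE = re.compile(r"\[\s*\d+\]\s+0\.0-\s*(\d+(?:\.\d+)?)\s+sec")


def completion_time(iperf_output: str) -> float:
    """Return the flow completion time in seconds from an iPerf session's output."""
    matches = SUMMARY_RE.findall(iperf_output)
    if not matches:
        raise ValueError("no iPerf summary line found")
    return float(matches[-1])


def normalized_fct(fct_seconds: float, flow_bytes: int, link_bps: float) -> float:
    """Divide the measured time by the time the flow would need on an idle link."""
    ideal_seconds = flow_bytes * 8 / link_bps
    return fct_seconds / ideal_seconds


sample = "[  3]  0.0- 2.5 sec  1.25 MBytes  4.19 Mbits/sec\n"
fct = completion_time(sample)                     # 2.5 seconds
ratio = normalized_fct(fct, 1_250_000, 10_000_000)  # ideal is 1.0 s, so ratio is 2.5
```

A normalized value of 1 would mean the flow finished as fast as the link allows, so under these assumptions values below 1 indicate a measurement or parsing problem rather than a real result.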

Unfortunately, we have not been able to reproduce the results presented in the pFabric paper. The performance of the pFabric scheduler was similar to that of standard TCP. For the sake of performance, we kept logging from the kernel module to a minimum while running it as part of the evaluation setup. A deeper examination of the scheduler’s behavior is needed to ensure that it performs well at link rate (even though we scaled down the link bandwidth). Also, the execution cycles were long, which made it hard to experiment and to find suitable values for flow sizes, bandwidth etc. by trial and error.
We suspect that not updating a flow’s priority during its lifetime hurts our performance. In that respect our implementation differs from the scheme described in the paper and might lead to different results.

We obtained the following unexpected plot:

[Figure: our evaluation plot of average normalized flow completion time for different loads]

We cannot fully explain this behavior. Perhaps we face some scheduling inefficiency in the kernel module and need to scale down the bandwidth further to obtain plausible results. A likely reason pFabric performed particularly poorly in this scenario is that we did not dynamically update flow priorities based on remaining flow length, but kept them constant based on flow size. Elephant flows with low priority were therefore likely to be starved, increasing our normalized average flow completion time.

Reproduction Instructions

  1. Create a new EC2 instance based on the DCTCP-enabled “CS244_DCTCP_WIN13” instance.
    This is done by pressing “Launch Instance” in the EC2 dashboard, choosing the Quick Launch Wizard, choosing “More Amazon Images” from the configurations list, pressing Continue and entering DCTCP in the search text-box.
  2. After logging into the EC2 instance you have to clone the Git repository:
    # git clone https://bitbucket.org/ymcrcat/cs244-pa3
  3. Go to the cs244-pa3 directory and read the README file.
  4. As mentioned in the README, the simplest way to install and build the prerequisites for the experiment is to run
    # ./init.sh
    This should install some libraries needed for building iproute2 and specifically tc, initialize the iproute2 submodule and build the pFabric kernel module (if you try to play with pFabric kernel module unit-tests, notice that they might crash on an EC2 instance but should work on a local Linux or on a local virtual machine).
  5. Now you are ready to run the experiment. Take a look at run.sh and notice the different parameters that you may want to play with later, such as the number of hosts (generating traffic to the server), the number of flows generated by each host, the number of iterations (the more the better, but it takes more time), etc.
  6. Execute
    # sudo ./run.sh
    and wait for it to finish. It creates a directory named “pfabric-<date-time>” in which results for the different evaluated scheduling schemes are stored. The most interesting is the file evaluation.png which contains a plot of average normalized flow completion times for different loads.

Acknowledgements

Thanks to Prachetaa Raghavan for providing the DCTCP enabled EC2 instance.

References

  1. Linux Advanced Routing and Traffic Control (n.d.). Retrieved from http://lartc.org/.
  2. Components of Linux Traffic Control (n.d.). Retrieved from http://tldp.org/HOWTO/Traffic-Control-HOWTO/components.html.
  3. Linux Traffic Control How-To (n.d.). Retrieved from http://tldp.org/HOWTO/Traffic-Control-HOWTO/.
  4. Alizadeh et al. (n.d.), Data Center TCP (DCTCP).
  5. Alizadeh et al. (n.d.), pFabric: Minimal Near-Optimal Datacenter Transport.
  6. Alizadeh et al. (n.d.), Deconstructing Datacenter Packet Transport.
  7. Heller et al. (n.d.), Mininet. Retrieved from http://mininet.github.com/.

2 responses to “CS244 ’13: pFabric: Datacenter Packet Switching”

  1. 2.5/5

    The tests were easy to set up and run, although it would have been helpful to know what instance size to run on. We ended up going with c1.xlarge.

    We originally ran everything with the default configuration provided, but we realized (after 3 hours and only 7 iterations on nflows=2) that running the experiment with the default parameters (nruns=100) in run.sh would not finish in a reasonable amount of time.

    We decreased nruns from 100 to 5 and the tests still did not complete after 10 hours and only 4 iterations on nflows=4.

    We also tried nruns=3, and the script ran to completion in 4.25 hours. Unfortunately, the resulting graph did not match the figure in the blog post, with negative values for pFabric. Here is the resulting graph we got: https://reproducingnetworkresearch.files.wordpress.com/2013/03/evaluation.png
