CS244 ’17: iSLIP the Switch Scheduling Problem


Authors:

Kate Stowell <kstowell@stanford.edu>

Travis Lanham <tlanham@cs.stanford.edu>

 

Paper:

McKeown, Nick. “The iSLIP scheduling algorithm for input-queued switches.” IEEE/ACM transactions on networking 7.2 (1999): 188-201.

Introduction

Packet switches are the backbone of modern computer networks and make possible the local area networks used for the distributed applications that make up the modern Internet. The packet switching abstraction is integral to the architecture of the Internet and performant systems require high throughput switches with bounded latency that are simple to implement for cost efficiency.

Designing high speed switches requires an efficient scheduling algorithm to act as an arbiter between input requests and output availability. The throughput of input-queued switches is bounded by the efficiency of the scheduling algorithm, making it an important and rich area for research. The iSLIP algorithm that the paper introduces addresses issues with previous algorithms to achieve close to 100% throughput.

Goals: What problem was the original paper trying to solve?

The original paper sought to expand the family of algorithms that address that head of line blocking problem that input-queued switches face. It builds off previous work that demonstrated that almost 100% throughput was possible on input-queued switches.

The essence of the switch scheduling problem is that cells (packets) enter the switch on a given input and must exit on a certain output. For example, a typical switch might be 16×16 which means it has 16 inputs that it can receive cells on and 16 outputs that it can send cells out. The switch scheduler must decide when an input can use a certain output. A given input can forward at most one cell to an output at one time; there is contention for an output if more than one input is requesting to forward to that output.

The paper aims to resolve this bipartite matching problem in a way that can be implemented easily in hardware with bounded latency to maintain line rate. The paper sought to improve existing switching algorithms to reduce latency for high loads.

The figure below [4] shows the fundamental problem of switch scheduling, namely optimizing the matching M given input requests and available outputs.

Screen Shot 2017-06-03 at 3.04.07 PM.png

Motivation: Why is the Problem Important/Interesting?

The switch scheduling problem is important to the scalability of ethernet switches that form the backbone of local area networks that connect data centers, enterprises, universities, and more. These networks share the characteristic (particularly in the 1990s when the paper was written) of having high intra-network transfer requirements (ex. servers within a datacenter send information to each other for distributed applications, researchers at a university share results with each other).

In order to meet high network throughput requirements, switches must have low latency for packet forwarding. Without performant switches, network latency would be much higher, requiring additional hardware that costs more to realize network designs for the large-scale distributed systems that make the Internet useful. Additionally, multi-media applications that are sensitive to network latency would be prohibitively costly or impossible.

These difficulties are not theoretical, previous switch designs that used FIFO scheduling policies were limited to 58% throughput. The goal of the paper and its predecessors is to find better algorithms to solve the scheduling problem.

Although there has not been much research on switch scheduling in the past decade, it remains a critical part of networking architecture that should be understood and supported with a verified theoretical foundation.

Results: What Did the Original Author Find?

The iSLIP algorithm builds on work done with parallel iterative matching (PIM) for switch scheduling, extending the original PIM algorithm with a round-robin schedule. The paper presented both iterative and noniterative versions of the SLIP scheduling algorithm. The paper shows that SLIP can achieve 100% throughput for uniform traffic, theoretically matching the capability of an output-queued switch. The round-robin matching that SLIP uses prevents input queues from being starved and provides fair access to output queues. In the nonuniform case, SLIP performs similarly to round-robin policy among the busy queues.

RRM and SLIP both keep track of a highest priority element for each input and output that is updated in a round-robin fashion. SLIP differs from RRM in the grant stage. Instead of incrementing the pointer to highest priority element (modulo N) in every round, SLIP only updates the highest priority element for an output if its grant was accepted. PIM is more complicated to implement in hardware because it uses randomness in its grant and accept stages.

Screen Shot 2017-06-03 at 1.26.33 PM.png

Although SLIP is based on RRM, the author found that changing the grant stage allowed for the desynchronization of output arbiters and caused SLIP to outperform RRM. The paper found that SLIP outperforms the PIM algorithm, allowing for higher load (throughput) while maintaining bounded latency. SLIP takes the PIM approach of having each input arbiter with a cell request to an output arbiter with a corresponding response but instead of having the output arbiter choose among multiple input requests at random, SLIP rotates among input arbiters.

Consider a simplified example with 3 input arbiters, Ai, Bi, and Ci, and 3 output arbiters, Ao, Bo, and Co. With a single iteration of SLIP, say Ai requests Ao, Bi requests Ao, and Ci requests  Co. Say the output arbiters make the following grants, Ao grants Ai and Co grants Ci. On the next cycle, say we have the same requests. This time, Ao will grant to Bi since it rotates among inputs. This prevents starvation and helps to find a better maximum match with fewer iterations that PIM which grants inputs randomly.

Subset Goal: What Subset of Results Did You Choose to Reproduce?

We choose to reproduce Figure 5 in the paper (shown below), which compares the FIFO, PIM, RRM, and SLIP scheduling algorithms. The comparison in the paper was done with Bernoulli arrivals and with destinations uniformly distributed over all outputs on a 16×16 switch. We measure the average delay per cell and compare this to the cell delay described in the paper. We used a switch scheduling simulator to verify the performance of the algorithms and test the SLIP algorithm implementation the author used.

Screen Shot 2017-06-03 at 4.38.44 PM.png

Subset Motivation: Why that particular choice?

We chose to reproduce Figure 5 because it is the key result of the paper, depicting the relative performance of SLIP against the other algorithms that it seeks to improve upon. Furthermore, Figure 5 depicts PIM reaching unbounded latency before 70% load. While the High-Speed Switch Scheduling for Local-Area Networks by Anderson et. al. shows similar performance for PIM with one iteration, it claims significantly better performance as the number of iterations approaches log2N, but only graphs average cell latencies below 25. Therefore, we also investigated the performance of PIM and iSLIP with 1,2,3, and 4 iterations. The figure in the paper (the first figure below) is somewhat misleading as it shows the theoretical maximum for iterative matching with infinite iterations which is infeasible for a hardware switch needing to maintain line rate. We found that a single iteration of PIM performed closer to FIFO as seen in the second graph below and verified by our reproduction.

Screen Shot 2017-06-03 at 4.39.21 PM.png

Screen Shot 2017-06-03 at 4.39.40 PM.png

Reproduction Results

First, we tested the simulator using an array of algorithms and configurations and comparing the simulation results to theoretical expectations. Through this we were able to verify the correctness of the simulator for uniform arrival patterns since it can be modeled mathematically.  

Then, using the switch simulator we evaluated the performance of iSLIP, PIM, RRM, and FIFO on a 16×16 switch using bernoulli arrivals and with traffic uniformly distributed across all outputs and loads between twenty and one hundred percent. This resulted in the following graph which closely resembles the Figure from the iSLIP paper.

Screen Shot 2017-06-03 at 4.38.44 PM

fig5.png

In our reproduction of the PIM results, our simulation results closely matched the figure from the original paper. Additionally, we graphed the results for PIM with 1, 2, 3, and 4 iterations on a log scale with higher latency values visible. The same simulation was performed for iSLIP.

pimslogpims (1)islipslog

We also experimented with different arrival patterns (bursty, etc) but found predictably bad performance for them across all algorithms.

Challenges

One of the challenges we faced when reproducing the results of this paper was finding a simulator to test the algorithms. We could not find any simulators for the switch scheduling abstraction. Mininet does not allow configurability of switch scheduling and the switch abstraction it provides cannot be extended to do so. We looked at other simulators like ns2 and found the same issue that none delivered the low level abstraction of switch scheduler simulation. We hypothesize that simulators other than the one we used exist but are proprietary to the switch vendors that made them for research. Additionally, we encountered the problem that little substantial work has been done on the switch scheduling problem since the early 2000s.

We ultimately obtained the switch simulator used by the author, SIM, which simulates switching behavior is software for ATM switches. The simulator has been used in several papers as well as for the EE384X class at Stanford.

Once we had the simulator, we updated it and fixed bugs to get it working. We then tested the simulator against many different configurations and verified its correctness against the mathematical models for the algorithms’ performance.

Another challenge we saw was in our preliminary testing was large standard deviations for latency data and minor deviations from theoretical performance and even the figures in the iSLIP paper. We found that increasing simulation run length smoothed out these deviations (we found 100000 cycles sufficient).

We also extended the simulator, adding a front end harness to make running easier. Our front end generates config files for a run configuration (ports, switches, load, algorithm, time to run) and graphs from the results of one or more runs.

Critique and Conclusions

A general critique we had for the PIM paper [2] was having an ambiguous graph that made it seem as though the performance of PIM was almost equivalent to that of output-queuing when it is so only after log(n) and above iterations and for a single iteration is much closer to FIFO.

We were able to successfully reproduce Figure 5 from the iSLIP paper, our main goal, which illustrated the improvement that iSLIP made over previous algorithms with only a single iteration as well as several other figures that illustrate the performance and qualities of different scheduling algorithms for input-queued switches. Although the focus of our reproduction was not on the feasibility of hardware implementation, it is a crucial topic and we think the papers should have addressed it more openly ([4] goes into the iSLIP hardware details at depth but they were largely left out of the paper), especially since the hardware

In general, it seems that networking research shies away from the details of hardware implementation. This is not generally a negative quality, however, for some papers the hardware details are inextricably tied to the system and should be addressed more. The research community should invite more insight at this level as it fundamentally affects the performance of networking systems.

Overall, we thought it was important to examine the paper even though it was published over 20 years ago since it was a substantial contribution to switch scheduling and marked the last major research publication about it (based on citations).

Reproducing our Results / Platform

Instructions for reproducing our results and the code for our experiments can be found here: https://github.com/lanhamt/switch-scheduling.
We used an Ubuntu 16.04 LTS virtual machine on Google Compute Cloud. We used matplotlib for our graphing frontend to the simulator. A script can be found in the repository for installing needed dependencies as well as instructions for sim.

Once sim is set up, our scripts are installed, and all needed dependencies are ready, there is a run.sh script that will generate the main graphs we reproduced.

To reproduce the graphs for Figure 5, PIM, and iSLIP run the following command:

sudo ./run.sh

This will produce files called fig5.png, pims.png, pimslog.png, islips.png and islipslog.png. These will be in the graphs/ directory. The easiest way to access them if using a remote vm is using python simpleserver which we recommend; it can be run with the following command “sudo python -m SimpleHTTPServer 80” and then accessed in a local web browser using the external ip of the vm.

Our sim scripts accommodate many different configurations (ports) and algorithms so other simulation runs can be done using them beyond the ones needed to generate the figures that run.sh creates.

References

The original SIM website can be found at: https://web.archive.org/web/20100719163607/http://klamath.stanford.edu/tools/SIM/

[1] McKeown, Nick. “The iSLIP Scheduling Algorithm for Input-Queued Switches.” IEEE/ACM Transactions on Networking, 7(2):188–201, 1999.

[2] Anderson, Thomas, Susan Owicki, James Saxe, and Charles Thacker. “High-Speed Switch Scheduling for Local-Area Networks.” ACM Transactions on Computer Systems 11(4), 1993, 319–352.

[3] McKeown, Nick, and Thomas E. Anderson. “A Quantitative Comparison of Iterative Scheduling Algorithms for Input-Queued Switches.” Computer Networks and ISDN systems 30.24 (1998): 2309-2326.
[4] McKeown, Nicholas. Scheduling Algorithms for Input-Queued Cell Switches. Diss. University of California at Berkeley, 1992.

One response to “CS244 ’17: iSLIP the Switch Scheduling Problem

  1. Reproducibility score: 5/5. It was really easy to run your script. On our instance it finished running in about 30 mins, and the figures produced matched the figures in the report/README perfectly. We thought the quality of the analysis was also great (was able to reproduce the figure in the paper very closely). It would have been nicer if you could comment about what you found as a result of extending the iteration graphs to cell latency of 10^3. Is that what you expected/do you think extending the graph gave you some new information the original graph was missing?

Leave a comment