CS244 ’13: High Speed Switch Scheduling

Team : Aditya Gudipati (adityag1@stanford.edu), Dinesh Bharadia(dineshb@stanford.edu)

Topic : Analyze performance of different switch scheduling algorithms under asymmetric and uniform traffic patterns

Key Results :

  • PIM with 4 iterations performs as well as Output Queued Switch under uniform traffic pattern.
  • FIFO scheduling leads to Head-of-line blocking resulting in a maximum throughput of around 50 – 60% under uniform traffic pattern
  • Maximum Weight bipartite-matching performs as well as OutputQueued Switch under different traffic patterns
  • Maximum size bipartite-matching can lead to unstable queues under certain asymmetric traffic patterns
  • Maximal size bipartite-matching with a speedup of 2 performs as well as Output Queued Switch under different traffic patterns


[1]Anderson, Thomas E., et al. “High-speed switch scheduling for local-area networks.” ACM Transactions on Computer Systems (TOCS) 11.4 (1993): 319-352.

[2] McKeown, Nick, Venkat Anantharam, and Jean Walrand. “Achieving 100% throughput in an input-queued switch.” INFOCOM’96. Fifteenth Annual Joint Conference of the IEEE Computer Societies. Networking the Next Generation. Proceedings IEEE. Vol. 1. IEEE, 1996.

[3] http://www.stanford.edu/~ashishg/papers/speedup.ps

[4]  http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=hungarianAlgorithm


Output queued switches achieve the ideal goal of a 100% throughput. However, due to memory bandwidth issues, output queued switches are not feasible if we need a large fan-out. In particular, assuming an NxN output-queued switch, the buffer-size requirement as well as the write-speed on the buffer increases by a factor of N as compared to an input-queued switch.

Thus the aim is to achieve the same performance as an output-queued switch, while having the reduced buffer-size and write speed requirements as an input-queued switch. To achieve a high throughput using Input queued switches, we need to maintain VOQs (Virtual Output Queues) at each input port (to avoid head of line blocking). Moreover, we require intelligent switch scheduling algorithms.

Mininet only simulates Output Queued Switches and hence cannot be used to simulate different scheduling algorithms.  For this reason, we have written a C++ simulator framework from scratch, which can simulate various types of switches (Input Queued, Output Queued, Combined Input Output Queued) using different scheduling algorithms and under different traffic patterns.  Using this simulation framework we then try to reproduce figures 3,4 and 5 from [1]. We also make an attempt to test other well-known claims that were made regarding switch scheduling algorithms since this paper was published (refer to the “key results” section above).
We would first like to walk the reader through the architecture of our simulation framework and briefly describe the important classes involved.  Architecturally, we can simulate an entire network of switches (even though all our results are based on a simple network of a single switch). Assumptions made in the simulation framework :
a) Simple topology : one NxN switch. N nodes generating traffic , each of which is “connected” to an input port of the switch. N nodes consuming traffic, each of which is “connected” to an output port of the switch. We call all these connections “links”. Hence there are N links between traffic generator nodes and input ports and there are N links between traffic sink nodes and output ports.
b) Zero scheduling delay and all delay encountered by a packet is either link delay or queuing delay.
c) All links to be of equal bandwidth and latency – i.e.  1 packet arrives on each input port at every time instant and it takes 1 time instant for a packet to traverse on any link.
A brief description of the classes :
1) Link : Same as the “link” described above. It models the connections between ports of the switch and traffic generator/sink nodes.
2) Source : Any node that can send traffic on a link (subclasses : TrafficGenerator, InputPort)
3) Sink : Any node that can receive traffic on a link (subclasses : TrafficSink, OutputPort)
4) Switch : Equivalent to the actual “switch”. Consists of a vector of InputPorts and OutputPorts. (subclasses : IQSwitch, OQSwitch, CIOQSwitch).
5) InputPort : Splits into 2 subclasses based on whether buffering is available or not.(subclasses : QueuedInputPort (present in IQSwitch and CIOQSwitch), NonQueuedInputPort (present in OQSwitch ))
6) OutputPort : Splits into 2 subclasses based on whether buffering is available or not.(subclasses : QueuedOutputPort (present in OQSwitch and CIOQSwitch), NonQueuedOutputPort (present in IQSwitch ))
A brief description of functions:
1) virtual Source::SendPacket() :The traffic generators create packets and write one of them to the link. The QueuedOutputPorts just pick up a packet from their buffer and write it to the link. The NonQueuedOutputPorts first figure out the QueuedInputPort scheduled for it and take the packet from its buffer.
2) virtual Sink::ReceivePacket(): Reverse functionality to SendPacket(). Packets are read from the link and stored in an internl buffer, except in NonQueuedInputPorts, where the packet is directly written to the buffer of the corresponding OutputPort.
3) virtual Switch::ScheduleSwitch(): Run the scheduling algorithm on the switch. In CIOQ switches, we run the scheduling algorithm twice (speedup = 2) and in OQ switches, this function does not do anything.
All our simulations are run for 50000 time instants. Due to memory constraints and in order to replicate a more realistic scenario, we fix a maximum buffer size on all buffers (input and output). Any further packets are dropped and a count is kept of the number of dropped packets (the fraction of dropped packets is also plotted). Our simulation results are on a 3×3 switch with a maximum buffer size of 2000 packets.
First, we replicate the figures 3 and 5 in [1]. The comparison is between FIFO, PIM (with different number of iterations) and Output Queued Switch. PIM with 4 iterations is good enough to get close to the performance of Output Queued Switches. Also, we confirm that increasing the number of iterations in PIM gives diminishing returns.
Having analyzed PIM, we now focus our attention on other algorithms and compare their performances over a uniform traffic pattern.
Brief description of the scheduling algorithms :
1) FIFO : Only schedules (if it can) the first packet at each input port buffer.
2) Maximal Size Matching : Finds a bipartite matching whose size cannot be increased trivially (by simply adding more edges).
3) Maximum Size Matching : Finds a bipartite matching with the maximum size
4) Maximum Weight Matching : Finds a bipartite matching between input ports and output ports with the maximum weight. The weight of each edge is the number of packets on the corresponding VOQ. We use the algorithm described in [4].
5) Maximal Size matching with speedup of 2 : Same as Maximal Size matching, but the algorithm is run twice every time instant . This requires a Combined Input-Output Queued Switch (with buffers on the output ports too). This was first described in [3].
Under uniform traffic, FIFO scheduling leads to queue buildups and can only ensure about 55% throughput. Rest of the schemes achieve 100% throughput in this particular traffic pattern.
To test the Maximum and maximal size bipartite-matchings under asymmetric traffic, we pick an example traffic pattern discussed in [2] (and shown here). This is a peculiar traffic pattern, which inspite of being admissible, leads to unstable queues under maximum size matching. We observe that the delay of packets with maximum size bipartite-matching grows unbounded at load > 0.85. Maximal size matchings and PIM (under 4 iterations) also exhibit a similar behavior.
As expected, maximum weight matching and maximal size matching (with speedup of 2) achieve 100% throughput. But we observed an interesting fact that even FIFO achieves 100% throughput probably due to minimal Head of line blocking , since there are only 4 flows instead of 9).
  • PIM with 4 iterations performs almost as well as an Output Queued Switch under uniform traffic (from figures under the “PIM analysis” section).
  • FIFO does not ensure maximum throughput under uniform traffic patterns (from figures under the “uniform traffic” section).
  • Maximum size matching, maximal size matching and PIM do not ensure maximum throughput under asymmetric traffic patterns. (from figures under the “asymmetric traffic” section).
  • Overall, we can conclude that, Maximum Weight Matching and Maximal Size Matching (with a speedup of 2) are the only algorithms (out of the ones we analyzed) which achieve 100% throughput under different traffic conditions.
  • Download the code folder from the bitbucket repository listed above.
  • cd <path to repo>/src/
  • ./run.sh
  • The code can take upto 6 hours to run for generating all the graphs.
  • The graphs should be automatically generated in <path to repo>/plots/  .

2 responses to “CS244 ’13: High Speed Switch Scheduling

  1. 5/5. Results accurately matched those in the post above. However, instructions could have been better. The run.sh script was not up to date and we had no idea how long the experiment would run. We then had to clarify this with the group. Overall, you guys have done a good job!

    • thanks for the comment. The bitbucket repository is now updated with the latest run.sh script and we have added proper instructions on the blog post. Hopefully the TAs should not face any problems.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s