CS244 ’15: Raft, Understandable Distributed Consensus


Melvin Low (mwlow@cs.stanford.edu)
Gabriel Kho ()

Introduction

Raft is a consensus algorithm for maintaining replicated state across a collection of machines. Consensus algorithms keep state consistent across machines even when some of them fail: all live machines eventually agree on the same sequence of state updates. The prototypical example is Paxos, which Leslie Lamport introduced in 1989 and which has served as the foundation of newer consensus algorithms. Paxos remains the primary algorithm used in classrooms to teach students about consensus.

Unfortunately, the structure of Paxos has made it notoriously hard to understand. In addition, the original paper lacks details that are necessary for real world implementation; in practice, significant changes are made to the algorithm during development. Raft attempts to solve these problems. Its primary goal is to increase understandability of consensus without sacrificing the fault tolerance and performance of Paxos.

Raft was evaluated through an experimental study on students from two universities. The students were presented with material on Raft and Paxos and then quizzed on both. Results indicated that Raft was indeed easier to understand than Paxos. In addition, an overwhelming majority of students reported that Raft would be easier to implement.

In this project we validate the correctness of Raft with empirical data from our own implementation. We also show that Raft is indeed very feasible for students (like ourselves) to understand and implement, as the original authors had hoped.

Motivation

Performance seems to be the most important consideration in networking; seldom do you hear anything about understandability. We were intrigued by the existence of an algorithm, Paxos, that had gained such notoriety for opaqueness that it engendered a second algorithm, Raft, created in large part to make consensus easier to understand and implement. We set out to implement Raft to see whether it actually worked, and whether it was indeed much easier to understand than Paxos.

Architecture Design and Challenges

We chose to implement Raft in Haskell because its strong static type system catches many errors at compile time, so code that compiles tends to be correct. Our project consists of around 900 lines of Haskell built on top of cloud-haskell, a library for developing distributed programs in the spirit of Erlang. We believe the project would have been significantly more complex in a different language.

Each server is represented by a node on a separate process, which communicates with the others by sending messages. The state on each node is ephemeral and only exists as long as the process is alive. The only persistent state is the set of applied log entries, which are written to CSV files, one per node. This design keeps things simple; in a production environment, most state would be persisted.
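As a rough illustration of this structure, the sketch below shows a skeletal message loop in cloud-haskell's Process monad, using its real receiveWait/match API. The message types and field names are hypothetical, chosen for this example rather than taken from our repository; a real node also handles elections, heartbeats, and the CSV writes described above.

    {-# LANGUAGE DeriveGeneric #-}
    import Control.Distributed.Process
    import Data.Binary (Binary)
    import GHC.Generics (Generic)

    -- Hypothetical message types; the actual implementation defines its own.
    data AppendEntries = AppendEntries { aeTerm :: Int, aeEntries :: [String] }
      deriving (Generic)
    instance Binary AppendEntries

    data RequestVote = RequestVote { rvTerm :: Int, rvCandidate :: ProcessId }
      deriving (Generic)
    instance Binary RequestVote

    -- Ephemeral per-node state: it lives only as long as the process does.
    data NodeState = NodeState { currentTerm :: Int, logEntries :: [String] }

    -- A node is a Process that loops forever, handling whichever message
    -- arrives next and carrying its updated state into the next iteration.
    nodeLoop :: NodeState -> Process ()
    nodeLoop st = do
      st' <- receiveWait
        [ match $ \(AppendEntries t es) ->
            -- Accept replicated entries; applied entries would also be
            -- appended to this node's CSV file.
            return st { currentTerm = max t (currentTerm st)
                      , logEntries  = logEntries st ++ es }
        , match $ \(RequestVote t _candidate) ->
            -- Vote bookkeeping elided; just track the newest term seen.
            return st { currentTerm = max t (currentTerm st) }
        ]
      nodeLoop st'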

Because state is ephemeral, we never actually kill a node's process during normal operation of our program. Instead, we simulate node death by exiling the node and severing its communications. That said, we did kill processes during testing and found our implementation to be remarkably resilient: it still reached consensus, just more slowly.
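A minimal sketch of that idea, using a hypothetical exiled flag rather than our actual types: an exiled node keeps its in-memory state but drops every message, so to the rest of the cluster it is indistinguishable from a dead node.

    -- Hypothetical node record with an 'exiled' flag (not the exact type
    -- used in our implementation).
    data Node = Node { exiled :: Bool, term :: Int }

    -- Apply a message handler only if the node is not exiled; an exiled
    -- node silently ignores the message, simulating severed communications.
    step :: (msg -> Node -> Node) -> msg -> Node -> Node
    step handler msg node
      | exiled node = node
      | otherwise   = handler msg node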

We allow the user to interact with the Raft cluster. The user can select the number of nodes in the cluster, kill and wake nodes, and send messages to any of them. All messages are replicated across the cluster and then written to CSV files located in the tmp directory.

The main challenge was learning Haskell, as neither of us had used it before. This was an arduous process that involved learning everything from basic IO to concurrency and locking. The rest came smoothly afterward, which we attribute to the great emphasis the Raft authors placed on understandability when designing the algorithm.

Testing Methodology

Raft guarantees that all live nodes in a cluster eventually reach consensus as long as two conditions hold:

1) A majority of nodes are live.
2) Broadcast time << election timeout << time between node failures (illustrated in the timing sketch below).
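For concreteness, here is a sketch with assumed timing constants; these particular values are illustrative and not necessarily the ones our implementation uses. Heartbeats every ~50 ms and election timeouts randomized between 150 and 300 ms satisfy the ordering above, provided nodes fail far less often than every few hundred milliseconds.

    import System.Random (randomRIO)

    -- Assumed broadcast (heartbeat) interval, in microseconds.
    heartbeatIntervalUs :: Int
    heartbeatIntervalUs = 50 * 1000

    -- Election timeouts are randomized (here 150-300 ms) so that candidates
    -- rarely time out at the same moment and split the vote.
    randomElectionTimeoutUs :: IO Int
    randomElectionTimeoutUs = (* 1000) <$> randomRIO (150, 300)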

In our implementation, CSV files represent the current state of the nodes. Thus, all CSV files belonging to live nodes should eventually become identical.

Testing is simple: we selectively kill nodes, send messages to the remaining nodes, and check that the CSV files of the remaining nodes come to consensus. After we restart the nodes that we have killed, we check that their corresponding CSV files have been updated to match those of the live nodes. Because of the second condition above, we wait a few seconds between test cases to allow the nodes to replicate messages and reestablish consensus.
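The check itself can be expressed compactly. The sketch below reads every CSV file in an output directory and verifies that the contents are identical; the function name and directory argument are ours for this example (tmp in our runs), and the automated test achieves the same effect by shelling out to diff.

    import qualified Data.ByteString as BS
    import System.Directory (listDirectory)
    import System.FilePath ((</>), takeExtension)

    -- Return True iff all CSV files in the directory have identical contents,
    -- i.e. the nodes have converged on the same applied log.
    allNodesAgree :: FilePath -> IO Bool
    allNodesAgree dir = do
      files    <- filter ((== ".csv") . takeExtension) <$> listDirectory dir
      contents <- mapM (BS.readFile . (dir </>)) files
      return $ case contents of
        []       -> True
        (c : cs) -> all (== c) cs

Called after the killed nodes have been restarted and the cluster has had a few seconds to re-replicate, this check should return True.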

Results

Our implementation upheld the correctness of Raft across a variety of test cases. Specifically, we found that as long as the basic conditions of Raft were maintained (a majority of nodes were always alive, and nodes were not killed too rapidly), all nodes eventually reached consensus. Consensus was determined by examining and comparing the CSV files written by the nodes.

Because of the randomized nature of Raft, it was hard to test for all edge cases. We leave further testing to the user, who can interact with the Raft cluster and kill nodes at will.

We did not have much trouble understanding the Raft algorithm. We sincerely believe that the authors did a commendable job designing the algorithm for understandability.

Instructions for Evaluation

The git repository, as well as instructions for installation and running, can be found here: https://github.com/mwlow/raft-haskell

To verify correctness, after each run, compare the CSV files in the tmp directory to ensure that they are all the same. The automated test (i.e. cabal run test [numNodes]) will automatically call diff on all the CSV files. See Testing Methodology above for an explanation.

Visualization

We have created a small visualization (for < 10 nodes) for the CSV files. Instructions can be found here: https://github.com/mwlow/raft-haskell/tree/master/viz

References

The Raft Consensus Algorithm: https://raftconsensus.github.io/
Cloud Haskell (distributed-process): http://haskell-distributed.github.io/

One response to “CS244 ’15: Raft, Understandable Distributed Consensus”

    Though it took a while to complete the setup (installing Haskell and all of the cabal-install dependencies), the test results seem great. We would give it a 5/5 for being reproducible: the distributed consensus algorithm appears to work, since all of the CSV files are the same after the tests run. We ran the tests with 3, 5, 25, and 100 nodes, and it worked each time.

    It is pretty amazing that the authors of the blog post implemented the entire algorithm in Haskell when they had no prior experience with it. It seems the algorithm really does make consensus easy to understand and implement; getting a distributed consensus algorithm to work correctly is difficult.

    One other piece of feedback: the system is not truly distributed until the algorithm is run on different hosts rather than processes or threads on one machine. It would be nice to see an extension in future years where this runs on an EC2 Auto Scaling group and the test involves randomly terminating instances.

    Regards,
    Ben & Xiaoxi
