TL;DR — Raft is a leader‑based consensus protocol designed to make distributed state machine replication simple to reason about. It guarantees safety through a replicated log and decomposes the problem into leader election, log replication, and membership changes, each governed by rules simple enough to implement in a few hundred lines of code.

Distributed systems need a way to keep multiple machines in lockstep despite crashes, network partitions, and message reordering. State machine replication (SMR) provides that guarantee: if every node executes the same deterministic commands in the same order, their state remains identical. Raft is a consensus algorithm specifically designed to drive SMR while remaining understandable and implementable. This article walks through the theory, the core mechanics, and practical nuances of Raft, with code snippets, real‑world examples, and pointers to further reading.

What Is State Machine Replication?

State machine replication is the foundational technique that turns a cluster of unreliable nodes into a single logical service.

  • Deterministic commands – Each client request is transformed into a command that, when applied to a state machine, always yields the same result given the same prior state.
  • Total order – All replicas must apply commands in exactly the same sequence. If order diverges, states diverge.
  • Fault tolerance – By replicating the state machine on N nodes and tolerating up to ⌊(N‑1)/2⌋ failures, the system remains available as long as a majority (a quorum) is reachable.

In practice, SMR is implemented by replicating a log of commands. Each entry in the log records the command and its position (index) and term (the election epoch during which it was created). The log is the single source of truth for ordering; the state machine simply replays the log entries in order.
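
To make this concrete, below is a minimal sketch (not taken from any particular implementation) of replaying a command log against a trivial key‑value state machine; the "set x=5" / "increment y" command format mirrors the example log later in this article and is purely illustrative.

# Minimal sketch: replaying a replicated log against a key-value state machine.
# The command strings ("set x=5", "increment y") are a hypothetical convention.

def apply_command(state, command):
    # Deterministic: the same command applied to the same state always yields the same result.
    op, _, rest = command.partition(" ")
    if op == "set":
        key, _, value = rest.partition("=")
        state[key] = int(value)
    elif op == "increment":
        state[rest] = state.get(rest, 0) + 1
    return state

log = [
    {"index": 1, "term": 1, "command": "set x=5"},
    {"index": 2, "term": 1, "command": "increment y"},
]

state = {}
for entry in log:                 # every replica replays entries in index order
    apply_command(state, entry["command"])
print(state)                      # {'x': 5, 'y': 1}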

“A replicated log is the bridge between consensus and state machine replication.” – Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm

The Raft Algorithm Overview

Raft divides the consensus problem into three relatively independent sub‑problems:

  1. Leader election – Selecting a single node that will coordinate log replication.
  2. Log replication – Ensuring the leader’s log entries are copied to followers and committed in a consistent order.
  3. Safety and membership changes – Handling configuration changes without breaking the guarantee that at most one value can be committed for any index.

Raft’s design principle is simplicity: each sub‑problem is expressed in a handful of rules that map directly to code. The algorithm proceeds in terms, numbered monotonically. Each term starts with an election; the elected leader serves until it either fails or steps down.

Terms, Roles, and Persistent State

Every Raft node stores three pieces of persistent state on stable storage (e.g., a disk) so that a crash‑restart does not lose critical information:

# Persistent state on all servers
currentTerm: integer        # latest term server has seen (initialized to 0)
votedFor: integer|null      # candidateId that received vote in current term (or null)
log:
  - index: 1
    term: 1
    command: "set x=5"
  - index: 2
    term: 1
    command: "increment y"

Additionally, each node maintains volatile state:

  • role – one of follower, candidate, or leader.
  • commitIndex – index of highest log entry known to be committed.
  • lastApplied – index of highest log entry applied to the state machine.

Leaders also keep:

  • nextIndex[] – for each follower, the index of the next log entry to send.
  • matchIndex[] – for each follower, the index of the highest log entry known to be replicated.

All of these structures are the basis for the rules that follow.
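
Putting these pieces together, one possible in‑memory layout (an illustrative sketch; the field names follow the paper, the class itself is an assumption) looks like this:

# Illustrative sketch of the per-node state described above.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RaftNode:
    id: int = 0                                       # this server's identifier
    # Persistent state (must survive crash-restart)
    current_term: int = 0
    voted_for: Optional[int] = None
    log: list = field(default_factory=list)           # entries: {"index", "term", "command"}
    # Volatile state on all servers
    role: str = "follower"                             # "follower", "candidate", or "leader"
    commit_index: int = 0
    last_applied: int = 0
    # Volatile state on leaders, reinitialized after each election
    next_index: dict = field(default_factory=dict)     # follower id -> next entry to send
    match_index: dict = field(default_factory=dict)    # follower id -> highest replicated entry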

Leader Election in Raft

When a node starts, it is a follower. If it receives no communication from a leader or candidate within an election timeout (randomized per node, commonly in the range of 150–300 ms), it transitions to candidate and initiates an election.

Election Steps

  1. Increment term and vote for itself.
  2. Broadcast RequestVote RPCs to all other nodes.
  3. Collect votes. If a candidate receives votes from a majority of nodes, it becomes the leader (a candidate‑side sketch follows this list).
  4. Handle higher terms – if a candidate learns of a term greater than its own (via RequestVote or AppendEntries), it reverts to follower.
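
A minimal candidate‑side sketch of these steps follows; it reuses the RaftNode layout sketched earlier, and send_request_vote is a hypothetical transport helper that returns each peer's reply as a dict.

# Sketch of a candidate starting an election (steps 1-3, plus the higher-term check).
# send_request_vote(peer, request) is a hypothetical helper returning a reply dict.

def start_election(node, peers, send_request_vote):
    node.current_term += 1                 # step 1: enter a new term
    node.role = "candidate"
    node.voted_for = node.id               # vote for itself
    votes = 1
    for peer in peers:                     # step 2: broadcast RequestVote
        reply = send_request_vote(peer, {
            "term": node.current_term,
            "candidateId": node.id,
            "lastLogIndex": node.log[-1]["index"] if node.log else 0,
            "lastLogTerm": node.log[-1]["term"] if node.log else 0,
        })
        if reply["term"] > node.current_term:      # step 4: someone has a higher term
            node.current_term = reply["term"]
            node.role = "follower"
            return
        if reply["voteGranted"]:
            votes += 1
    if votes > (len(peers) + 1) // 2:              # step 3: majority of the whole cluster
        node.role = "leader"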

The RequestVote RPC includes enough information for a follower to decide whether to grant its vote:

# Simplified RequestVote RPC payload
request = {
    "term": current_term,
    "candidateId": self.id,
    "lastLogIndex": self.log[-1].index if self.log else 0,
    "lastLogTerm": self.log[-1].term if self.log else 0
}

A follower grants its vote if:

  • The candidate’s term is at least as large as its own,
  • It has not already voted for a different candidate in the current term (votedFor is null or matches candidateId), and
  • The candidate’s log is at least as up‑to‑date as the follower’s log (the log up‑to‑date check).

The up‑to‑date rule is crucial for safety: it prevents a candidate with a stale log from becoming leader and possibly overwriting committed entries.
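
The comparison is by last log term first, then last log index. A minimal follower‑side sketch of the vote decision, using the same RaftNode layout and request fields as above:

# Follower-side sketch of the vote-granting rules, including the up-to-date check:
# the candidate's (lastLogTerm, lastLogIndex) must be >= the follower's own.

def handle_request_vote(node, req):
    if req["term"] < node.current_term:
        return {"term": node.current_term, "voteGranted": False}
    if req["term"] > node.current_term:            # newer term: adopt it and forget the old vote
        node.current_term = req["term"]
        node.voted_for = None
        node.role = "follower"
    last_index = node.log[-1]["index"] if node.log else 0
    last_term = node.log[-1]["term"] if node.log else 0
    up_to_date = (req["lastLogTerm"], req["lastLogIndex"]) >= (last_term, last_index)
    if node.voted_for in (None, req["candidateId"]) and up_to_date:
        node.voted_for = req["candidateId"]
        return {"term": node.current_term, "voteGranted": True}
    return {"term": node.current_term, "voteGranted": False}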

Election Timers and Split Votes

Because election timeouts are randomized, the probability of split votes (no candidate achieving a majority) is low. If a split does occur, all candidates time out again, increment the term, and retry with fresh random timeouts; repeated rounds make eventual convergence overwhelmingly likely.
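
Choosing the timeout is a one‑liner; the range below simply reflects the 150–300 ms figure mentioned above, and any comparable range works:

# Sketch: pick a fresh randomized election timeout for each round.
import random

def new_election_timeout():
    return random.uniform(0.150, 0.300)   # seconds; re-drawn after every timer reset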

Log Replication and Safety

Once a leader is elected, it begins accepting client commands. Each command is appended to its own log as a new entry, then the leader attempts to replicate the entry on a majority of followers using the AppendEntries RPC.

AppendEntries RPC

{
  "term": 5,
  "leaderId": 3,
  "prevLogIndex": 42,
  "prevLogTerm": 4,
  "entries": [
    {"index": 43, "term": 5, "command": "set z=9"}
  ],
  "leaderCommit": 41
}

The leader includes the index and term of the entry that immediately precedes the new entries (prevLogIndex, prevLogTerm). This allows a follower to detect inconsistencies:

  • If the follower’s log does not contain an entry at prevLogIndex with term prevLogTerm, it rejects the RPC.
  • The leader then decrements that follower’s nextIndex and retries until the logs match; a follower‑side sketch of this consistency check appears after this list.
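
A minimal follower‑side sketch, following the AppendEntries fields shown above (the conflict handling follows the paper’s receiver rules; the data layout is an assumption):

# Follower-side sketch of AppendEntries: consistency check, conflict resolution, commit update.

def handle_append_entries(node, req):
    if req["term"] < node.current_term:
        return {"term": node.current_term, "success": False}
    node.current_term = req["term"]
    node.role = "follower"
    # Consistency check: an entry must exist at prevLogIndex with term prevLogTerm.
    if req["prevLogIndex"] > 0:
        prev = next((e for e in node.log if e["index"] == req["prevLogIndex"]), None)
        if prev is None or prev["term"] != req["prevLogTerm"]:
            return {"term": node.current_term, "success": False}  # leader will back up and retry
    # Resolve conflicts: same index but different term means delete that entry and all after it.
    for entry in req["entries"]:
        existing = next((e for e in node.log if e["index"] == entry["index"]), None)
        if existing is None:
            node.log.append(entry)
        elif existing["term"] != entry["term"]:
            node.log = [e for e in node.log if e["index"] < entry["index"]]
            node.log.append(entry)
    # Advance commitIndex as far as the leader allows, but never past our last entry.
    if req["leaderCommit"] > node.commit_index:
        last_index = node.log[-1]["index"] if node.log else 0
        node.commit_index = min(req["leaderCommit"], last_index)
    return {"term": node.current_term, "success": True}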

Commitment Rules

An entry is committed when the leader knows that it is stored on a majority of nodes and its term is the current term. The rule is expressed as:

If a log entry at index i is stored on a majority of servers and its term equals the leader’s current term, then the entry is committed.

The leader advances its commitIndex accordingly and notifies followers via the leaderCommit field in subsequent AppendEntries calls.
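
On the leader this amounts to finding the highest index stored on a majority, which can be read off the sorted matchIndex values. A sketch under the same assumptions as the earlier snippets:

# Leader-side sketch: advance commitIndex to the highest index replicated on a majority,
# but only if that entry belongs to the leader's current term.

def advance_commit_index(node, own_last_index):
    # matchIndex of every follower, plus the leader's own last log index.
    indexes = sorted(list(node.match_index.values()) + [own_last_index], reverse=True)
    majority_index = indexes[len(indexes) // 2]        # stored on a strict majority
    entry = next((e for e in node.log if e["index"] == majority_index), None)
    if entry and entry["term"] == node.current_term and majority_index > node.commit_index:
        node.commit_index = majority_index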

Each server, leader and followers alike, applies committed entries to its state machine in order, updating lastApplied. The state machine itself is completely application‑specific; Raft only guarantees the order and durability of the commands.
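
The apply loop itself is identical on every server; a minimal sketch, reusing the apply_command helper from the state machine example near the top of this article:

# Apply newly committed entries in log order, advancing lastApplied.

def apply_committed(node, state):
    while node.last_applied < node.commit_index:
        node.last_applied += 1
        entry = next(e for e in node.log if e["index"] == node.last_applied)
        apply_command(state, entry["command"])   # apply_command from the earlier sketch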

Safety Proof Sketch

Raft’s safety hinges on two invariants:

  1. Leader Completeness – If a log entry is committed in term t, then that entry is present in the logs of all leaders for terms ≥ t.
  2. Log Matching – If two entries have the same index and term, then they are identical.

Because leaders only commit entries from their own term (and because a majority of nodes must have stored the entry), any subsequent leader must have received that entry during its election (the up‑to‑date check). Thus the entry cannot be overwritten, preserving consistency.

Membership Changes and Joint Consensus

Real clusters need to add or remove servers without sacrificing availability. Raft handles this via a two‑phase configuration change known as joint consensus.

The Two‑Phase Process

  1. Enter joint configuration – The cluster temporarily runs with a combined configuration consisting of the old and new server sets. Both configurations must achieve a majority for progress.
  2. Commit the new configuration – Once the joint configuration is committed, the cluster transitions to the new configuration alone.

This approach guarantees that there is always at least one majority intersecting the old and new sets, preventing split‑brain scenarios.
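
The heart of joint consensus is the quorum test: while the joint configuration is in effect, a vote or commit needs a majority of the old set and a majority of the new set. A sketch of that check (the set representation and server names are illustrative):

# Sketch of the quorum test during a configuration change.

def has_quorum(acks, old_servers, new_servers=None):
    def majority(servers):
        return len(acks & servers) > len(servers) // 2
    if new_servers is None:                        # ordinary single-configuration quorum
        return majority(old_servers)
    return majority(old_servers) and majority(new_servers)   # joint consensus

# Example: growing {A, B, C} into {A, B, C, D}
old = {"A", "B", "C"}
new = {"A", "B", "C", "D"}
print(has_quorum({"A", "B"}, old, new))        # False: only 2 of the 4 new servers
print(has_quorum({"A", "B", "D"}, old, new))   # True: 2 of 3 old and 3 of 4 new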

Example

Suppose we have a 3‑node cluster {A, B, C} and want to add node D. The steps are:

  1. Enter the joint configuration C_old,new: old = {A, B, C}, new = {A, B, C, D}. While this configuration is in effect, every election and commit requires a majority of {A, B, C} and a majority of {A, B, C, D}.
  2. Once the joint configuration entry is committed, the leader appends and commits the final configuration {A, B, C, D}; from then on only a majority of the new set is required.

If a failure occurs during the transition, the joint configuration still requires agreement from a majority of the original three nodes, so any decision necessarily overlaps the old cluster, preserving safety.

Comparing Raft to Other Consensus Protocols

Raft is often contrasted with Paxos, the seminal consensus algorithm. While Paxos is mathematically elegant, its original description mixes multiple responsibilities, making it hard to implement correctly. Raft separates concerns:

Aspect                | Raft                                               | Paxos (Classic)
Leader role           | Explicit, always present (except during election)  | Implicit; any proposer can act
Log replication       | Integrated with leader heartbeats                   | Separate from consensus steps
Configuration changes | Joint consensus (two‑phase)                         | No built‑in mechanism; external tooling required
Understandability     | Designed for readability (paper + Raft website)     | Historically considered “hard to understand”

Other modern protocols, such as Viewstamped Replication and Zab (used by Apache ZooKeeper), share many ideas with Raft but differ in details like heartbeat intervals, log compaction, or recovery strategies. In practice, Raft’s clear specification has led to widespread adoption in systems like etcd, Consul, CockroachDB, and TiDB.

Key Takeaways

  • Raft provides a leader‑based approach to consensus that separates election, log replication, and membership changes into simple, well‑defined rules.
  • Terms and the up‑to‑date vote check ensure that only a node with the most complete log can become leader, preserving safety.
  • Log entries are committed only after being stored on a majority of nodes and belonging to the leader’s current term, preventing stale entries from being applied.
  • Joint consensus enables safe configuration changes without risking split‑brain scenarios.
  • Compared to Paxos, Raft trades some theoretical elegance for practical understandability, which has driven its adoption in many production systems.

Further Reading