TL;DR — Raft turns a cluster of unreliable nodes into a single, strongly consistent state machine by electing a leader, replicating logs, and enforcing safety rules. By following the patterns outlined here you can build a high‑availability service that tolerates failures, partitions, and rolling upgrades while keeping the system’s behavior deterministic.
Distributed systems that need to keep a single source of truth—such as key‑value stores, coordination services, or replicated databases—rely on state machine replication (SMR). SMR guarantees that every correct node applies the same sequence of commands, yielding identical states. The missing piece is consensus: a way for nodes to agree on that command order even when some of them crash or messages are delayed. The Raft algorithm, introduced in 2014, offers a clear, understandable approach to consensus, making it a popular choice for production systems like etcd, Consul, and TiDB.
In this article we will:
- Review the fundamentals of SMR and why consensus matters.
- Walk through the three pillars of Raft—leader election, log replication, and safety.
- Discuss architectural decisions that affect high availability (HA) such as storage durability, network partition handling, and cluster scaling.
- Provide concrete code snippets (in Go) that illustrate a minimal Raft node and the integration points you’ll need in a real service.
- Outline testing, observability, and deployment strategies that keep your Raft‑based service reliable at scale.
By the end you should be able to design, implement, and operate a Raft‑backed service that meets enterprise HA requirements.
Foundations of State Machine Replication
Why Consensus Matters
State machine replication works only if all non‑faulty replicas apply the exact same sequence of operations. Without a consensus protocol, two nodes might diverge because:
- Network delays cause messages to arrive out of order.
- Node crashes cause a subset of the cluster to miss a command.
- Leader churn lets two different nodes think they are the authority at the same time.
When divergence occurs, the system can no longer guarantee correctness; reads become nondeterministic, and writes may violate invariants. Consensus eliminates these gaps by providing a single, total order of commands that every correct node must follow.
The SMR Model
In SMR, the system is modeled as a deterministic state machine:
- Input: a command (e.g., “PUT key=value”).
- Transition function: deterministic logic that updates internal state.
- Output: optional response (e.g., success/failure).
If every replica receives the same ordered list of commands, the deterministic transition function guarantees identical outputs and final states. Raft’s job is to ensure that list is built reliably despite failures.
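To make this concrete, here is a minimal sketch of a deterministic key‑value state machine in Go; the Command shape is an illustrative assumption, not a fixed format:
type Command struct {
	Op    string // "PUT", "DEL", or "GET"
	Key   string
	Value string
}
// KVStateMachine is deterministic: the same sequence of Apply calls
// always yields the same state and outputs on every replica.
type KVStateMachine struct {
	data map[string]string
}
func NewKVStateMachine() *KVStateMachine {
	return &KVStateMachine{data: make(map[string]string)}
}
// Apply executes one command and returns its output.
func (m *KVStateMachine) Apply(cmd Command) (string, bool) {
	switch cmd.Op {
	case "PUT":
		m.data[cmd.Key] = cmd.Value
		return cmd.Value, true
	case "DEL":
		delete(m.data, cmd.Key)
		return "", true
	case "GET":
		v, ok := m.data[cmd.Key]
		return v, ok
	}
	return "", false
}
Note that Apply uses no wall clock, randomness, or I/O; any such nondeterminism would let replicas diverge even when they agree on the command order.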
The Raft Algorithm Overview
Raft divides consensus into three relatively independent sub‑problems, each mapped to a clear module in code:
- Leader Election – a single node (the leader) is responsible for receiving client requests and coordinating replication.
- Log Replication – the leader appends client commands to its own log and replicates those entries to followers.
- Safety – Raft enforces invariants that prevent conflicting logs and guarantee that committed entries are never lost.
These modules interact through a small set of RPCs: RequestVote, AppendEntries, and InstallSnapshot. The protocol proceeds in terms (monotonically increasing integers). Each term begins with an election; if a leader wins a majority, it serves until it fails or a higher‑term election supersedes it.
Leader Election
When a node starts, it begins as a follower. If it does not hear from a valid leader within an election timeout (randomized, commonly between 150 and 300 ms as suggested in the Raft paper), it becomes a candidate, increments its term, and sends RequestVote to all other nodes. A candidate becomes leader if it receives votes from a majority of the cluster.
Key safety rule: a node will grant its vote only once per term, and only to a candidate whose log is at least as up‑to‑date as its own. This prevents a split‑brain scenario where two leaders think they have authority.
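In code, the "at least as up‑to‑date" comparison from the Raft paper (§5.4.1) reduces to a two‑field check on the candidate's and voter's last log entries:
// logIsUpToDate reports whether a candidate's log (described by the
// LastLogTerm/LastLogIndex fields of its RequestVote) is at least as
// up-to-date as ours: the later term wins; equal terms compare length.
func logIsUpToDate(candLastTerm, candLastIdx, ourLastTerm, ourLastIdx uint64) bool {
	if candLastTerm != ourLastTerm {
		return candLastTerm > ourLastTerm
	}
	return candLastIdx >= ourLastIdx
}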
Log Replication
The leader receives client commands, appends them to its local log, and issues AppendEntries RPCs to followers. Each AppendEntries contains:
- prevLogIndex and prevLogTerm – the index and term of the entry that should immediately precede the new entries.
- A slice of new log entries.
- leaderCommit – the index of the highest log entry known to be committed.
Followers verify the prevLog* values; if they don’t match, the follower rejects the RPC, prompting the leader to backtrack and resend earlier entries. Once a log entry is stored on a majority of nodes, the leader marks it committed and applies it to its state machine, then notifies followers of the new commit index.
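The follower side of this exchange does not appear in the skeleton later in this article, so here is a hedged sketch of an AppendEntries handler, assuming the RaftNode type and helpers defined in the implementation section:
// HandleAppendEntries sketches the follower-side consistency check.
func (rn *RaftNode) HandleAppendEntries(req AppendEntries, resp *AppendEntriesResponse) {
	rn.mu.Lock()
	defer rn.mu.Unlock()
	// Reject leaders from stale terms.
	if req.Term < rn.ps.CurrentTerm {
		resp.Term, resp.Success = rn.ps.CurrentTerm, false
		return
	}
	if req.Term > rn.ps.CurrentTerm {
		rn.ps.CurrentTerm, rn.ps.VotedFor = req.Term, ""
		rn.savePersistentState()
	}
	rn.state = "follower"
	rn.resetElectionTimer()
	resp.Term = rn.ps.CurrentTerm
	// Consistency check: the entry at PrevLogIndex must carry PrevLogTerm.
	if req.PrevLogIndex > rn.lastLogIndex() || rn.termAt(req.PrevLogIndex) != req.PrevLogTerm {
		resp.Success = false // leader will decrement nextIndex and retry
		return
	}
	// Simplification: drop everything after PrevLogIndex and append. A
	// production handler truncates only at the first conflicting entry
	// (Raft paper §5.3), so stale, reordered RPCs cannot discard entries.
	rn.log = append(rn.log[:req.PrevLogIndex], req.Entries...)
	if req.LeaderCommit > rn.commitIndex {
		if last := rn.lastLogIndex(); req.LeaderCommit < last {
			rn.commitIndex = req.LeaderCommit
		} else {
			rn.commitIndex = last
		}
		rn.applyEntries()
	}
	resp.Success = true
}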
Safety Guarantees
Raft guarantees three safety properties:
- Election Safety – at most one leader per term.
- Log Matching – if two logs contain an entry with the same index and term, then the entries are identical and all preceding entries are also identical.
- Leader Completeness – a leader must contain all entries that were committed in previous terms.
These properties are enforced purely by the voting rules and the AppendEntries consistency check, eliminating the need for complex conflict resolution logic.
Designing a Raft‑Based Service
Implementing Raft is not just about copying the algorithm; you must embed it in a service architecture that meets HA goals.
Choosing the Right Storage Backend
Raft persists two critical pieces of data:
- Current term & voted‑for – small metadata needed for election safety.
- Log entries – potentially large, sequential records.
Options:
| Backend | Pros | Cons |
|---|---|---|
| Write‑Ahead Log (WAL) on local SSD | Low latency, simple to implement (e.g., BoltDB, Badger) | Limited durability if the node loses power before a flush |
| Embedded relational DB (SQLite) | ACID guarantees, mature tooling | Higher CPU overhead, larger binary |
| Distributed object store (S3, GCS) for snapshots | Off‑node durability, easy backup | Higher latency, eventual consistency concerns for snapshots only |
A common pattern is to store the term/vote and log in a fast local KV store, while periodically taking snapshots of the state machine and uploading them to durable object storage. Snapshots truncate the log, keeping storage bounded.
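For the small term/vote metadata, the key requirement is that a crash never leaves a torn write. One minimal pattern, sketched here against the PersistentState type from the implementation section, is write‑temp, fsync, then rename:
// saveHardState persists term/vote atomically: a rename either fully
// replaces the old file or leaves it untouched. A fully crash-safe
// version also fsyncs the parent directory after the rename.
func saveHardState(path string, st PersistentState) error {
	data, err := json.Marshal(st)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // force the data to stable storage
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}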
Handling Network Partitions
Raft tolerates minority partitions automatically: the partition without a majority cannot elect a leader, so its nodes stay followers and reject client writes. The majority side continues serving requests.
However, you must consider client routing. If a client is connected to a minority node, it will experience timeouts. Strategies:
- Client‑side retry with backoff – let the client reconnect to another node.
- Load balancer that health‑checks leader status – e.g., a sidecar that forwards only to the current leader.
- Read‑only quorum – allow followers to serve stale reads if the application can tolerate slightly out‑of‑date data (Raft supports ReadIndex for this; see the sketch below).
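To illustrate the last strategy, here is a hedged sketch of the leader‑side ReadIndex flow (imports: context, errors, time). The confirmLeadership helper is hypothetical: it must hear from a majority of peers at the current term before the read may proceed:
// ReadIndex serves a linearizable read without appending a log entry:
// capture the commit index, re-confirm leadership with a quorum, then
// wait until the state machine has applied at least that far.
func (rn *RaftNode) ReadIndex(ctx context.Context) (uint64, error) {
	rn.mu.Lock()
	if rn.state != "leader" {
		rn.mu.Unlock()
		return 0, errors.New("not leader")
	}
	readIdx := rn.commitIndex
	rn.mu.Unlock()

	if err := rn.confirmLeadership(ctx); err != nil { // hypothetical quorum round-trip
		return 0, err
	}
	for {
		rn.mu.Lock()
		applied := rn.lastApplied
		rn.mu.Unlock()
		if applied >= readIdx {
			return readIdx, nil
		}
		select {
		case <-ctx.Done():
			return 0, ctx.Err()
		case <-time.After(time.Millisecond):
		}
	}
}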
Scaling the Cluster
Raft’s safety depends on a majority quorum, so the cluster size is usually odd (3, 5, 7). Adding nodes improves fault tolerance but also increases the commit latency because the leader must wait for acknowledgments from a larger majority.
To scale read throughput without hurting write latency, you can:
- Use ReadIndex or lease‑based reads, which answer queries without appending an entry to the log.
- Deploy read‑only replicas that accept ReadIndex requests but never become leaders (by disabling voting).
If you need massive write scalability, you may need to shard the keyspace and run multiple independent Raft groups, each responsible for a subset of data (as done by TiDB and CockroachDB).
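As a minimal illustration of the routing layer such sharding requires, a hash of the key can pick the Raft group. Real systems use range‑based sharding with dynamic splits, so treat this as a starting point only:
import "hash/fnv"

// groupFor maps a key onto one of numGroups independent Raft groups.
func groupFor(key string, numGroups int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // hash.Hash writes never return an error
	return int(h.Sum32()) % numGroups
}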
Implementing Raft in Code
Below is a minimal, production‑oriented skeleton in Go. It omits many optimizations (e.g., batching, snapshots) but shows the core flow.
package raft
import (
	"encoding/json"
	"math/rand"
	"os"
	"sync"
	"time"
)
// Persistent state ---------------------------------------------------------
type PersistentState struct {
CurrentTerm uint64 `json:"current_term"`
VotedFor string `json:"voted_for"` // empty if none
}
// Log entry ---------------------------------------------------------------
type LogEntry struct {
Term uint64 `json:"term"`
Index uint64 `json:"index"`
Command []byte `json:"command"` // opaque application command
}
// RaftNode ---------------------------------------------------------------
type RaftNode struct {
mu sync.Mutex
id string
peers []string
state string // "follower", "candidate", "leader"
ps PersistentState
log []LogEntry
commitIndex uint64
lastApplied uint64
nextIndex map[string]uint64 // for each follower
matchIndex map[string]uint64 // for each follower
applyCh chan<- LogEntry
rpc *RPCServer
electionTimer *time.Timer
}
// NewRaftNode creates a node with persisted state loaded from storage.
func NewRaftNode(id string, peers []string, applyCh chan<- LogEntry) *RaftNode {
rn := &RaftNode{
id: id,
peers: peers,
state: "follower",
applyCh: applyCh,
nextIndex: make(map[string]uint64),
matchIndex: make(map[string]uint64),
}
rn.loadPersistentState()
rn.resetElectionTimer()
return rn
}
// ---------------------------------------------------------------------
// Persistent state handling (simplified: JSON file on disk)
func (rn *RaftNode) loadPersistentState() {
// In production use a WAL or RocksDB. Here we just read a JSON file.
data, err := os.ReadFile("raft_state_" + rn.id + ".json")
if err == nil {
json.Unmarshal(data, &rn.ps)
}
}
func (rn *RaftNode) savePersistentState() {
data, _ := json.Marshal(rn.ps)
os.WriteFile("raft_state_"+rn.id+".json", data, 0600)
}
// ---------------------------------------------------------------------
// Election logic
func (rn *RaftNode) resetElectionTimer() {
if rn.electionTimer != nil {
rn.electionTimer.Stop()
}
timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
rn.electionTimer = time.AfterFunc(timeout, rn.startElection)
}
func (rn *RaftNode) startElection() {
rn.mu.Lock()
rn.state = "candidate"
rn.ps.CurrentTerm++
rn.ps.VotedFor = rn.id
rn.savePersistentState()
	term := rn.ps.CurrentTerm
	lastIdx := rn.lastLogIndex()
	lastTerm := rn.lastLogTerm()
	rn.mu.Unlock()
votes := 1 // vote for self
var mu sync.Mutex
var wg sync.WaitGroup
for _, peer := range rn.peers {
wg.Add(1)
go func(p string) {
defer wg.Done()
req := RequestVote{
Term: term,
CandidateId: rn.id,
				LastLogIndex: lastIdx,
				LastLogTerm:  lastTerm,
}
var resp RequestVoteResponse
if err := rn.rpc.Call(p, "Raft.RequestVote", req, &resp); err == nil {
if resp.VoteGranted && resp.Term == term {
mu.Lock()
votes++
mu.Unlock()
} else if resp.Term > term {
rn.mu.Lock()
rn.ps.CurrentTerm = resp.Term
rn.ps.VotedFor = ""
rn.state = "follower"
rn.savePersistentState()
rn.mu.Unlock()
}
}
}(peer)
}
wg.Wait()
rn.mu.Lock()
defer rn.mu.Unlock()
if rn.state == "candidate" && votes > len(rn.peers)/2 {
rn.becomeLeader()
} else {
rn.state = "follower"
rn.resetElectionTimer()
}
}
func (rn *RaftNode) becomeLeader() {
	rn.state = "leader"
	rn.electionTimer.Stop() // leaders send heartbeats instead of running an election timeout
	// Initialise nextIndex & matchIndex for each follower
	lastIdx := rn.lastLogIndex()
	for _, p := range rn.peers {
		rn.nextIndex[p] = lastIdx + 1
		rn.matchIndex[p] = 0
	}
	go rn.heartbeatLoop()
}
// ---------------------------------------------------------------------
// Heartbeat / AppendEntries
func (rn *RaftNode) heartbeatLoop() {
ticker := time.NewTicker(50 * time.Millisecond)
defer ticker.Stop()
for {
rn.mu.Lock()
if rn.state != "leader" {
rn.mu.Unlock()
return
}
term := rn.ps.CurrentTerm
rn.mu.Unlock()
for _, peer := range rn.peers {
go rn.replicateTo(peer, term)
}
<-ticker.C
}
}
func (rn *RaftNode) replicateTo(peer string, term uint64) {
rn.mu.Lock()
prevIdx := rn.nextIndex[peer] - 1
prevTerm := rn.termAt(prevIdx)
entries := rn.log[prevIdx:] // may be empty (heartbeat)
req := AppendEntries{
Term: term,
LeaderId: rn.id,
PrevLogIndex: prevIdx,
PrevLogTerm: prevTerm,
Entries: entries,
LeaderCommit: rn.commitIndex,
}
rn.mu.Unlock()
var resp AppendEntriesResponse
if err := rn.rpc.Call(peer, "Raft.AppendEntries", req, &resp); err != nil {
return
}
rn.mu.Lock()
defer rn.mu.Unlock()
if resp.Success {
rn.nextIndex[peer] = req.PrevLogIndex + uint64(len(req.Entries)) + 1
rn.matchIndex[peer] = rn.nextIndex[peer] - 1
rn.updateCommitIndex()
} else if resp.Term > term {
rn.ps.CurrentTerm = resp.Term
rn.state = "follower"
rn.ps.VotedFor = ""
rn.savePersistentState()
} else {
// Decrement nextIndex and retry on next heartbeat
if rn.nextIndex[peer] > 1 {
rn.nextIndex[peer]--
}
}
}
// ---------------------------------------------------------------------
// Commit handling
func (rn *RaftNode) updateCommitIndex() {
// Find the highest index i such that i > commitIndex,
// a majority of matchIndex[] ≥ i, and log[i].term == currentTerm.
for i := rn.lastLogIndex(); i > rn.commitIndex; i-- {
count := 1 // leader itself
for _, p := range rn.peers {
if rn.matchIndex[p] >= i {
count++
}
}
if count > len(rn.peers)/2 && rn.log[i-1].Term == rn.ps.CurrentTerm {
rn.commitIndex = i
rn.applyEntries()
break
}
}
}
func (rn *RaftNode) applyEntries() {
	// Called with rn.mu held. In production, deliver entries from a
	// dedicated goroutine (or via a buffered channel) so a slow state
	// machine cannot stall the replication path.
	for rn.lastApplied < rn.commitIndex {
		rn.lastApplied++
		entry := rn.log[rn.lastApplied-1]
		rn.applyCh <- entry // hand off to state machine
	}
}
// ---------------------------------------------------------------------
// Utility helpers
func (rn *RaftNode) lastLogIndex() uint64 {
return uint64(len(rn.log))
}
func (rn *RaftNode) lastLogTerm() uint64 {
if len(rn.log) == 0 {
return 0
}
return rn.log[len(rn.log)-1].Term
}
func (rn *RaftNode) termAt(idx uint64) uint64 {
if idx == 0 {
return 0
}
if idx-1 < uint64(len(rn.log)) {
return rn.log[idx-1].Term
}
return 0
}
// ---------------------------------------------------------------------
// RPC definitions (simplified)
type RequestVote struct {
Term uint64 `json:"term"`
CandidateId string `json:"candidate_id"`
LastLogIndex uint64 `json:"last_log_index"`
LastLogTerm uint64 `json:"last_log_term"`
}
type RequestVoteResponse struct {
Term uint64 `json:"term"`
VoteGranted bool `json:"vote_granted"`
}
type AppendEntries struct {
Term uint64 `json:"term"`
LeaderId string `json:"leader_id"`
PrevLogIndex uint64 `json:"prev_log_index"`
PrevLogTerm uint64 `json:"prev_log_term"`
Entries []LogEntry `json:"entries"`
LeaderCommit uint64 `json:"leader_commit"`
}
type AppendEntriesResponse struct {
Term uint64 `json:"term"`
Success bool `json:"success"`
}
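One gap in the skeleton: it references an RPCServer that is never defined. A minimal sketch over Go's standard net/rpc package (add "net/rpc" to the imports) might look like this:
// RPCServer is a thin wrapper over net/rpc that dials peers lazily and
// caches connections. Production code would add timeouts, reconnects
// on failure, and TLS; this only shows the shape the skeleton assumes.
type RPCServer struct {
	mu      sync.Mutex
	clients map[string]*rpc.Client
}

func NewRPCServer() *RPCServer {
	return &RPCServer{clients: make(map[string]*rpc.Client)}
}

// Call invokes method (e.g., "Raft.AppendEntries") on the peer at addr.
func (s *RPCServer) Call(addr, method string, args, reply interface{}) error {
	s.mu.Lock()
	client, ok := s.clients[addr]
	s.mu.Unlock()
	if !ok {
		c, err := rpc.Dial("tcp", addr)
		if err != nil {
			return err
		}
		s.mu.Lock()
		s.clients[addr] = c
		s.mu.Unlock()
		client = c
	}
	return client.Call(method, args, reply)
}
NewRaftNode would construct one of these and assign it to rn.rpc before the election timer starts. Wiring the node's handlers into net/rpc's Raft.RequestVote / Raft.AppendEntries naming (which requires exported methods that return error) is likewise left as an exercise.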
The snippet demonstrates the core loop: election timeout → candidate → vote gathering → leader → periodic heartbeats that replicate logs. Production implementations would add:
- Snapshot installation (InstallSnapshot RPC) to truncate logs.
- Batching of AppendEntries to reduce RPC overhead.
- TLS/mTLS for secure inter‑node communication.
- Metrics (e.g., Prometheus counters for term changes, election duration).
Integrating the State Machine
The applyCh channel in the example delivers committed LogEntry objects to the application layer. A simple key‑value store could decode the command and update an in‑memory map:
// applyLoop is the sole writer of store; add a mutex (or use sync.Map)
// if other goroutines read the map concurrently.
func applyLoop(applyCh <-chan raft.LogEntry, store map[string]string) {
for entry := range applyCh {
var cmd struct {
Op string `json:"op"`
Key string `json:"key"`
Value string `json:"value,omitempty"`
}
if err := json.Unmarshal(entry.Command, &cmd); err != nil {
log.Printf("invalid command: %v", err)
continue
}
switch cmd.Op {
case "SET":
store[cmd.Key] = cmd.Value
case "DEL":
delete(store, cmd.Key)
}
}
}
Because the state machine consumes only committed entries, every node applies the same commands in the same order, and the deterministic apply logic therefore produces identical state on all nodes.
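The skeleton also omits the write path. Here is a hedged sketch of the leader‑side entry point (import errors), assuming replication is picked up by the existing heartbeat loop; a real implementation would trigger replication immediately and block until the entry commits:
var ErrNotLeader = errors.New("raft: not the leader; redirect the client")

// Propose appends a client command to the leader's log; followers
// receive it on the next AppendEntries round.
func (rn *RaftNode) Propose(cmd []byte) (uint64, error) {
	rn.mu.Lock()
	defer rn.mu.Unlock()
	if rn.state != "leader" {
		return 0, ErrNotLeader
	}
	entry := LogEntry{
		Term:    rn.ps.CurrentTerm,
		Index:   rn.lastLogIndex() + 1,
		Command: cmd,
	}
	rn.log = append(rn.log, entry)
	return entry.Index, nil
}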
Testing and Observability
Unit‑Level Tests
- Election simulation – spin up 3 in‑process nodes, drop network messages, verify that at most one leader exists per term.
- Log mismatch handling – inject divergent logs on a follower and ensure the leader backtracks correctly (as described in the Raft paper’s Figure 2).
The open‑source project etcd/raft provides a rafttest harness that can be reused.
Integration Tests
- Network partition – use tc or a container network to split the cluster, then re‑join and verify that the minority side catches up via log replication.
- Snapshot recovery – force a large log, trigger a snapshot, kill a node, restart it, and confirm it loads the snapshot and continues replicating.
Observability
Expose the following Prometheus metrics:
- raft_current_term
- raft_state{node="id",state="leader|follower|candidate"}
- raft_election_timeout_seconds
- raft_commit_index
- raft_applied_index
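A sketch of exposing a few of these with prometheus/client_golang; the metric names match the list above, and updating the gauges from the Raft code paths is omitted:
import "github.com/prometheus/client_golang/prometheus"

var (
	currentTerm = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "raft_current_term",
		Help: "The node's current Raft term.",
	})
	raftState = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "raft_state",
		Help: "Set to 1 for the node's current role, 0 otherwise.",
	}, []string{"node", "state"})
	commitIndex = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "raft_commit_index",
		Help: "Highest log index known to be committed.",
	})
)

func init() {
	prometheus.MustRegister(currentTerm, raftState, commitIndex)
}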
Add structured logs for term changes and leader transitions, e.g.:
time="2026-05-12T19:05:00Z" level=info msg="became leader" term=7 node=node-3
These logs help operators diagnose split‑brain incidents quickly.
Deployment Considerations for High Availability
Node Placement
- Spread across failure domains (different racks, zones, or cloud regions) to avoid correlated failures.
- Use anti‑affinity rules in Kubernetes (podAntiAffinity) so that no two Raft nodes land on the same host.
Rolling Upgrades
Because Raft tolerates the loss of a minority, you can upgrade nodes one at a time:
1. Drain the node from the service (stop accepting client traffic).
2. Wait for the node to step down to follower (if it was the leader, a new election occurs).
3. Upgrade the binary and restart.
4. Verify the node rejoins the cluster and catches up.
Automation tools like Argo Rollouts or Spinnaker can enforce the “wait for healthy majority” rule.
Disaster Recovery
- Backup snapshots to an off‑site bucket (e.g., AWS S3) daily.
- Store term/vote metadata in a highly durable KV store (e.g., etcd) to survive total data‑center loss.
- Keep a cold standby cluster in another region; on failover, promote it by copying the latest snapshot and restarting nodes with the same cluster ID.
Security
- Enable mutual TLS for all Raft RPCs; each node holds a certificate signed by a private CA.
- Rotate certificates periodically (e.g., every 90 days) and reload without restarting the process.
- Restrict RPC ports (default 5000‑5002) to the internal network using firewall rules.
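As a sketch of the first two points, Go's standard library can build a mutual‑TLS config for the RPC listener; the file paths and single‑CA layout here are assumptions:
import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"os"
)

// newMTLSConfig both presents this node's certificate and requires
// peers to present one signed by the same private CA.
func newMTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("failed to parse CA certificate")
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		ClientAuth:   tls.RequireAndVerifyClientCert,
		MinVersion:   tls.VersionTLS12,
	}, nil
}
A server would wrap its listener with tls.NewListener(l, cfg); rotation without a restart can be handled by serving certificates through tls.Config's GetCertificate callback instead of the static Certificates slice.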
Key Takeaways
- Raft provides a clear, provably safe way to achieve consensus for state‑machine replication, making it ideal for HA services.
- Leader election, log replication, and safety invariants are isolated concerns that map cleanly to separate code modules.
- Durable storage of term/vote metadata and log entries, combined with periodic snapshots, keeps the system resilient to crashes and disk failures.
- Network partitions are handled automatically by the majority quorum rule; client routing and read‑only quorum strategies improve user experience during outages.
- Production‑grade Raft deployments require careful attention to testing (unit, integration, chaos), observability (metrics, logs), and operational practices (rolling upgrades, cross‑zone placement, security).
Further Reading
- Raft Consensus Algorithm (paper) – the original academic description by Ongaro & Ousterhout.
- etcd/raft repository – a battle‑tested Go implementation with extensive tests.
- Consul Service Mesh documentation – shows how HashiCorp uses Raft for leader election and KV storage.