TL;DR — Raft turns a cluster of unreliable nodes into a single, strongly consistent state machine by electing a leader, replicating logs, and enforcing safety rules. By following the patterns outlined here you can build a high‑availability service that tolerates failures, partitions, and rolling upgrades while keeping the system’s behavior deterministic.
Distributed systems that need to keep a single source of truth—such as key‑value stores, coordination services, or replicated databases—rely on state machine replication (SMR). SMR guarantees that every correct node applies the same sequence of commands, yielding identical states. The missing piece is consensus: a way for nodes to agree on that command order even when some of them crash or messages are delayed. The Raft algorithm, introduced in 2014, offers a clear, understandable approach to consensus, making it a popular choice for production systems like etcd, Consul, and TiDB.
In this article we will:
- Review the fundamentals of SMR and why consensus matters.
- Walk through the three pillars of Raft—leader election, log replication, and safety.
- Discuss architectural decisions that affect high availability (HA) such as storage durability, network partition handling, and cluster scaling.
- Provide concrete code snippets (in Go) that illustrate a minimal Raft node and the integration points you’ll need in a real service.
- Outline testing, observability, and deployment strategies that keep your Raft‑based service reliable at scale.
By the end you should be able to design, implement, and operate a Raft‑backed service that meets enterprise HA requirements.
Foundations of State Machine Replication
Why Consensus Matters
State machine replication works only if all non‑faulty replicas apply the exact same sequence of operations. Without a consensus protocol, two nodes might diverge because:
- Network delays cause messages to arrive out of order.
- Node crashes cause a subset of the cluster to miss a command.
- Leader churn lets two different nodes think they are the authority at the same time.
When divergence occurs, the system can no longer guarantee correctness; reads become nondeterministic, and writes may violate invariants. Consensus eliminates these gaps by providing a single, total order of commands that every correct node must follow.
The SMR Model
In SMR, the system is modeled as a deterministic state machine:
- Input: a command (e.g., “PUT key=value”).
- Transition function: deterministic logic that updates internal state.
- Output: optional response (e.g., success/failure).
If every replica receives the same ordered list of commands, the deterministic transition function guarantees identical outputs and final states. Raft’s job is to ensure that list is built reliably despite failures.
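To make this concrete, here is a minimal sketch of a deterministic key‑value state machine in Go; the Command shape is an illustrative assumption, not a fixed format:
type Command struct {
	Op    string // "PUT", "DEL", or "GET"
	Key   string
	Value string
}
// KVStateMachine is deterministic: the same sequence of Apply calls
// always yields the same state and outputs on every replica.
type KVStateMachine struct {
	data map[string]string
}
func NewKVStateMachine() *KVStateMachine {
	return &KVStateMachine{data: make(map[string]string)}
}
// Apply executes one command and returns its output.
func (m *KVStateMachine) Apply(cmd Command) (string, bool) {
	switch cmd.Op {
	case "PUT":
		m.data[cmd.Key] = cmd.Value
		return cmd.Value, true
	case "DEL":
		delete(m.data, cmd.Key)
		return "", true
	case "GET":
		v, ok := m.data[cmd.Key]
		return v, ok
	}
	return "", false
}
Note that Apply uses no wall clock, randomness, or I/O; any such nondeterminism would let replicas diverge even when they agree on the command order.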
The Raft Algorithm Overview
Raft divides consensus into three relatively independent sub‑problems, each mapped to a clear module in code:
- Leader Election – a single node (the leader) is responsible for receiving client requests and coordinating replication.
- Log Replication – the leader appends client commands to its own log and replicates those entries to followers.
- Safety – Raft enforces invariants that prevent conflicting logs and guarantee that committed entries are never lost.
These modules interact through a small set of RPCs: RequestVote, AppendEntries, and InstallSnapshot. The protocol proceeds in terms (monotonically increasing integers). Each term begins with an election; if a leader wins a majority, it serves until it fails or a higher‑term election supersedes it.
Leader Election
When a node starts, it begins as a follower. If it does not hear from a valid leader within an election timeout (randomized, commonly between 150 and 300 ms as suggested in the Raft paper), it becomes a candidate, increments its term, and sends RequestVote to all other nodes. A candidate becomes leader if it receives votes from a majority of the cluster.
Key safety rule: a node will grant its vote only once per term, and only to a candidate whose log is at least as up‑to‑date as its own. This prevents a split‑brain scenario where two leaders think they have authority.
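In code, the "at least as up‑to‑date" comparison from the Raft paper (§5.4.1) reduces to a two‑field check on the candidate's and voter's last log entries:
// logIsUpToDate reports whether a candidate's log (described by the
// LastLogTerm/LastLogIndex fields of its RequestVote) is at least as
// up-to-date as ours: the later term wins; equal terms compare length.
func logIsUpToDate(candLastTerm, candLastIdx, ourLastTerm, ourLastIdx uint64) bool {
	if candLastTerm != ourLastTerm {
		return candLastTerm > ourLastTerm
	}
	return candLastIdx >= ourLastIdx
}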
Log Replication
The leader receives client commands, appends them to its local log, and issues AppendEntries RPCs to followers. Each AppendEntries contains:
- prevLogIndex and prevLogTerm – the index and term of the entry that should immediately precede the new entries.
- A slice of new log entries.
- leaderCommit – the index of the highest log entry known to be committed.
Followers verify the prevLog* values; if they don’t match, the follower rejects the RPC, prompting the leader to backtrack and resend earlier entries. Once a log entry is stored on a majority of nodes, the leader marks it committed and applies it to its state machine, then notifies followers of the new commit index.
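The follower side of this exchange does not appear in the skeleton later in this article, so here is a hedged sketch of an AppendEntries handler, assuming the RaftNode type and helpers defined in the implementation section:
// HandleAppendEntries sketches the follower-side consistency check.
func (rn *RaftNode) HandleAppendEntries(req AppendEntries, resp *AppendEntriesResponse) {
	rn.mu.Lock()
	defer rn.mu.Unlock()
	// Reject leaders from stale terms.
	if req.Term < rn.ps.CurrentTerm {
		resp.Term, resp.Success = rn.ps.CurrentTerm, false
		return
	}
	if req.Term > rn.ps.CurrentTerm {
		rn.ps.CurrentTerm, rn.ps.VotedFor = req.Term, ""
		rn.savePersistentState()
	}
	rn.state = "follower"
	rn.resetElectionTimer()
	resp.Term = rn.ps.CurrentTerm
	// Consistency check: the entry at PrevLogIndex must carry PrevLogTerm.
	if req.PrevLogIndex > rn.lastLogIndex() || rn.termAt(req.PrevLogIndex) != req.PrevLogTerm {
		resp.Success = false // leader will decrement nextIndex and retry
		return
	}
	// Simplification: drop everything after PrevLogIndex and append. A
	// production handler truncates only at the first conflicting entry
	// (Raft paper §5.3), so stale, reordered RPCs cannot discard entries.
	rn.log = append(rn.log[:req.PrevLogIndex], req.Entries...)
	if req.LeaderCommit > rn.commitIndex {
		if last := rn.lastLogIndex(); req.LeaderCommit < last {
			rn.commitIndex = req.LeaderCommit
		} else {
			rn.commitIndex = last
		}
		rn.applyEntries()
	}
	resp.Success = true
}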
Safety Guarantees
Raft guarantees three safety properties:
- Election Safety – at most one leader per term.
- Log Matching – if two logs contain an entry with the same index and term, then the entries are identical and all preceding entries are also identical.
- Leader Completeness – a leader must contain all entries that were committed in previous terms.
These properties are enforced purely by the voting rules and the AppendEntries consistency check, eliminating the need for complex conflict resolution logic.
Designing a Raft‑Based Service
Implementing Raft is not just about copying the algorithm; you must embed it in a service architecture that meets HA goals.
Choosing the Right Storage Backend
Raft persists two critical pieces of data:
- Current term & voted‑for – small metadata needed for election safety.
- Log entries – potentially large, sequential records.
Options:
| Backend | Pros | Cons |
|---|---|---|
| Write‑Ahead Log (WAL) on local SSD | Low latency, simple to implement (e.g., BoltDB, Badger) | Limited durability if the node loses power before a flush |
| Embedded relational DB (SQLite) | ACID guarantees, mature tooling | Higher CPU overhead, larger binary |
| Distributed object store (S3, GCS) for snapshots | Off‑node durability, easy backup | Higher latency, eventual consistency concerns for snapshots only |
A common pattern is to store the term/vote and log in a fast local KV store, while periodically taking snapshots of the state machine and uploading them to durable object storage. Snapshots truncate the log, keeping storage bounded.
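For the small term/vote metadata, the key requirement is that a crash never leaves a torn write. One minimal pattern, sketched here against the PersistentState type from the implementation section, is write‑temp, fsync, then rename:
// saveHardState persists term/vote atomically: a rename either fully
// replaces the old file or leaves it untouched. A fully crash-safe
// version also fsyncs the parent directory after the rename.
func saveHardState(path string, st PersistentState) error {
	data, err := json.Marshal(st)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // force the data to stable storage
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}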
Handling Network Partitions
Raft tolerates minority partitions automatically: the partition without a majority cannot elect a leader, so its nodes stay followers and reject client writes. The majority side continues serving requests.
However, you must consider client routing. If a client is connected to a minority node, it will experience timeouts. Strategies:
- Client‑side retry with backoff – let the client reconnect to another node.
- Load balancer that health‑checks leader status – e.g., a sidecar that forwards only to the current leader.
- Read‑only quorum – allow followers to serve stale reads if the application can tolerate slightly out‑of‑date data (Raft supports ReadIndex for this; see the sketch below).
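To illustrate the last strategy, here is a hedged sketch of the leader‑side ReadIndex flow (imports: context, errors, time). The confirmLeadership helper is hypothetical: it must hear from a majority of peers at the current term before the read may proceed:
// ReadIndex serves a linearizable read without appending a log entry:
// capture the commit index, re-confirm leadership with a quorum, then
// wait until the state machine has applied at least that far.
func (rn *RaftNode) ReadIndex(ctx context.Context) (uint64, error) {
	rn.mu.Lock()
	if rn.state != "leader" {
		rn.mu.Unlock()
		return 0, errors.New("not leader")
	}
	readIdx := rn.commitIndex
	rn.mu.Unlock()

	if err := rn.confirmLeadership(ctx); err != nil { // hypothetical quorum round-trip
		return 0, err
	}
	for {
		rn.mu.Lock()
		applied := rn.lastApplied
		rn.mu.Unlock()
		if applied >= readIdx {
			return readIdx, nil
		}
		select {
		case <-ctx.Done():
			return 0, ctx.Err()
		case <-time.After(time.Millisecond):
		}
	}
}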
Scaling the Cluster
Raft’s safety depends on a majority quorum, so the cluster size is usually odd (3, 5, 7). Adding nodes improves fault tolerance but also increases the commit latency because the leader must wait for acknowledgments from a larger majority.
To scale read throughput without hurting write latency, you can:
- Use ReadIndex or lease‑based reads, which answer queries without appending an entry to the log.
- Deploy read‑only replicas that accept ReadIndex requests but never become leaders (by disabling voting).
If you need massive write scalability, you may need to shard the keyspace and run multiple independent Raft groups, each responsible for a subset of data (as done by TiDB and CockroachDB).
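As a minimal illustration of the routing layer such sharding requires, a hash of the key can pick the Raft group. Real systems use range‑based sharding with dynamic splits, so treat this as a starting point only:
import "hash/fnv"

// groupFor maps a key onto one of numGroups independent Raft groups.
func groupFor(key string, numGroups int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // hash.Hash writes never return an error
	return int(h.Sum32()) % numGroups
}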
Implementing Raft in Code
Below is a minimal, production‑oriented skeleton in Go. It omits many optimizations (e.g., batching, snapshots) but shows the core flow.
package raft
import (
	"encoding/json"
	"math/rand"
	"os"
	"sync"
	"time"
)
// Persistent state ---------------------------------------------------------
type PersistentState struct {
CurrentTerm uint64 `json:"current_term"`
VotedFor string `json:"voted_for"` // empty if none
}
// Log entry ---------------------------------------------------------------
type LogEntry struct {
Term uint64 `json:"term"`
Index uint64 `json:"index"`
Command []byte `json:"command"` // opaque application command
}
// RaftNode ---------------------------------------------------------------
type RaftNode struct {
mu sync.Mutex
id string
peers []string
state string // "follower", "candidate", "leader"
ps PersistentState
log []LogEntry
commitIndex uint64
lastApplied uint64
nextIndex map[string]uint64 // for each follower
matchIndex map[string]uint64 // for each follower
applyCh chan<- LogEntry
rpc *RPCServer
electionTimer *time.Timer
}
// NewRaftNode creates a node with persisted state loaded from storage.
func NewRaftNode(id string, peers []string, applyCh chan<- LogEntry) *RaftNode {
rn := &RaftNode{
id: id,
peers: peers,
state: "follower",
applyCh: applyCh,
nextIndex: make(map[string]uint64),
matchIndex: make(map[string]uint64),
}
rn.loadPersistentState()
rn.resetElectionTimer()
return rn
}
// ---------------------------------------------------------------------
// Persistent state handling (simplified: JSON file on disk)
func (rn *RaftNode) loadPersistentState() {
// In production use a WAL or RocksDB. Here we just read a JSON file.
data, err := os.ReadFile("raft_state_" + rn.id + ".json")
if err == nil {
json.Unmarshal(data, &rn.ps)
}
}
func (rn *RaftNode) savePersistentState() {
data, _ := json.Marshal(rn.ps)
os.WriteFile("raft_state_"+rn.id+".json", data, 0600)
}
// ---------------------------------------------------------------------
// Election logic
func (rn *RaftNode) resetElectionTimer() {
if rn.electionTimer != nil {
rn.electionTimer.Stop()
}
timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
rn.electionTimer = time.AfterFunc(timeout, rn.startElection)
}
func (rn *RaftNode) startElection() {
rn.mu.Lock()
rn.state = "candidate"
rn.ps.CurrentTerm++
rn.ps.VotedFor = rn.id
rn.savePersistentState()
	term := rn.ps.CurrentTerm
	lastIdx := rn.lastLogIndex()
	lastTerm := rn.lastLogTerm()
	rn.mu.Unlock()
votes := 1 // vote for self
var mu sync.Mutex
var wg sync.WaitGroup
for _, peer := range rn.peers {
wg.Add(1)
go func(p string) {
defer wg.Done()
req := RequestVote{
Term: term,
CandidateId: rn.id,
				LastLogIndex: lastIdx,
				LastLogTerm:  lastTerm,
}
var resp RequestVoteResponse
if err := rn.rpc.Call(p, "Raft.RequestVote", req, &resp); err == nil {
if resp.VoteGranted && resp.Term == term {
mu.Lock()
votes++
mu.Unlock()
} else if resp.Term > term {
rn.mu.Lock()
rn.ps.CurrentTerm = resp.Term
rn.ps.VotedFor = ""
rn.state = "follower"
rn.savePersistentState()
rn.mu.Unlock()
}
}
}(peer)
}
wg.Wait()
rn.mu.Lock()
defer rn.mu.Unlock()
if rn.state == "candidate" && votes > len(rn.peers)/2 {
rn.becomeLeader()
} else {
rn.state = "follower"
rn.resetElectionTimer()
}
}
func (rn *RaftNode) becomeLeader() {
	rn.state = "leader"
	rn.electionTimer.Stop() // leaders send heartbeats instead of running an election timeout
	// Initialise nextIndex & matchIndex for each follower
	lastIdx := rn.lastLogIndex()
	for _, p := range rn.peers {
		rn.nextIndex[p] = lastIdx + 1
		rn.matchIndex[p] = 0
	}
	go rn.heartbeatLoop()
}
// ---------------------------------------------------------------------
// Heartbeat / AppendEntries
func (rn *RaftNode) heartbeatLoop() {
ticker := time.NewTicker(50 * time.Millisecond)
defer ticker.Stop()
for {
rn.mu.Lock()
if rn.state != "leader" {
rn.mu.Unlock()
return
}
term := rn.ps.CurrentTerm
rn.mu.Unlock()
for _, peer := range rn.peers {
go rn.replicateTo(peer, term)
}
<-ticker.C
}
}
func (rn *RaftNode) replicateTo(peer string, term uint64) {
rn.mu.Lock()
prevIdx := rn.nextIndex[peer] - 1
prevTerm := rn.termAt(prevIdx)
entries := rn.log[prevIdx:] // may be empty (heartbeat)
req := AppendEntries{
Term: term,
LeaderId: rn.id,
PrevLogIndex: prevIdx,
PrevLogTerm: prevTerm,
Entries: entries,
LeaderCommit: rn.commitIndex,
}
rn.mu.Unlock()
var resp AppendEntriesResponse
if err := rn.rpc.Call(peer, "Raft.AppendEntries", req, &resp); err != nil {
return
}
rn.mu.Lock()
defer rn.mu.Unlock()
if resp.Success {
rn.nextIndex[peer] = req.PrevLogIndex + uint64(len(req.Entries)) + 1
rn.matchIndex[peer] = rn.nextIndex[peer] - 1
rn.updateCommitIndex()
} else if resp.Term > term {
rn.ps.CurrentTerm = resp.Term
rn.state = "follower"
rn.ps.VotedFor = ""
rn.savePersistentState()
} else {
// Decrement nextIndex and retry on next heartbeat
if rn.nextIndex[peer] > 1 {
rn.nextIndex[peer]--
}
}
}
// ---------------------------------------------------------------------
// Commit handling
func (rn *RaftNode) updateCommitIndex() {
// Find the highest index i such that i > commitIndex,
// a majority of matchIndex[] ≥ i, and log[i].term == currentTerm.
for i := rn.lastLogIndex(); i > rn.commitIndex; i-- {
count := 1 // leader itself
for _, p := range rn.peers {
if rn.matchIndex[p] >= i {
count++
}
}
if count > len(rn.peers)/2 && rn.log[i-1].Term == rn.ps.CurrentTerm {
rn.commitIndex = i
rn.applyEntries()
break
}
}
}
func (rn *RaftNode) applyEntries() {
	// Called with rn.mu held. In production, deliver entries from a
	// dedicated goroutine (or via a buffered channel) so a slow state
	// machine cannot stall the replication path.
	for rn.lastApplied < rn.commitIndex {
		rn.lastApplied++
		entry := rn.log[rn.lastApplied-1]
		rn.applyCh <- entry // hand off to state machine
	}
}
// ---------------------------------------------------------------------
// Utility helpers
func (rn *RaftNode) lastLogIndex() uint64 {
return uint64(len(rn.log))
}
func (rn *RaftNode) lastLogTerm() uint64 {
if len(rn.log) == 0 {
return 0
}
return rn.log[len(rn.log)-1].Term
}
func (rn *RaftNode) termAt(idx uint64) uint64 {
if idx == 0 {
return 0
}
if idx-1 < uint64(len(rn.log)) {
return rn.log[idx-1].Term
}
return 0
}
// ---------------------------------------------------------------------
// RPC definitions (simplified)
type RequestVote struct {
Term uint64 `json:"term"`
CandidateId string `json:"candidate_id"`
LastLogIndex uint64 `json:"last_log_index"`
LastLogTerm uint64 `json:"last_log_term"`
}
type RequestVoteResponse struct {
Term uint64 `json:"term"`
VoteGranted bool `json:"vote_granted"`
}
type AppendEntries struct {
Term uint64 `json:"term"`
LeaderId string `json:"leader_id"`
PrevLogIndex uint64 `json:"prev_log_index"`
PrevLogTerm uint64 `json:"prev_log_term"`
Entries []LogEntry `json:"entries"`
LeaderCommit uint64 `json:"leader_commit"`
}
type AppendEntriesResponse struct {
Term uint64 `json:"term"`
Success bool `json:"success"`
}
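One gap in the skeleton: it references an RPCServer that is never defined. A minimal sketch over Go's standard net/rpc package (add "net/rpc" to the imports) might look like this:
// RPCServer is a thin wrapper over net/rpc that dials peers lazily and
// caches connections. Production code would add timeouts, reconnects
// on failure, and TLS; this only shows the shape the skeleton assumes.
type RPCServer struct {
	mu      sync.Mutex
	clients map[string]*rpc.Client
}

func NewRPCServer() *RPCServer {
	return &RPCServer{clients: make(map[string]*rpc.Client)}
}

// Call invokes method (e.g., "Raft.AppendEntries") on the peer at addr.
func (s *RPCServer) Call(addr, method string, args, reply interface{}) error {
	s.mu.Lock()
	client, ok := s.clients[addr]
	s.mu.Unlock()
	if !ok {
		c, err := rpc.Dial("tcp", addr)
		if err != nil {
			return err
		}
		s.mu.Lock()
		s.clients[addr] = c
		s.mu.Unlock()
		client = c
	}
	return client.Call(method, args, reply)
}
NewRaftNode would construct one of these and assign it to rn.rpc before the election timer starts. Wiring the node's handlers into net/rpc's Raft.RequestVote / Raft.AppendEntries naming (which requires exported methods that return error) is likewise left as an exercise.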
The snippet demonstrates the core loop: election timeout → candidate → vote gathering → leader → periodic heartbeats that replicate logs. Production implementations would add:
- Snapshot installation (InstallSnapshot RPC) to truncate logs.
- Batching of AppendEntries to reduce RPC overhead.
- TLS/mTLS for secure inter‑node communication.
- Metrics (e.g., Prometheus counters for term changes, election duration).
Integrating the State Machine
The applyCh channel in the example delivers committed LogEntry objects to the application layer. A simple key‑value store could decode the command and update an in‑memory map:
// applyLoop is the sole writer of store; add a mutex (or use sync.Map)
// if other goroutines read the map concurrently.
func applyLoop(applyCh <-chan raft.LogEntry, store map[string]string) {
for entry := range applyCh {
var cmd struct {
Op string `json:"op"`
Key string `json:"key"`
Value string `json:"value,omitempty"`
}
if err := json.Unmarshal(entry.Command, &cmd); err != nil {
log.Printf("invalid command: %v", err)
continue
}
switch cmd.Op {
case "SET":
store[cmd.Key] = cmd.Value
case "DEL":
delete(store, cmd.Key)
}
}
}
Because the state machine consumes only committed entries, every node applies the same commands in the same order, and the deterministic apply logic therefore produces identical state on all nodes.
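The skeleton also omits the write path. Here is a hedged sketch of the leader‑side entry point (import errors), assuming replication is picked up by the existing heartbeat loop; a real implementation would trigger replication immediately and block until the entry commits:
var ErrNotLeader = errors.New("raft: not the leader; redirect the client")

// Propose appends a client command to the leader's log; followers
// receive it on the next AppendEntries round.
func (rn *RaftNode) Propose(cmd []byte) (uint64, error) {
	rn.mu.Lock()
	defer rn.mu.Unlock()
	if rn.state != "leader" {
		return 0, ErrNotLeader
	}
	entry := LogEntry{
		Term:    rn.ps.CurrentTerm,
		Index:   rn.lastLogIndex() + 1,
		Command: cmd,
	}
	rn.log = append(rn.log, entry)
	return entry.Index, nil
}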
Testing and Observability
Unit‑Level Tests
- Election simulation – spin up 3 in‑process nodes, drop network messages, verify that at most one leader exists per term.
- Log mismatch handling – inject divergent logs on a follower and ensure the leader backtracks correctly (as described in the Raft paper’s Figure 2).
The open‑source project etcd/raft provides a rafttest harness that can be reused.
Integration Tests
- Network partition – use tc or a container network to split the cluster, then re‑join and verify that the minority side catches up via log replication.
- Snapshot recovery – force a large log, trigger a snapshot, kill a node, restart it, and confirm it loads the snapshot and continues replicating.
Observability
Expose the following Prometheus metrics:
- raft_current_term
- raft_state{node="id",state="leader|follower|candidate"}
- raft_election_timeout_seconds
- raft_commit_index
- raft_applied_index
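A sketch of exposing a few of these with prometheus/client_golang; the metric names match the list above, and updating the gauges from the Raft code paths is omitted:
import "github.com/prometheus/client_golang/prometheus"

var (
	currentTerm = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "raft_current_term",
		Help: "The node's current Raft term.",
	})
	raftState = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "raft_state",
		Help: "Set to 1 for the node's current role, 0 otherwise.",
	}, []string{"node", "state"})
	commitIndex = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "raft_commit_index",
		Help: "Highest log index known to be committed.",
	})
)

func init() {
	prometheus.MustRegister(currentTerm, raftState, commitIndex)
}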
Add structured logs for term changes and leader transitions, e.g.:
time="2026-05-12T19:05:00Z" level=info msg="became leader" term=7 node=node-3
These logs help operators diagnose split‑brain incidents quickly.
Deployment Considerations for High Availability
Node Placement
- Spread across failure domains (different racks, zones, or cloud regions) to avoid correlated failures.
- Use anti‑affinity rules in Kubernetes (podAntiAffinity) so that no two Raft nodes land on the same host.
Rolling Upgrades
Because Raft tolerates the loss of a minority, you can upgrade nodes one at a time:
1. Drain the node from the service (stop accepting client traffic).
2. Wait for the node to step down to follower (if it was the leader, a new election occurs).
3. Upgrade the binary and restart.
4. Verify the node rejoins the cluster and catches up.
Automation tools like Argo Rollouts or Spinnaker can enforce the “wait for healthy majority” rule.
Disaster Recovery
- Backup snapshots to an off‑site bucket (e.g., AWS S3) daily.
- Store term/vote metadata in a highly durable KV store (e.g., etcd) to survive total data‑center loss.
- Keep a cold standby cluster in another region; on failover, promote it by copying the latest snapshot and restarting nodes with the same cluster ID.
Security
- Enable mutual TLS for all Raft RPCs; each node holds a certificate signed by a private CA.
- Rotate certificates periodically (e.g., every 90 days) and reload without restarting the process.
- Restrict RPC ports (default 5000‑5002) to the internal network using firewall rules.
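As a sketch of the first two points, Go's standard library can build a mutual‑TLS config for the RPC listener; the file paths and single‑CA layout here are assumptions:
import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"os"
)

// newMTLSConfig both presents this node's certificate and requires
// peers to present one signed by the same private CA.
func newMTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("failed to parse CA certificate")
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		ClientAuth:   tls.RequireAndVerifyClientCert,
		MinVersion:   tls.VersionTLS12,
	}, nil
}
A server would wrap its listener with tls.NewListener(l, cfg); rotation without a restart can be handled by serving certificates through tls.Config's GetCertificate callback instead of the static Certificates slice.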
Key Takeaways
- Raft provides a clear, provably safe way to achieve consensus for state‑machine replication, making it ideal for HA services.
- Leader election, log replication, and safety invariants are isolated concerns that map cleanly to separate code modules.
- Durable storage of term/vote metadata and log entries, combined with periodic snapshots, keeps the system resilient to crashes and disk failures.
- Network partitions are handled automatically by the majority quorum rule; client routing and read‑only quorum strategies improve user experience during outages.
- Production‑grade Raft deployments require careful attention to testing (unit, integration, chaos), observability (metrics, logs), and operational practices (rolling upgrades, cross‑zone placement, security).
Further Reading
- Raft Consensus Algorithm (paper) – the original academic description by Ongaro & Ousterhout.
- etcd/raft repository – a battle‑tested Go implementation with extensive tests.
- Consul Service Mesh documentation – shows how HashiCorp uses Raft for leader election and KV storage.