TL;DR — Copy‑on‑write (COW) B‑trees let a database take instant, crash‑consistent snapshots without stopping writes. By versioning leaf and inner pages, you can rebuild the state of the tree at any previous LSN, enabling point‑in‑time recovery (PITR) with low latency and bounded storage overhead.

Databases that promise “instant snapshots” and “point‑in‑time recovery” must reconcile two seemingly opposite goals: write throughput and immutable historical views. Traditional B‑tree implementations rewrite pages in place, which forces a full copy of the affected branch for every snapshot. Copy‑on‑write B‑trees flip that model—pages become immutable once written, and new versions are allocated on demand. The result is a data structure that naturally supports durable snapshots, efficient garbage collection, and deterministic recovery paths. In this article we dissect the engineering behind COW B‑trees, walk through a production‑grade architecture, and provide concrete code snippets that you can adapt to your own storage engine.

Motivation for Durable Snapshots

  1. Operational agility – Developers need to spin up a read‑only copy of production data for analytics, testing, or regulatory audits without impacting OLTP latency.
  2. Disaster recovery – A single‑point failure (e.g., power loss) should not invalidate the last known good state; the system must be able to roll back to any committed transaction.
  3. Regulatory compliance – Financial and health‑care workloads often require retention of data as it existed at a specific timestamp.

Historically, systems achieved these goals with heavyweight mechanisms: full database dumps, block‑level snapshots at the filesystem level, or periodic logical backups. All of them incur O(N) copy cost, where N is the size of the dataset, and they introduce I/O spikes that degrade latency‑sensitive workloads. A COW B‑tree reduces the snapshot cost to O(Δ), where Δ is the number of pages touched since the previous snapshot—typically a few hundred megabytes even for terabyte‑scale databases.

Copy-on-Write B-Tree Fundamentals

How COW differs from traditional B-Tree updates

Traditional B‑TreeCopy‑on‑Write B‑Tree
Updates overwrite existing pages in place.Pages are immutable; a new page is allocated for every modification.
Snapshots require a full page‑level copy.Snapshot is merely a pointer to the current root page.
Garbage collection is manual or absent.Unreferenced pages are reclaimed automatically after all snapshots that reference them expire.
Write amplification is proportional to the height of the tree (≈ log₂N).Write amplification includes the new leaf plus all ancestors on the path to the root (still ≈ log₂N, but each write creates fresh pages).

In a COW B‑tree, the root pointer becomes the version identifier. When a transaction commits, the engine atomically swaps the global root pointer to the newly built root. Any snapshot taken before the swap continues to see the old root, which in turn references a chain of immutable pages that never change.

Node versioning and immutable pages

  • Page size – Most production engines (RocksDB, PostgreSQL) use 4 KB to 16 KB pages. A larger page reduces the depth of the tree but increases the cost of copying a page during a write.
  • Version stamp – Each page header stores a monotonically increasing Log Sequence Number (LSN) and a checksum. The LSN is derived from the transaction commit LSN, guaranteeing that pages are ordered globally.
  • Free list – When a page is no longer reachable from any active snapshot, it is placed on a free list for reuse. The free list itself can be a simple COW linked list to avoid race conditions.

Architecture in Production

Integration with Write-Ahead Logging (WAL)

Even though pages are immutable, the engine still needs a WAL to guarantee durability across power failures. The flow looks like this:

  1. Begin transaction – Allocate a transaction ID (TxID) and record the current root LSN.
  2. Perform writes – For each key modification, allocate a new leaf page (or split existing leaf) and copy the parent chain up to the root, updating pointers to the new child pages.
  3. Write WAL records – Each page allocation and pointer update is logged as a WAL entry containing the page ID, LSN, and payload checksum.
  4. fsync – The WAL is flushed to stable storage. Only after the flush succeeds does the engine update the global root pointer in a commit record (also written to the WAL).
  5. Atomic root swap – The commit record contains the new root page ID. On recovery, the system reads the latest commit record and follows the root pointer to reconstruct the latest tree.

Because the pages themselves are immutable, the WAL can be truncated once all snapshots older than the oldest active LSN have been GC’d. This is a major advantage over traditional B‑trees, where the WAL must retain enough information to undo in‑place page modifications.

Snapshot lifecycle and garbage collection

A snapshot is represented by a lightweight struct:

type Snapshot struct {
    RootPageID uint64 // immutable root of the version
    LSN        uint64 // LSN at which the snapshot was taken
    CreatedAt  time.Time
}

Snapshots are stored in a snapshot registry (often a B‑tree itself) keyed by LSN. The registry is consulted by the background GC worker:

def gc_worker(snapshots, free_list):
    # Determine the oldest LSN that is still needed
    min_lsn = min(snap.LSN for snap in snapshots)
    # Walk the page graph from each snapshot root and mark reachable pages
    reachable = set()
    for snap in snapshots:
        walk_pages(snap.RootPageID, reachable)
    # Any page not in reachable and with LSN < min_lsn can be reclaimed
    for page_id, page in page_store.items():
        if page_id not in reachable and page.lsn < min_lsn:
            free_list.append(page_id)
            del page_store[page_id]

The walk_pages function is a depth‑first traversal that respects the immutable nature of pages; it never needs to lock pages because they cannot change underfoot. In production, the GC runs continuously but throttles its I/O to a configurable budget (e.g., 5 % of disk bandwidth) to avoid interfering with foreground writes.

Patterns in Production

Point-in-Time Recovery workflow

  1. Identify target LSN – The DBA selects a timestamp or transaction ID. The system translates it to the nearest committed LSN.
  2. Locate snapshot – The snapshot registry is queried for the snapshot whose LSN ≤ target LSN and whose next snapshot LSN > target LSN.
  3. Mount read‑only view – A new recovery session is created that uses the snapshot’s root pointer as the entry point. All reads are served from the immutable pages belonging to that version.
  4. Optional forward recovery – If the DBA wants to replay changes from the target LSN forward, the engine can read WAL entries after the target LSN and apply them on top of the snapshot, effectively creating a branch.

This pattern mirrors what PostgreSQL does with its base backup and WAL archiving but eliminates the need for binary file copies; the snapshot is just a root pointer.

Failure modes and mitigations

Failure modeSymptomMitigation
Partial WAL write (crash during fsync)Recovery stops at a partially written WAL record.Use preallocation and checksum validation; discard incomplete records during replay.
Free list reuse raceA page is reclaimed while a long‑running snapshot still references it.GC only recycles pages whose LSN < oldest active snapshot LSN; also employ epoch‑based reclamation to add an extra safety window.
Root pointer corruptionSystem cannot locate any valid root.Store the root pointer redundantly in two locations (e.g., WAL header and a meta‑page) and verify checksums on startup.
Excessive version churn (many tiny snapshots)Free list never empties, leading to storage bloat.Enforce a minimum snapshot retention period (e.g., 5 min) and encourage batch snapshotting for analytical workloads.

Implementation Sketch

Below is a minimal Python prototype that demonstrates the core COW insert logic. It is not production‑ready but highlights the steps you need to implement in a lower‑level language such as Rust or C++.

import struct
from collections import defaultdict

PAGE_SIZE = 4096
MAX_KEYS = 128   # simplified fan‑out

class Page:
    def __init__(self, page_id, lsn, is_leaf):
        self.id = page_id
        self.lsn = lsn
        self.is_leaf = is_leaf
        self.keys = []
        self.children = []   # child page IDs for internal nodes
        self.values = []     # only for leaves

def allocate_page(is_leaf, lsn):
    global next_page_id
    page = Page(next_page_id, lsn, is_leaf)
    next_page_id += 1
    page_store[page.id] = page
    return page

def copy_path(root_id, path, new_leaf):
    """
    Recursively copy nodes along `path` (list of page IDs from root to leaf)
    and replace the leaf with `new_leaf`. Returns new root ID.
    """
    if not path:
        return new_leaf.id
    parent_id = path[0]
    parent = page_store[parent_id]
    new_parent = allocate_page(parent.is_leaf, new_leaf.lsn)
    # copy keys and children, but replace the child that leads to the old leaf
    for i, child_id in enumerate(parent.children):
        if child_id == path[1] if len(path) > 1 else new_leaf.id:
            new_parent.children.append(new_leaf.id)
        else:
            new_parent.children.append(child_id)
    new_parent.keys = list(parent.keys)
    # recurse upwards
    return copy_path(root_id, path[1:], new_parent)

def insert(root_id, key, value, lsn):
    """
    Insert a key/value pair, returning the new root ID.
    """
    # 1. Descend to leaf, collecting the path
    path = []
    node_id = root_id
    while True:
        node = page_store[node_id]
        path.append(node_id)
        if node.is_leaf:
            leaf = node
            break
        # binary search to find child
        idx = 0
        while idx < len(node.keys) and key > node.keys[idx]:
            idx += 1
        node_id = node.children[idx]

    # 2. Create a new leaf with the inserted key/value
    new_leaf = allocate_page(True, lsn)
    new_leaf.keys = leaf.keys + [key]
    new_leaf.values = leaf.values + [value]
    # (splitting omitted for brevity)

    # 3. Copy ancestors up to root
    new_root_id = copy_path(root_id, path[:-1], new_leaf)
    return new_root_id

Traversal of a historic version

Reading a snapshot is a simple depth‑first walk that never mutates state:

def lookup(root_id, key):
    node = page_store[root_id]
    while not node.is_leaf:
        idx = 0
        while idx < len(node.keys) and key > node.keys[idx]:
            idx += 1
        node = page_store[node.children[idx]]
    # leaf node
    if key in node.keys:
        i = node.keys.index(key)
        return node.values[i]
    return None

Because page_store contains only immutable pages, lookup can be called concurrently from any number of snapshot sessions without locks.

Key Takeaways

  • Copy‑on‑write makes snapshots O(Δ): only pages touched since the previous snapshot need to be duplicated, turning a full‑copy operation into a near‑zero‑cost pointer swap.
  • Immutable pages simplify concurrency: readers never block writers, and garbage collection can be performed without complex locking protocols.
  • WAL integration remains essential: the WAL guarantees that newly allocated pages survive power loss; once snapshots are durable, the WAL can be truncated aggressively.
  • Versioned root pointers are the only state needed for PITR: a single 8‑byte root ID identifies the entire database state at any committed LSN.
  • Production‑grade systems must manage free‑list safety and snapshot retention: epoch‑based reclamation and minimum snapshot lifetimes prevent accidental data loss and storage bloat.
  • The pattern scales: Large‑scale services such as CockroachDB, TiDB, and FoundationDB already rely on COW B‑trees (or variants) to provide low‑latency, point‑in‑time restores for multi‑petabyte clusters.

Further Reading