TL;DR — Copy‑on‑Write (COW) B‑trees duplicate only the modified path of a tree, leaving the rest untouched. That small amount of extra copying lets a database capture a consistent view in a single atomic step, without pausing queries or flushing the whole store.

Databases have long needed a way to capture a point‑in‑time view of their data without blocking ongoing reads and writes. Traditional approaches—full file copies, lock‑step checkpoints, or write‑ahead logs—either stall traffic or consume massive I/O. Copy‑on‑Write B‑trees solve this by marrying two classic ideas: the balanced search structure of B‑trees and the immutable‑by‑default semantics of COW. The result is an “instant snapshot”: a read‑only view that appears the moment the write begins, with virtually no extra cost.

What Is a B‑Tree?

A B‑tree is a self‑balancing multi‑way search tree designed for block‑oriented storage (disk or SSD). Its key properties:

  1. Node degree – each node (other than the root) holds between t‑1 and 2t‑1 keys, where t is the minimum degree; the root may hold fewer.
  2. Height balance – all leaf nodes reside at the same depth, guaranteeing O(log N) lookup, insert, and delete.
  3. Cache‑friendly layout – a node’s size typically matches the storage page size (e.g., 4 KB), minimizing I/O per operation.

Because a B‑tree packs many keys into each node, a single disk read narrows the search across hundreds of keys at once. This makes B‑trees the default index structure in relational databases (PostgreSQL, MySQL), embedded key‑value stores (LMDB, BoltDB), and even file systems (Btrfs).
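
As a rough illustration only, a node might be laid out like this in Python (the field names and the MIN_DEGREE constant are assumptions for the sketch, not taken from any particular engine):

MIN_DEGREE = 64   # "t": chosen so a full node stays near one 4 KB page

class BTreeNode:
    def __init__(self, is_leaf=True):
        self.leaf = is_leaf
        self.keys = []        # t-1 .. 2t-1 keys (the root may hold fewer)
        self.values = []      # leaf payloads, parallel to keys
        self.children = []    # len(keys) + 1 child pointers in internal nodes

    def is_full(self):
        return len(self.keys) == 2 * MIN_DEGREE - 1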

Copy‑on‑Write Fundamentals

Copy‑on‑Write (COW) is a resource‑management technique, originally from virtual‑memory systems, that treats data as immutable until a modification is required. The classic COW workflow:

read  -> return existing page
write -> allocate new page
         copy original content
         apply delta
         update pointer atomically

The crucial benefit is that readers never see a partially updated page. When a writer finishes, it swaps a single pointer (or version tag) to make the new page visible. This atomic pointer swap is the heart of instant snapshots: the old version remains reachable for any reader that started before the swap.
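
A minimal in‑memory sketch of that workflow, assuming a toy PageStore class (the class and its fields are illustrative, not a real API):

import copy

class PageStore:
    """Toy copy-on-write store: pages are immutable once published."""
    def __init__(self):
        self.pages = {}       # page_id -> page content
        self.next_id = 0
        self.root_id = None   # the single mutable word: the root pointer

    def alloc(self, content):
        pid = self.next_id
        self.next_id += 1
        self.pages[pid] = content
        return pid

    def write(self, page_id, delta):
        # COW: copy the old page, apply the change, publish a new page id.
        new_content = copy.deepcopy(self.pages[page_id])
        new_content.update(delta)
        return self.alloc(new_content)

# Readers holding the old root_id keep seeing the old page; the writer
# makes the change visible by swapping one pointer.
store = PageStore()
store.root_id = store.alloc({"k": 1})
old_root = store.root_id                     # a reader pins this version
new_root = store.write(store.root_id, {"k": 2})
store.root_id = new_root                     # atomic pointer swap
assert store.pages[old_root] == {"k": 1}     # old snapshot still intact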

COW is the backbone of many modern systems:

  • ZFS and Btrfs file systems use COW to implement filesystem‑level snapshots.
  • LMDB (Lightning Memory‑Mapped Database) stores the entire database in a single memory‑mapped region, relying on COW for transaction isolation.
  • Git stores objects as immutable blobs, creating new objects only when content changes.

Merging COW with B‑Trees

When you combine COW with a B‑tree, you get a persistent data structure: each mutation yields a new root node while preserving the previous version. The algorithm works like this:

def cow_insert(root, key, value):
    # Helper functions (clone_node, make_internal_node) and node methods
    # (search_position, insert_into_leaf, needs_split, split, insert_child)
    # are assumed to be provided by the node implementation.

    # 1. Clone the root; the previous version of the tree stays reachable.
    new_root = clone_node(root)
    node = new_root
    stack = []                         # cloned ancestors, root first

    # 2. Descend toward the leaf, cloning every node on the search path
    #    and re-pointing each cloned parent at its cloned child.
    while not node.is_leaf():
        i = node.search_position(key)
        cloned_child = clone_node(node.children[i])
        node.children[i] = cloned_child
        stack.append(node)
        node = cloned_child

    # 3. Insert into the leaf (already a fresh copy, so mutating it is safe).
    node.insert_into_leaf(key, value)

    # 4. Split overflowing nodes while climbing back up the cloned path.
    #    split() keeps the left half in the existing node, so the parent's
    #    child pointer stays valid; only the median key and the new right
    #    sibling are pushed into the parent.
    while stack and node.needs_split():
        parent = stack.pop()
        left, right, median = node.split()
        parent.insert_child(median, right)
        node = parent

    # 5. If the root itself overflowed, grow the tree by one level.
    if node.needs_split():
        left, right, median = node.split()
        new_root = make_internal_node(keys=[median], children=[left, right])

    return new_root

Key observations:

  • Only the nodes on the search path are cloned; the rest of the tree stays shared between versions.
  • The root pointer is the only global state that changes, making the snapshot operation a single atomic write.
  • Because each node is page‑aligned, the clone operation is essentially a page copy—fast on modern storage devices.

When a transaction commits, the database writes the new root pointer to a metadata page (often called the “superblock”). All readers that started before this write continue to follow the old root, while new readers see the new root instantly. No lock is required beyond the atomic pointer store.
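
A hedged sketch of that commit protocol in Python (the Superblock name and methods are illustrative; a real engine writes the root pointer to a durable metadata page rather than a Python object):

import threading

class Superblock:
    """Toy metadata page holding the current root pointer."""
    def __init__(self, root):
        self._root = root
        self._lock = threading.Lock()   # serializes writers only

    def begin_read(self):
        # Readers just load the current root: no lock, no blocking.
        return self._root

    def commit(self, new_root):
        # Writers serialize among themselves, then publish with one store.
        with self._lock:
            self._root = new_root       # the atomic "instant snapshot" step

A reader that called begin_read() before a commit keeps traversing the old root's pages, which are never overwritten, so it observes a frozen snapshot for the life of its transaction.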

Instant Snapshots in Practice

PostgreSQL’s MVCC vs. COW B‑Tree Snapshots

PostgreSQL implements Multi‑Version Concurrency Control (MVCC) by storing tuple versions inside heap pages, not by cloning the index structure. While MVCC gives each transaction a consistent view, creating a true point‑in‑time snapshot still requires a heavyweight checkpoint or a base backup.

Contrast this with RocksDB (derived from LevelDB), which is built on a Log‑Structured Merge‑tree rather than a B‑tree, yet applies the same copy‑on‑write idea at the file level: SST files are immutable once written. When RocksDB compacts data, it writes new SST files and then updates a MANIFEST that records the current set of live files. That manifest update plays the role of the atomic root‑pointer swap: readers that open the database after the update see the new version, while iterators and snapshots created earlier keep seeing the older file set.

LMDB’s Zero‑Copy Snapshots

LMDB stores the entire B‑tree in a memory‑mapped file and keeps two meta pages at the front of that file. Readers start from the root pointer in whichever meta page carries the newer transaction ID. When a writer commits, it writes the new root pointer and transaction ID into the stale meta page, so the commit is a single page write and readers never observe a half‑updated root. Because the data pages themselves are never overwritten in place (they're COW), readers continue to access the old pages without any copy. The snapshot appears instantly, and the writer incurs only the cost of copying the pages it actually modified (typically a few kilobytes).
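
A simplified sketch of that double‑meta‑page scheme (the structures and field names are illustrative, not LMDB's actual on‑disk layout):

class MetaPage:
    def __init__(self, root_page, txn_id):
        self.root_page = root_page
        self.txn_id = txn_id

# Two fixed meta slots at the start of the file; writers alternate between them.
meta = [MetaPage(root_page=3, txn_id=10), MetaPage(root_page=7, txn_id=11)]

def current_meta():
    # Readers pick whichever meta page carries the newer transaction id.
    return max(meta, key=lambda m: m.txn_id)

def commit(new_root):
    cur = current_meta()
    stale = meta[0] if cur is meta[1] else meta[1]
    # Overwrite only the stale slot; the other slot stays valid for recovery.
    stale.root_page = new_root
    stale.txn_id = cur.txn_id + 1

Because only the stale slot is overwritten, a crash in the middle of a commit leaves the other slot intact, and recovery simply falls back to the newest valid meta page.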

Example: Creating a Consistent Backup

# Step 1: Copy the live environment; mdb_copy pins the current root by
#         opening a read transaction, so writers are never blocked
mdb_copy -c /var/lib/mydb /backups/mydb
# Step 2: /backups/mydb now contains a point-in-time snapshot

The mdb_copy utility walks the tree from the root it pinned at startup, while the database may still be serving reads and writes. Because committed pages are never overwritten (COW), the data reachable from that root cannot change underneath the copy.
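
The same point‑in‑time copy can be taken from application code. For instance, the py‑lmdb binding exposes a copy() method on the environment; the snippet below is a sketch assuming that binding is installed and using an illustrative backup path:

import lmdb

# Open the live environment; readers and writers may keep running.
env = lmdb.open("/var/lib/mydb", max_dbs=1)

# copy() runs inside a read transaction, so the output is a consistent
# point-in-time snapshot; compact=True rewrites pages densely as it copies.
env.copy("/backups/mydb-copy", compact=True)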

Trade‑offs and Performance Considerations

Aspect by aspect, the benefits of a COW B‑tree and its potential downsides:

Write amplification
  • Benefit: only pages on the write path are duplicated; the typical cost is one page per tree level, i.e. O(log N) pages per update.
  • Downside: during bulk inserts, duplicated pages can accumulate and inflate storage usage if compaction is delayed.

Read latency
  • Benefit: reads follow a stable, immutable tree with no lock contention.
  • Downside: when many versions coexist, fewer pages are shared between them, so page caches hold more distinct pages and hit rates drop slightly.

Space overhead
  • Benefit: versions share unchanged pages, so overhead is proportional to the amount of changed data.
  • Downside: long‑running transactions can prevent reclamation of older pages, causing “version bloat”.

Garbage collection
  • Benefit: simple reference counting or epoch‑based reclamation can free pages once no transaction holds them.
  • Downside: implementing safe reclamation across threads and processes adds complexity (e.g., hazard pointers or reader tracking).

Crash recovery
  • Benefit: the atomic meta‑page swap is crash‑safe; on restart, the latest valid meta page is used.
  • Downside: data pages must be flushed to disk before the new meta page is written, otherwise a power loss could leave the root pointer referencing unwritten pages.
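
To make the write‑amplification point concrete, here is a back‑of‑the‑envelope calculation (the page size, fanout, and key count are illustrative assumptions):

import math

page_size = 4096          # bytes per node/page
fanout = 200              # keys per internal node (assumed)
num_keys = 100_000_000    # keys in the tree

# Height of the tree is roughly log base fanout of the key count.
height = math.ceil(math.log(num_keys, fanout))   # ~4 levels for these numbers
bytes_copied_per_update = height * page_size
print(height, bytes_copied_per_update)           # 4 pages, about 16 KB per update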

Mitigating Version Bloat

Most production systems employ periodic checkpointing or compaction:

  1. Checkpoint – write a new meta page that references only the latest root, then return pages reachable solely from older roots to the free list (or truncate now‑unneeded log segments, if the engine keeps a log).
  2. Compaction – merge sparsely filled adjacent pages, rewrite them into fresh pages, and update the pointers that reference them. This resembles LSM‑tree compaction but operates on B‑tree pages.

Both operations can be run in the background, preserving the instant‑snapshot property for client transactions.
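
As a sketch of the reclamation side, epoch‑based page recycling might look roughly like this (the Reclaimer class and its policy are assumptions for illustration):

class Reclaimer:
    """Free COW pages only after every transaction that could see them ends."""
    def __init__(self):
        self.current_epoch = 0
        self.active = {}        # txn_id -> epoch at which it started
        self.retired = []       # (epoch_retired, page_id)

    def begin_txn(self, txn_id):
        self.active[txn_id] = self.current_epoch

    def end_txn(self, txn_id):
        del self.active[txn_id]

    def retire_page(self, page_id):
        # The page was superseded by a COW copy; it cannot be freed yet.
        self.retired.append((self.current_epoch, page_id))
        self.current_epoch += 1

    def collect(self, free_page):
        # Safe to free pages retired before the oldest active txn began.
        oldest = min(self.active.values(), default=self.current_epoch)
        still_retired = []
        for epoch, page_id in self.retired:
            if epoch < oldest:
                free_page(page_id)
            else:
                still_retired.append((epoch, page_id))
        self.retired = still_retired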

Key Takeaways

  • COW B‑trees duplicate only the modified path, making a snapshot as cheap as writing a single root pointer.
  • Readers never block writers because they follow immutable pages, while writers work on fresh copies.
  • Instant snapshots enable zero‑downtime backups, point‑in‑time recovery, and multi‑version reads without costly checkpoints.
  • Performance trade‑offs include modest write amplification and potential version bloat, mitigated by background compaction and epoch‑based reclamation.
  • Real‑world systems such as LMDB, Btrfs, and ZFS already leverage this pattern (and RocksDB applies the same immutability idea at the file level), proving its scalability from embedded devices to petabyte‑scale storage clusters.

Further Reading