Why Copy-on-Write B-Trees Enable Atomic Database Snapshots

TL;DR — Copy‑on‑write (COW) B‑trees let a database take a point‑in‑time snapshot by duplicating only the modified path, not the whole structure. The result is an atomic, lock‑free view of the data that can be read safely while writers continue operating.

Databases must balance two competing goals: high‑throughput writes and reliable, consistent reads. Traditional snapshot techniques either pause writes, copy entire files, or rely on heavyweight locking—each approach hurts performance or adds complexity. Copy‑on‑write B‑trees solve this dilemma by turning the index itself into a versioned data structure. When a transaction modifies a leaf, the tree creates new versions of the affected nodes, leaving the old ones untouched. Those untouched nodes become the basis of an atomic snapshot that any reader can use without ever seeing a half‑written state.

The Problem with Traditional Snapshots

Full‑Copy Approaches

Most early relational engines created a snapshot by copying the entire datafile to a separate location. While conceptually simple, the I/O cost scales linearly with data size. In a 10 TB warehouse, a full copy can take hours, during which the system must either block writes or risk inconsistencies. Moreover, the copied data quickly becomes stale, forcing frequent recopy cycles.

Lock‑Based Consistency

Another common technique is to acquire a global read lock, preventing any write while the snapshot is taken. This guarantees a consistent view but throttles write throughput to zero for the lock’s duration. In high‑traffic OLTP systems, even a few milliseconds of lock can translate into thousands of lost transactions.

Log‑Based MVCC

Multi‑Version Concurrency Control (MVCC) stores previous row versions in a write‑ahead log. Readers pick the version that matches their transaction’s timestamp. While MVCC eliminates global locks, it still requires a separate cleanup process (vacuuming) to reclaim space, and the log can grow dramatically under heavy write loads.

All three methods suffer from either I/O overhead, write contention, or maintenance complexity—the very problems copy‑on‑write B‑trees aim to eliminate.

Copy‑on‑Write B‑Tree Fundamentals

A B‑tree is a balanced, multi‑way search tree where each node contains a sorted array of keys and child pointers. In a conventional B‑tree, updating a key mutates the node in place. Copy‑on‑write changes that rule: any mutation creates a new version of the node, leaving the original untouched.

Node Versioning

When a leaf node receives a new key/value pair, the engine allocates a fresh leaf page, copies the existing entries, inserts the new one, and then updates the parent pointer to reference the new leaf. The parent node itself may also need a new version if its child pointer changes. This process repeats up to the root, producing a new root pointer that represents the latest tree version.

def cow_insert(tree_root, key, value):
    # Locate leaf (simplified)
    leaf = find_leaf(tree_root, key)
    # Copy leaf and insert
    new_leaf = leaf.copy()
    new_leaf.insert(key, value)
    # Propagate new pointers upward
    return propagate_up(tree_root, leaf, new_leaf)

Only the nodes along the search path are duplicated; the rest of the tree remains shared between versions. This path copying makes the operation O(log N) in both time and additional space, where N is the number of rows.

Path Copying

Consider a tree of height 4. Inserting a record modifies at most 4 nodes (leaf, its parent, grand‑parent, root). The untouched subtrees continue to be referenced by both the old and new root. Consequently, the old root still points to a fully functional, immutable snapshot of the database as it existed before the insertion.

The key insight is that snapshots are just older root pointers. By storing a reference to any historical root, the engine instantly obtains a consistent view of the entire dataset without any extra copying.

How Atomic Snapshots Are Built

Snapshot Creation Process

Freeze the Root Pointer – When a transaction requests a snapshot, the engine records the current root page identifier.
Publish the Snapshot – The identifier is stored in a snapshot table or handed to the client. No data is copied, and no locks are taken.
Continue Normal Writes – Subsequent writes follow the COW path‑copying algorithm, producing new roots while the old root remains valid for readers.

Because the old root never changes, any reader that starts from it sees a fully consistent state that cannot be altered by concurrent writers. This property satisfies the definition of an atomic snapshot: a view that reflects the database at a single logical instant.

Consistency Guarantees

Copy‑on‑write B‑trees provide serializable isolation for snapshot readers without requiring lock escalation. The snapshot is read‑only by design; attempts to modify it are either rejected or automatically redirected to a new version. As described in the PostgreSQL documentation on heap tables and visibility maps (see the PostgreSQL MVCC docs), a similar principle underlies transaction snapshots, but COW B‑trees push the versioning down to the index layer, making the guarantee cheaper and more deterministic.

Performance Implications

Write Amplification

While COW reduces read‑write interference, it introduces write amplification: each logical update may cause up to height physical page writes. However, modern storage devices (NVMe SSDs) handle small random writes efficiently, and the extra writes are bounded by log₂(N) rather than the linear cost of full copies.

Read‑Optimized Access

Snapshot readers traverse the old root, following a path that contains only shared nodes. Because those nodes are never mutated, they can be cached indefinitely, leading to high cache hit rates. In RocksDB benchmarks, read latency for point‑in‑time snapshots is comparable to reading the latest data, with a marginal overhead of 2–3 % (RocksDB docs).

Space Reclamation

Old versions linger until no active snapshot references them. The engine tracks reference counts per root and per page. When the count drops to zero, a background garbage collector frees the pages. This approach mirrors the vacuum process in PostgreSQL but operates at the page level, making reclamation incremental and less disruptive.

Real‑World Implementations

PostgreSQL’s B‑Tree Indexes

PostgreSQL uses a variant of COW for its B‑tree indexes. When a page split occurs, the new page is written and the parent is updated, leaving the original page untouched for any snapshot that still references it. The mechanism is detailed in the PostgreSQL source code comments.

RocksDB and LSM‑Tree Hybrids

RocksDB, built on a Log‑Structured Merge (LSM) tree, also incorporates COW at the block level for its blob storage and column families. Snapshots are implemented by retaining a pointer to the immutable memtable and the set of SST files that constitute the view (RocksDB snapshots).

ZFS and Filesystem Snapshots

Although not a database, ZFS demonstrates the power of COW B‑trees for whole‑filesystem snapshots. Each ZFS dataset is a B‑tree of blocks; taking a snapshot simply records the current root block pointer. The same principle applies to database indexes that adopt the COW model.

Key Takeaways

Path copying duplicates only the nodes on the write path, keeping the rest of the tree shared between versions.
An atomic snapshot is just an older root pointer; no data is copied and no locks are needed.
Write amplification is bounded by the tree height, while read performance benefits from immutable pages that stay hot in cache.
Real‑world systems like PostgreSQL, RocksDB, and ZFS already rely on COW B‑trees to provide fast, crash‑consistent snapshots.
Proper reference‑count tracking and background garbage collection reclaim space without disrupting active snapshots.

The Problem with Traditional Snapshots#

Full‑Copy Approaches#

Lock‑Based Consistency#

Log‑Based MVCC#

Copy‑on‑Write B‑Tree Fundamentals#

Node Versioning#

Path Copying#

How Atomic Snapshots Are Built#

Snapshot Creation Process#

Consistency Guarantees#

Performance Implications#

Write Amplification#

Read‑Optimized Access#

Space Reclamation#

Real‑World Implementations#

PostgreSQL’s B‑Tree Indexes#

RocksDB and LSM‑Tree Hybrids#

ZFS and Filesystem Snapshots#

Key Takeaways#

Further Reading#