TL;DR — Copy‑on‑write (COW) B‑trees let a database take a point‑in‑time snapshot by duplicating only the modified path, not the whole structure. The result is an atomic, lock‑free view of the data that can be read safely while writers continue operating.
Databases must balance two competing goals: high‑throughput writes and reliable, consistent reads. Traditional snapshot techniques either pause writes, copy entire files, or rely on heavyweight locking—each approach hurts performance or adds complexity. Copy‑on‑write B‑trees solve this dilemma by turning the index itself into a versioned data structure. When a transaction modifies a leaf, the tree creates new versions of the affected nodes, leaving the old ones untouched. Those untouched nodes become the basis of an atomic snapshot that any reader can use without ever seeing a half‑written state.
The Problem with Traditional Snapshots
Full‑Copy Approaches
Most early relational engines created a snapshot by copying the entire datafile to a separate location. While conceptually simple, the I/O cost scales linearly with data size. In a 10 TB warehouse, a full copy can take hours, during which the system must either block writes or risk inconsistencies. Moreover, the copied data quickly becomes stale, forcing frequent recopy cycles.
Lock‑Based Consistency
Another common technique is to acquire a global read lock, preventing any write while the snapshot is taken. This guarantees a consistent view but throttles write throughput to zero for the lock’s duration. In high‑traffic OLTP systems, even a few milliseconds of lock can translate into thousands of lost transactions.
Log‑Based MVCC
Multi‑Version Concurrency Control (MVCC) stores previous row versions in a write‑ahead log. Readers pick the version that matches their transaction’s timestamp. While MVCC eliminates global locks, it still requires a separate cleanup process (vacuuming) to reclaim space, and the log can grow dramatically under heavy write loads.
All three methods suffer from either I/O overhead, write contention, or maintenance complexity—the very problems copy‑on‑write B‑trees aim to eliminate.
Copy‑on‑Write B‑Tree Fundamentals
A B‑tree is a balanced, multi‑way search tree where each node contains a sorted array of keys and child pointers. In a conventional B‑tree, updating a key mutates the node in place. Copy‑on‑write changes that rule: any mutation creates a new version of the node, leaving the original untouched.
Node Versioning
When a leaf node receives a new key/value pair, the engine allocates a fresh leaf page, copies the existing entries, inserts the new one, and then updates the parent pointer to reference the new leaf. The parent node itself may also need a new version if its child pointer changes. This process repeats up to the root, producing a new root pointer that represents the latest tree version.
def cow_insert(tree_root, key, value):
# Locate leaf (simplified)
leaf = find_leaf(tree_root, key)
# Copy leaf and insert
new_leaf = leaf.copy()
new_leaf.insert(key, value)
# Propagate new pointers upward
return propagate_up(tree_root, leaf, new_leaf)
Only the nodes along the search path are duplicated; the rest of the tree remains shared between versions. This path copying makes the operation O(log N) in both time and additional space, where N is the number of rows.
Path Copying
Consider a tree of height 4. Inserting a record modifies at most 4 nodes (leaf, its parent, grand‑parent, root). The untouched subtrees continue to be referenced by both the old and new root. Consequently, the old root still points to a fully functional, immutable snapshot of the database as it existed before the insertion.
The key insight is that snapshots are just older root pointers. By storing a reference to any historical root, the engine instantly obtains a consistent view of the entire dataset without any extra copying.
How Atomic Snapshots Are Built
Snapshot Creation Process
- Freeze the Root Pointer – When a transaction requests a snapshot, the engine records the current root page identifier.
- Publish the Snapshot – The identifier is stored in a snapshot table or handed to the client. No data is copied, and no locks are taken.
- Continue Normal Writes – Subsequent writes follow the COW path‑copying algorithm, producing new roots while the old root remains valid for readers.
Because the old root never changes, any reader that starts from it sees a fully consistent state that cannot be altered by concurrent writers. This property satisfies the definition of an atomic snapshot: a view that reflects the database at a single logical instant.
Consistency Guarantees
Copy‑on‑write B‑trees provide serializable isolation for snapshot readers without requiring lock escalation. The snapshot is read‑only by design; attempts to modify it are either rejected or automatically redirected to a new version. As described in the PostgreSQL documentation on heap tables and visibility maps (see the PostgreSQL MVCC docs), a similar principle underlies transaction snapshots, but COW B‑trees push the versioning down to the index layer, making the guarantee cheaper and more deterministic.
Performance Implications
Write Amplification
While COW reduces read‑write interference, it introduces write amplification: each logical update may cause up to height physical page writes. However, modern storage devices (NVMe SSDs) handle small random writes efficiently, and the extra writes are bounded by log₂(N) rather than the linear cost of full copies.
Read‑Optimized Access
Snapshot readers traverse the old root, following a path that contains only shared nodes. Because those nodes are never mutated, they can be cached indefinitely, leading to high cache hit rates. In RocksDB benchmarks, read latency for point‑in‑time snapshots is comparable to reading the latest data, with a marginal overhead of 2–3 % (RocksDB docs).
Space Reclamation
Old versions linger until no active snapshot references them. The engine tracks reference counts per root and per page. When the count drops to zero, a background garbage collector frees the pages. This approach mirrors the vacuum process in PostgreSQL but operates at the page level, making reclamation incremental and less disruptive.
Real‑World Implementations
PostgreSQL’s B‑Tree Indexes
PostgreSQL uses a variant of COW for its B‑tree indexes. When a page split occurs, the new page is written and the parent is updated, leaving the original page untouched for any snapshot that still references it. The mechanism is detailed in the PostgreSQL source code comments.
RocksDB and LSM‑Tree Hybrids
RocksDB, built on a Log‑Structured Merge (LSM) tree, also incorporates COW at the block level for its blob storage and column families. Snapshots are implemented by retaining a pointer to the immutable memtable and the set of SST files that constitute the view (RocksDB snapshots).
ZFS and Filesystem Snapshots
Although not a database, ZFS demonstrates the power of COW B‑trees for whole‑filesystem snapshots. Each ZFS dataset is a B‑tree of blocks; taking a snapshot simply records the current root block pointer. The same principle applies to database indexes that adopt the COW model.
Key Takeaways
- Path copying duplicates only the nodes on the write path, keeping the rest of the tree shared between versions.
- An atomic snapshot is just an older root pointer; no data is copied and no locks are needed.
- Write amplification is bounded by the tree height, while read performance benefits from immutable pages that stay hot in cache.
- Real‑world systems like PostgreSQL, RocksDB, and ZFS already rely on COW B‑trees to provide fast, crash‑consistent snapshots.
- Proper reference‑count tracking and background garbage collection reclaim space without disrupting active snapshots.