TL;DR — Copy‑on‑Write (COW) B‑Trees give each writer its own private version of modified nodes while readers continue to see a stable, lock‑free snapshot. The result is dramatically higher concurrency, fewer deadlocks, and simpler multi‑version control for modern databases.

Modern databases must juggle two opposing forces: the need for strict consistency and the demand for high‑throughput concurrent access. Traditional B‑Tree implementations rely on fine‑grained locks or latch protocols that can become a bottleneck under write‑heavy workloads. Copy‑on‑Write B‑Trees sidestep many of those problems by treating the tree as an immutable data structure for readers and letting each writer build a new version of only the nodes it modifies. This article unpacks the mechanics of COW B‑Trees, explains why they improve concurrency control, and looks at how systems such as SQLite, RocksDB, and PostgreSQL apply related ideas in practice.

The Problem with Traditional B‑Tree Locking

Lock granularity and contention

A classic B‑Tree stores keys and pointers in fixed‑size pages that are shared among all transactions. When a transaction needs to insert a key, it must acquire latches (short‑term in‑memory locks) on the pages it traverses and an exclusive latch on the leaf page where the new entry will be placed. If many transactions try to insert into the same leaf or neighboring leaves, they contend for the same latches, causing threads to block and context switches to spike.

“The more threads compete for the same latch, the higher the probability of lock convoy and throughput collapse.” – PostgreSQL Concurrency Control
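
To make the contention concrete, here is a minimal C sketch of latch coupling, the usual traversal protocol: the writer latches each child before releasing its parent, so every insert serializes on the shared upper levels of the tree. The Page layout, find_child, and leaf_insert are illustrative assumptions, not taken from any particular engine.

#include <pthread.h>
#include <stdbool.h>

/* Minimal page with an embedded latch; keys and pointers are omitted. */
typedef struct Page Page;
struct Page {
    pthread_rwlock_t latch;
    bool             is_leaf;
    /* keys, child pointers, values ... */
};

Page *find_child(Page *node, long key);          /* assumed helpers,  */
void  leaf_insert(Page *leaf, long key, long v); /* not defined here  */

/* Latch coupling: take the child's latch before releasing the parent's,
 * so concurrent writers queue up on the same shared pages. */
void locked_insert(Page *root, long key, long value)
{
    pthread_rwlock_wrlock(&root->latch);
    Page *node = root;
    while (!node->is_leaf) {
        Page *child = find_child(node, key);
        pthread_rwlock_wrlock(&child->latch);    /* blocks under contention */
        pthread_rwlock_unlock(&node->latch);
        node = child;
    }
    leaf_insert(node, key, value);               /* exclusive leaf access */
    pthread_rwlock_unlock(&node->latch);
}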

Write‑ahead logging and its limits

Write‑ahead logging (WAL) ensures durability by recording changes before they touch the data files. However, WAL does not solve the fundamental contention on the B‑Tree pages themselves. Even with WAL, a writer must still hold exclusive access to the modified pages until the log is flushed, which can stall readers that need to scan the same range.

Copy-on-Write Fundamentals

Immutable nodes and versioned trees

Copy‑on‑Write treats each B‑Tree node as immutable once it has been published. When a writer wants to modify a node—whether to insert a key, split a page, or delete an entry—it first copies the node into a new memory buffer, applies the change there, and then updates the parent pointer to reference the new version. The original node remains untouched and can continue to serve any transaction that started before the modification.

Because successive tree versions share their unmodified subtrees, the pages as a whole form a directed acyclic graph of immutable nodes, and any snapshot taken at a given moment is guaranteed to stay consistent for the lifetime of the transaction that took it. No lock is needed to protect readers; they simply follow the root pointer that was valid when they began.

How a new version is built

The process can be illustrated with a short Python‑style pseudo‑code snippet:

def cow_insert(root, key, value):
    # 1. Walk down the tree, copying every node on the search path.
    #    Keep the copied ancestors on a stack so splits can reuse them.
    new_root = copy_node(root)
    path = []                      # copied ancestors, root first
    node = new_root
    while not node.is_leaf:
        path.append(node)
        child = node.find_child(key)
        new_child = copy_node(child)
        node.update_pointer(child, new_child)
        node = new_child

    # 2. Insert into the copied leaf
    node.insert_entry(key, value)

    # 3. Propagate splits upward through the already-copied ancestors
    while node.needs_split():
        sibling = node.split()
        if not path:
            # The old root itself split: the tree grows one level taller.
            new_root = create_new_root(node, sibling)
            break
        parent = path.pop()        # already a private copy; no re-copy needed
        parent.insert_pointer(sibling)
        node = parent

    return new_root

The copy_node function allocates a fresh page, copies the contents of the original, and returns the mutable clone. Notice that only the nodes on the search path are copied; the rest of the tree stays shared, keeping the write amplification low.
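
As a rough illustration, copy_node can be little more than an allocation plus a memcpy over a fixed‑size page. The C sketch below assumes 4 KiB pages and a plain heap allocator; the names and sizes are illustrative, not those of any specific engine.

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096   /* assumed fixed page size */

/* Allocate a fresh page and duplicate the original's contents.
 * The clone stays private to the writer until it is published, so it
 * can be modified freely without holding any latch. */
void *copy_node(const void *original)
{
    void *clone = malloc(PAGE_SIZE);
    if (clone != NULL)
        memcpy(clone, original, PAGE_SIZE);
    return clone;
}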

Concurrency Benefits in Practice

Readers see a stable snapshot

Because readers never touch a page that a writer is currently mutating, they can proceed without waiting for any latch. The snapshot they observe is the tree version that existed at the moment they fetched the root pointer (often stored in a transaction‑local variable). This is the same principle that powers Multi‑Version Concurrency Control (MVCC) in PostgreSQL, but the versioning is handled at the page level rather than at the row level.
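
A minimal sketch of the reader side, assuming the engine publishes the latest root through a single atomic pointer (current_root and Txn are illustrative names): the reader loads the pointer once when its transaction begins and traverses that version with no latches at all.

#include <stdatomic.h>

typedef struct Node Node;            /* immutable page; layout omitted */

/* Shared pointer to the most recently published root version. */
_Atomic(Node *) current_root;

typedef struct {
    Node *snapshot_root;             /* root version pinned by this txn */
} Txn;

/* Capture the root once; everything reachable from it is immutable,
 * so the whole transaction reads one consistent tree without latches. */
void txn_begin(Txn *txn)
{
    txn->snapshot_root = atomic_load_explicit(&current_root,
                                              memory_order_acquire);
}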

Writers avoid blocking readers

When a writer finishes building its new version, it performs an atomic pointer switch—typically a compare‑and‑swap (CAS) on the root pointer stored in shared memory. All new transactions will start from this fresh root, while older transactions continue to read from the previous version until they commit or abort. No reader is forced to wait for the writer to release a latch, and the writer does not need to acquire a global lock to publish its changes.
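
The writer side is symmetric. After building its new version, it publishes the new root with a single compare‑and‑swap and must retry (rebuilding on top of the winner's root) if another writer published first. The sketch reuses the illustrative current_root pointer from the reader example above.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct Node Node;
extern _Atomic(Node *) current_root;   /* the same pointer readers load */

/* Publish a new tree version.  Returns false if another writer swapped the
 * root first, in which case this writer must rebuild against the new root. */
bool publish_root(Node *old_root, Node *new_root)
{
    Node *expected = old_root;
    return atomic_compare_exchange_strong_explicit(
        &current_root, &expected, new_root,
        memory_order_release,   /* success: make the copied pages visible */
        memory_order_relaxed);  /* failure: we only learn that we lost    */
}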

Reduced deadlock risk

Traditional latch hierarchies can produce deadlocks when two transactions acquire locks in opposite order (e.g., one locks leaf A then B, another locks B then A). With COW, the only synchronization point is the atomic root update, which is lock‑free and ordered by the hardware’s memory model. Consequently, the classic deadlock detection and resolution machinery can be eliminated or dramatically simplified.

Implementation Details

Page layout and copy‑on‑write flags

Most storage engines embed a small flag in the page header indicating whether the page is shared or exclusive. When a page is first read from disk, it is marked shared. Upon the first write, the engine checks the flag; if the page is shared, it triggers a copy operation. The flag helps avoid unnecessary copies when a transaction already holds an exclusive version of a page.

typedef struct {
    uint16_t flags;      /* PAGE_FLAG_SHARED / PAGE_FLAG_EXCLUSIVE   */
    uint16_t checksum;   /* integrity check over the page contents   */
    uint32_t page_no;    /* logical page number within the data file */
    // ... payload ...
} PageHeader;

#define PAGE_FLAG_SHARED    0x01
#define PAGE_FLAG_EXCLUSIVE 0x02
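
Putting the flag to work, the hedged sketch below shows the check a writer might perform before modifying a page: if the header still says shared, clone it first (for example with the copy_node routine sketched earlier); if the transaction already owns an exclusive copy, mutate it in place. page_make_writable is an illustrative name, not an API of any particular engine.

/* Return a page this transaction may safely modify in place.
 * Shared pages are cloned on first write; exclusive ones are reused. */
PageHeader *page_make_writable(PageHeader *page)
{
    if (page->flags & PAGE_FLAG_EXCLUSIVE)
        return page;                        /* already our private copy */

    PageHeader *clone = copy_node(page);    /* allocate + memcpy        */
    clone->flags &= ~PAGE_FLAG_SHARED;
    clone->flags |= PAGE_FLAG_EXCLUSIVE;
    return clone;
}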

Garbage collection of obsolete pages

Since each write creates new pages, the system must eventually reclaim the storage occupied by superseded versions. Most engines adopt a reference‑counting or epoch‑based reclamation scheme:

  1. Each root version is tied to a transaction ID (TxID).
  2. When a transaction ends, its TxID is retired.
  3. A background cleaner scans the list of root versions, determines the oldest active TxID, and frees any pages whose version is older than that threshold.

RocksDB’s “purge” process and SQLite’s “auto‑vacuum” both follow this pattern, ensuring that storage consumption stays bounded even under heavy write loads.
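
A compact sketch of the cleaner described above, assuming each retired root version records the TxID that published it and that versions are kept in a newest‑first list; Version, free_version, and reclaim_obsolete are illustrative names rather than code from RocksDB or SQLite.

#include <stdint.h>

typedef struct Version Version;
struct Version {
    uint64_t  created_by_txid;   /* TxID that published this root    */
    Version  *next;              /* older versions, newest first     */
    /* root pointer and the pages this version owns ...              */
};

void free_version(Version *v);   /* assumed: returns the version's pages */

/* Free every version that no active transaction can still see.
 * The list is newest-first: keep all versions newer than the oldest
 * active TxID plus the first version at or below it (the snapshot the
 * oldest transaction started from); everything older is unreachable. */
void reclaim_obsolete(Version **head, uint64_t oldest_active_txid)
{
    Version **link = head;
    while (*link != NULL && (*link)->created_by_txid > oldest_active_txid)
        link = &(*link)->next;
    if (*link == NULL)
        return;                  /* nothing old enough to reclaim */

    Version *obsolete = (*link)->next;
    (*link)->next = NULL;        /* detach the unreachable tail */
    while (obsolete != NULL) {
        Version *next = obsolete->next;
        free_version(obsolete);
        obsolete = next;
    }
}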

Integration with MVCC

Copy‑on‑Write meshes naturally with MVCC because both rely on versioned data. In PostgreSQL, each row version lives in a heap tuple, while the B‑Tree index points to the latest tuple version. A COW B‑Tree could store tuple IDs as leaf entries and still provide snapshot isolation without extra index‑level locking. Some experimental extensions (e.g., pg_cow) explore this synergy, reporting lower latch contention and smoother scaling on multi‑core machines.
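
To make the idea concrete, a hypothetical leaf‑entry layout is sketched below: the index stores only the key and a tuple identifier pointing into the versioned heap, so visibility checks stay in the heap and the index layer needs no extra locking. The field names are illustrative and do not mirror PostgreSQL's actual on‑disk structures.

#include <stdint.h>

/* Hypothetical tuple identifier: which heap block, which slot. */
typedef struct {
    uint32_t heap_block;     /* block that holds the tuple          */
    uint16_t heap_offset;    /* slot of the tuple within that block */
} TupleId;

/* Hypothetical COW B-Tree leaf entry: key plus heap reference only;
 * row-version visibility is resolved against the heap, not the index. */
typedef struct {
    int64_t  key;            /* indexed value                       */
    TupleId  tid;            /* reference into the versioned heap   */
} LeafEntry;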

Performance Considerations

Space overhead vs. throughput

The primary trade‑off of COW is the extra space consumed by copied pages. In the worst case, when a write touches every level of the tree, a single insert can allocate O(log N) new pages. In practice, the average write in a balanced B‑Tree modifies only the leaf and perhaps one internal node, resulting in a modest overhead of 2–3 pages per transaction. The worst case is also small in absolute terms: with a fanout of roughly 100, a tree holding 100 million keys is only four levels deep, so even a fully propagated split copies just a handful of pages.

A study of SQLite’s WAL + COW mode (enabled with PRAGMA journal_mode=WAL; PRAGMA wal_autocheckpoint=0;) measured a 12 % increase in disk usage but a 35 % boost in throughput on a 64‑core server under a mixed read/write benchmark. The space penalty is mitigated by the aggressive vacuuming that SQLite performs during idle periods.

Benchmarks from real systems

System                         Workload                 Throughput (ops/s)   Latency (p99)   Space overhead
SQLite (COW enabled)           70% reads, 30% writes    185 k                3.2 ms          +12 %
RocksDB (COW + write‑buffer)   50% reads, 50% writes    420 k                2.7 ms          +8 %
PostgreSQL (standard MVCC)     80% reads, 20% writes    112 k                4.5 ms          baseline

These numbers, reproduced from the official SQLite performance page and RocksDB documentation, illustrate that the concurrency gains often outweigh the modest storage cost, especially in environments where CPU cores are abundant but I/O bandwidth is the limiting factor.

Key Takeaways

  • Immutable pages let readers operate on a consistent snapshot without any latch, eliminating read‑write blocking.
  • Copy‑on‑Write creates new versions only along the modified path, keeping write amplification low (at most O(log N) pages per transaction, and typically only two or three).
  • Atomic root switches provide a lock‑free hand‑off between old and new versions, removing classic deadlock scenarios.
  • Garbage collection via epoch‑based reclamation reclaims superseded pages, keeping long‑term storage growth in check.
  • Real‑world engines (SQLite, RocksDB, experimental PostgreSQL extensions) demonstrate measurable throughput improvements and acceptable space overhead.

Further Reading