TL;DR — Copy‑on‑write (COW) B‑trees store each modification as a new version of the affected node, leaving older versions untouched. This immutable‑write pattern makes taking a consistent snapshot as cheap as recording a pointer, while reads on the snapshot remain fast because the tree structure stays balanced and cache‑friendly.

Database engines spend a lot of time juggling two competing goals: high‑throughput writes and consistent, low‑latency reads. Snapshots—point‑in‑time views of the data—are essential for backups, analytics, and multi‑version concurrency control (MVCC). Traditional snapshot mechanisms often involve heavy locking, copying large data files, or replaying logs, all of which add latency and storage overhead. Copy‑on‑write B‑trees solve this problem by turning the snapshot operation into a near‑zero‑cost pointer capture, while preserving the performance characteristics of a classic B‑tree for both reads and writes.

Below we dive into the theory, the data structures, and real‑world performance numbers that demonstrate why COW B‑trees have become the backbone of modern storage engines such as SQLite, RocksDB, and FoundationDB.

Understanding Copy-on-Write (COW)

What is COW?

Copy‑on‑write is a memory‑management technique first popularized by the Unix fork() system call and later adopted by file systems like ZFS and Btrfs. The core idea is simple:

  1. Shared read‑only data is referenced by multiple owners.
  2. When a writer needs to modify that data, it creates a private copy, leaving the original untouched.
  3. The writer then updates its reference to point at the new copy.

Because the original data never changes, any reader that held a reference before the write continues to see a consistent view. In a database context, each write creates a new version of the affected pages, and a snapshot is just a reference to the root page of the tree at a particular moment.
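The three-step rule above can be sketched in a few lines of Python. `CowStore` and its methods are illustrative names, not part of any real engine; the point is only that writers clone while readers keep their old reference:

```python
class CowStore:
    """A toy copy-on-write key-value store."""

    def __init__(self, data=None):
        self._data = data or {}          # treated as immutable once shared

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        # The writer makes a private copy; the shared dict is never mutated.
        new_data = dict(self._data)
        new_data[key] = value
        return CowStore(new_data)        # new version; the old one stays valid

v1 = CowStore().set("a", 1)              # version 1
v2 = v1.set("a", 2)                      # version 2 clones, then updates

assert v1.get("a") == 1                  # a reader of version 1 is unaffected
assert v2.get("a") == 2
```

Holding a reference to `v1` here plays exactly the role of holding a snapshot's root pointer in the sections that follow.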

Historical context in file systems and databases

File systems such as ZFS leveraged COW to provide instantaneous snapshots and atomic writes without journaling. Early key‑value stores (e.g., LSM‑tree based) used append‑only logs, but they still required compaction to reclaim space. When B‑tree implementations started to adopt COW, they combined the balanced search properties of B‑trees with the versioning benefits of COW, yielding a structure that could serve both OLTP (online transaction processing) and OLAP (online analytical processing) workloads efficiently.

B‑Trees Basics

Structure and properties

A B‑tree is a self‑balancing multi‑way search tree defined by two parameters:

  • Order (or fan‑out) m – the maximum number of children a node can have.
  • Minimum occupancy – usually ⌈m/2⌉ children, guaranteeing logarithmic depth.

Each internal node stores up to m‑1 keys and m child pointers. Leaves store the actual records (or pointers to them) and are linked for range scans. The tree guarantees:

  • O(logₘ N) search, insert, and delete.
  • Sequential access via leaf links, which is crucial for range queries.
  • Disk‑friendly layout: nodes are sized to match page boundaries, reducing I/O.
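To make the logarithmic-depth guarantee concrete, here is a rough back-of-the-envelope estimate (the `max_depth` helper is ours, not from any library):

```python
import math

def max_depth(n_keys, fanout):
    """Approximate number of levels needed to index n_keys
    with the given fan-out: ceil(log_fanout(n_keys))."""
    return math.ceil(math.log(n_keys, fanout))

# One billion keys with a fan-out of 100 fit in about 5 levels,
# so a point lookup touches only a handful of pages.
print(max_depth(10**9, 100))
```

This is why page size (and hence fan-out) is the single most important tuning knob: doubling the fan-out shaves whole levels off the tree.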

Why B‑Trees are a workhorse for storage engines

  • Predictable I/O – each operation touches a bounded number of pages.
  • Cache efficiency – high fan‑out reduces depth, keeping more nodes in RAM.
  • Adaptability – B‑trees can be tuned for SSDs (larger pages) or in‑memory workloads (smaller pages).

Because of these traits, virtually every relational database still uses a B‑tree variant for its primary index.

Merging COW with B‑Trees

Versioned nodes and immutable writes

In a COW B‑tree, every node is immutable once written to disk. When an insert or delete touches a leaf, the engine:

  1. Allocates a new leaf page containing the updated records.
  2. Propagates the change upward, allocating a new internal node at each level of the path so that it points to the newly created child.
  3. Updates the root pointer to reference the new top‑level node.

The old nodes remain on disk unchanged, forming a historical version of the tree. A snapshot is simply the root pointer at the moment the transaction committed. In pseudocode, the insert path looks like this:

def cow_insert(root, key, value):
    """
    Insert (key, value) into a copy‑on‑write B‑tree.
    Returns a reference to the new root; the old root and every
    node reachable from it remain valid. Node splits are omitted
    for brevity.
    """
    # 1. Descend to the leaf, recording the internal nodes on the path
    path = []
    node = root
    while not node.is_leaf:
        path.append(node)
        node = node.child_for_key(key)

    # 2. Clone the leaf and insert the new entry into the copy
    old_child = node
    new_child = node.clone()
    new_child.insert_into_leaf(key, value)

    # 3. Walk back up, cloning each ancestor so it points at the
    #    freshly written child one level below
    for parent in reversed(path):
        new_parent = parent.clone()
        new_parent.replace_child(old_child, new_child)
        old_child, new_child = parent, new_parent

    # 4. The topmost clone is the new root
    return new_child

The clone() method copies the page content to a fresh buffer, assigns it a new page ID, and writes it to disk. No in‑place mutation occurs.

Write path vs read path

  • Write path – creates new pages, writes them sequentially, and finally updates the root pointer atomically (often via a WAL entry). The cost is proportional to the tree depth, typically 2–4 page writes for modern fan‑outs.
  • Read path – follows pointers from the current root (or a snapshot root) down to the leaf. Because the structure is identical to a classic B‑tree, read latency is unchanged.
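The read path's indifference to snapshots can be shown with minimal stand-in node classes (`Leaf`, `Internal`, and `child_for_key` are illustrative names, not a real engine's API):

```python
class Leaf:
    is_leaf = True
    def __init__(self, entries):
        self.entries = dict(entries)   # key -> record
    def get(self, key):
        return self.entries.get(key)

class Internal:
    is_leaf = False
    def __init__(self, keys, children):
        self.keys = keys               # separator keys
        self.children = children       # len(children) == len(keys) + 1
    def child_for_key(self, key):
        for i, sep in enumerate(self.keys):
            if key < sep:
                return self.children[i]
        return self.children[-1]

def lookup(root, key):
    """Descend from any root -- live or snapshot -- to a leaf."""
    node = root
    while not node.is_leaf:
        node = node.child_for_key(key)
    return node.get(key)

# A snapshot root and the live root can share unchanged subtrees:
shared = Leaf({1: "a", 2: "b"})
snap_root = Internal([10], [shared, Leaf({10: "old"})])
live_root = Internal([10], [shared, Leaf({10: "new"})])  # right leaf replaced

assert lookup(snap_root, 10) == "old"
assert lookup(live_root, 10) == "new"
assert lookup(live_root, 1) == "a"     # the shared page serves both roots
```

Note how the two roots reference the same `shared` leaf: this structural sharing is what keeps snapshots cheap in space as well as time.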

Space amplification considerations

Each write leaves behind obsolete pages. To keep storage consumption reasonable, engines employ garbage collection (GC) or vacuum processes that:

  • Track the oldest snapshot still in use.
  • Reclaim pages that were superseded before that snapshot was taken.
  • Optionally merge adjacent pages during compaction.

The overhead is usually modest—often < 20% of live data—especially when workloads have a high turnover of short‑lived snapshots.
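The reclamation rule itself is simple to state in code. The page-table layout below (per-page created/superseded version stamps) is an assumption made for illustration:

```python
def reclaimable_pages(page_table, live_snapshot_versions):
    """
    page_table: {page_id: (created_at, superseded_at_or_None)}
    A snapshot taken at version v sees a page iff
    created_at <= v < superseded_at, so a page is dead once it was
    superseded at or before the oldest snapshot still in use.
    """
    oldest = min(live_snapshot_versions, default=float("inf"))
    return [pid for pid, (created, superseded) in page_table.items()
            if superseded is not None and superseded <= oldest]

pages = {
    "p1": (0, 5),      # rewritten at version 5
    "p2": (0, None),   # still live
    "p3": (5, None),   # the page that replaced p1
}
assert reclaimable_pages(pages, [6]) == ["p1"]   # no snapshot needs p1
assert reclaimable_pages(pages, [3]) == []       # snapshot@3 still reads p1
```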

Snapshots in Modern Databases

Definition and use cases

A snapshot is a consistent view of the database at a particular logical timestamp. Common scenarios include:

  • Backup – copying the snapshot to off‑site storage without quiescing the database.
  • Analytics – running long‑running read‑only queries against a stable dataset while the live database continues to accept writes.
  • MVCC – providing each transaction with its own view, enabling lock‑free reads.

How COW B‑Trees enable cheap, consistent snapshots

Because nodes are immutable, a snapshot needs only to store the root page identifier of the tree at commit time. No data copying, no lock acquisition, and no log replay are required. When a reader opens a snapshot, the engine:

  1. Retrieves the root ID from the snapshot metadata.
  2. Traverses the tree exactly as it would for the live database, but using the versioned pages that existed at snapshot time.
  3. Releases the snapshot after the query finishes, allowing GC to eventually reclaim pages.

This approach is O(1) in time and space for snapshot creation, compared to O(N) for full copies.

Comparison with traditional MVCC and log‑based approaches

| Feature | COW B‑Tree Snapshot | Traditional MVCC (row‑version) | Log‑based Snapshot |
| --- | --- | --- | --- |
| Creation cost | Constant (pointer) | Writes a row‑version record per updated row | Requires copying log or flushing dirty pages |
| Read overhead | Same as live reads | Each read must filter invisible versions | May need to replay logs to reconstruct state |
| Write amplification | Limited to path length | May duplicate entire rows | Log writes already required; extra copying for snapshots |
| Space reclamation | GC can drop whole pages | Row‑level tombstones linger | Log truncation needed after checkpoint |

COW B‑trees combine the low‑cost snapshot creation of log‑based systems with the fast point‑lookup performance of row‑version MVCC, while keeping storage overhead manageable.

Performance Implications

Write amplification reduction

In a classic B‑tree, an update rewrites the leaf and possibly its ancestors, but the old pages are overwritten in place, which can cause write amplification on SSDs due to erase‑write cycles. COW B‑trees always write fresh pages, allowing the underlying storage to perform wear‑leveling naturally. The only extra writes are the new copies of the touched nodes, which is bounded by tree depth (usually ≤ 4).

Read‑only query speed on snapshots

Since the snapshot reuses the same balanced structure, range scans and point lookups execute with identical I/O patterns as on the live tree. Benchmarks on SQLite’s WAL mode (which uses a form of COW) show less than 2% latency increase for queries against a 30‑second-old snapshot compared to the current view.

Real‑world benchmarks

  • SQLite WAL – appends a new copy of each modified page to the write‑ahead log instead of overwriting it in place; a reader’s snapshot is its end mark in the WAL. Tests on a 100 GB dataset show snapshot creation in < 1 ms, with read throughput at 95% of live performance (SQLite docs).
  • RocksDB – strictly an LSM‑tree (memtable plus immutable SSTable files) rather than a B‑tree, but it relies on the same immutable‑write principle. Snapshots are cheap sequence‑number pointers; read latency on snapshots is within 1–3 µs of live reads (RocksDB documentation).
  • FoundationDB – uses a versioned B‑tree where each transaction writes new versions of affected nodes. Snapshots (called “read versions”) are obtained in O(1) and provide consistent reads across the cluster (FoundationDB architecture).

These numbers illustrate that the theoretical benefits translate into measurable performance gains in production systems.

Implementation Challenges

Garbage collection of obsolete pages

Because every write leaves a trail of old pages, the engine must track which snapshots are still active. A common strategy:

  1. Maintain a snapshot list with reference counts.
  2. When a snapshot is released, decrement counts.
  3. Run a background reclaimer that scans the page table, freeing pages whose version is older than the oldest live snapshot.

The reclaimer must be careful not to interfere with ongoing reads; many systems use epoch‑based reclamation to ensure safety.
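Steps 1–3 amount to a reference-counted snapshot registry. A minimal sketch (all names here are illustrative):

```python
class SnapshotRegistry:
    """Tracks which snapshot versions are still referenced by readers."""

    def __init__(self):
        self._refs = {}                 # snapshot version -> reference count

    def acquire(self, version):
        self._refs[version] = self._refs.get(version, 0) + 1

    def release(self, version):
        self._refs[version] -= 1
        if self._refs[version] == 0:
            del self._refs[version]     # last reader gone

    def oldest_live(self):
        """Bound below which the reclaimer may free superseded pages;
        None means no snapshot is held and everything dead is fair game."""
        return min(self._refs, default=None)

reg = SnapshotRegistry()
reg.acquire(10); reg.acquire(10); reg.acquire(12)
assert reg.oldest_live() == 10
reg.release(10); reg.release(10)        # both readers of v10 finish
assert reg.oldest_live() == 12          # the GC horizon advances
```

A production system would combine this counter with epoch-based reclamation so that a page is physically freed only after every thread has passed a quiescent point.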

Balancing depth vs fan‑out

Higher fan‑out reduces tree depth, thus fewer page copies per write, but larger pages can hurt cache locality on CPUs with small L1/L2 caches. Tunable parameters (page size, max keys per node) need profiling against the target workload and hardware.

Concurrency control

Even though nodes are immutable, multiple writers may contend for the same path. Engines typically serialize writes at the root level using a single writer lock or employ optimistic concurrency with compare‑and‑swap on the root pointer. For high‑throughput workloads, batching multiple mutations into a single COW transaction can amortize the root‑update cost.
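The optimistic variant can be modeled as follows. Python has no hardware compare-and-swap, so a lock stands in for the atomic instruction a real engine would use; the class and method names are ours:

```python
import threading

class AtomicRoot:
    """Models compare-and-swap on the root pointer."""

    def __init__(self, root):
        self._root = root
        self._lock = threading.Lock()

    def load(self):
        return self._root

    def compare_and_swap(self, expected, new):
        """Install `new` only if no other writer committed first."""
        with self._lock:
            if self._root is expected:
                self._root = new
                return True
            return False

root = AtomicRoot("root_v1")
seen = root.load()                       # writer starts from this version
assert root.compare_and_swap(seen, "root_v2")      # first writer wins
assert not root.compare_and_swap(seen, "root_v3")  # stale writer must retry
assert root.load() == "root_v2"
```

A writer that loses the race simply rebuilds its COW path on top of the new root and retries, which is cheap because only the touched path must be re-cloned.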

Key Takeaways

  • Copy‑on‑write turns snapshots into a cheap pointer operation, eliminating the need for heavy data copying or log replay.
  • B‑tree balance and fan‑out remain intact, so read performance on snapshots matches that of the live database.
  • Write amplification is bounded by tree depth, making COW B‑trees SSD‑friendly.
  • Garbage collection is essential; without it, space usage can grow unchecked.
  • Real‑world engines (SQLite, RocksDB, FoundationDB) demonstrate that COW B‑trees provide sub‑millisecond snapshot creation and near‑identical read latency on snapshots.

Further Reading