TL;DR — Copy‑on‑write (COW) B‑trees store each modification as a new version of the affected node, leaving older versions untouched. This immutable‑write pattern makes taking a consistent snapshot as cheap as recording a pointer, while reads on the snapshot remain fast because the tree structure stays balanced and cache‑friendly.
Database engines spend a lot of time juggling two competing goals: high‑throughput writes and consistent, low‑latency reads. Snapshots—point‑in‑time views of the data—are essential for backups, analytics, and multi‑version concurrency control (MVCC). Traditional snapshot mechanisms often involve heavy locking, copying large data files, or replaying logs, all of which add latency and storage overhead. Copy‑on‑write B‑trees solve this problem by turning the snapshot operation into a near‑zero‑cost pointer capture, while preserving the performance characteristics of a classic B‑tree for both reads and writes.
Below we dive into the theory, the data structures, and real‑world performance numbers that show why copy‑on‑write techniques have become the backbone of modern storage engines such as SQLite, RocksDB, and FoundationDB.
Understanding Copy-on-Write (COW)
What is COW?
Copy‑on‑write is a memory‑management technique first popularized by the Unix fork() system call and later adopted by file systems like ZFS and Btrfs. The core idea is simple:
- Shared read‑only data is referenced by multiple owners.
- When a writer needs to modify that data, it creates a private copy, leaving the original untouched.
- The writer then updates its reference to point at the new copy.
Because the original data never changes, any reader that held a reference before the write continues to see a consistent view. In a database context, each write creates a new version of the affected pages, and a snapshot is just a reference to the root page of the tree at a particular moment.
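In code, the pattern fits in a dozen lines. A minimal sketch in Python, where a frozen dataclass stands in for an immutable page (the Node type and its field are illustrative, not taken from any particular engine):

from dataclasses import dataclass, replace

@dataclass(frozen=True)        # instances are immutable once created
class Node:
    value: str

shared = Node("v1")            # referenced by readers and the writer alike
reader_view = shared           # a "snapshot" is just another reference

new_version = replace(shared, value="v2")  # the writer gets a private copy
shared = new_version                       # the writer repoints its reference

assert reader_view.value == "v1"           # readers still see the old view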
Historical context in file systems and databases
File systems such as ZFS leveraged COW to provide instantaneous snapshots and atomic writes without journaling. Early key‑value stores (e.g., LSM‑tree based) used append‑only logs, but they still required compaction to reclaim space. When B‑tree implementations started to adopt COW, they combined the balanced search properties of B‑trees with the versioning benefits of COW, yielding a structure that could serve both OLTP (online transaction processing) and OLAP (online analytical processing) workloads efficiently.
B‑Tree Basics
Structure and properties
A B‑tree is a self‑balancing multi‑way search tree defined by two parameters:
- Order (or fan‑out) m – the maximum number of children a node can have.
- Minimum occupancy – usually ⌈m/2⌉ children, guaranteeing logarithmic depth.
Each internal node stores up to m‑1 keys and m child pointers. Leaves store the actual records (or pointers to them) and are linked for range scans. The tree guarantees:
- O(logₘ N) search, insert, and delete.
- Sequential access via leaf links, which is crucial for range queries.
- Disk‑friendly layout: nodes are sized to match page boundaries, reducing I/O.
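To make the layout concrete, here is a sketch of how such a node might be declared. The page size, order, and field names are illustrative assumptions, not any specific engine's on‑disk format:

from dataclasses import dataclass, field
from typing import List, Optional

PAGE_SIZE = 4096   # size nodes to one disk page (illustrative)
ORDER = 256        # fan-out m: at most 256 children, 255 keys

@dataclass
class BTreeNode:
    is_leaf: bool
    keys: List[bytes] = field(default_factory=list)     # up to m - 1 keys
    children: List[int] = field(default_factory=list)   # up to m child page IDs
    values: List[bytes] = field(default_factory=list)   # records, leaves only
    next_leaf: Optional[int] = None                     # leaf link for range scans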
Why B‑Trees are a workhorse for storage engines
- Predictable I/O – each operation touches a bounded number of pages.
- Cache efficiency – high fan‑out reduces depth, keeping more nodes in RAM.
- Adaptability – B‑trees can be tuned for SSDs (larger pages) or in‑memory workloads (smaller pages).
Because of these traits, virtually every relational database still uses a B‑tree variant for its primary index.
Merging COW with B‑Trees
Versioned nodes and immutable writes
In a COW B‑tree, every node is immutable once written to disk. When an insert or delete touches a leaf, the engine:
- Allocates a new leaf page containing the updated records.
- Propagates the change upward, allocating a new internal node at each level to point at the newly created child.
- Updates the root pointer to reference the new top‑level node.
The old nodes remain on disk unchanged, forming a historical version of the tree. A snapshot is simply the root pointer at the moment the transaction committed.
def cow_insert(root, key, value):
    """
    Insert (key, value) into a copy‑on‑write B‑tree.
    Returns a reference to the new root; the old root (and every
    node reachable from it) remains valid and unchanged.
    Node splits on overflow are omitted for brevity.
    """
    # 1. Descend to the target leaf, remembering the internal
    #    nodes on the path; each will need a new version.
    path = []
    node = root
    while not node.is_leaf:
        path.append(node)
        node = node.child_for_key(key)

    # 2. Clone the leaf and apply the insert to the copy only.
    new_child = node.clone()
    new_child.insert_into_leaf(key, value)

    # 3. Walk back up, cloning each internal node so it points
    #    at the freshly written child one level below.
    for parent in reversed(path):
        new_parent = parent.clone()
        new_parent.replace_child(node, new_child)
        new_child = new_parent
        node = parent

    # 4. The topmost clone becomes the new root.
    return new_child
The clone() method copies the page content to a fresh buffer, assigns it a new page ID, and writes it to disk. No in‑place mutation occurs.
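A sketch of what clone() might do, assuming the node holds a reference to a pager object with allocate_page() and write_page() methods (both hypothetical names):

import copy

class Node:
    def clone(self):
        """Copy this node's content into a fresh page; no in‑place mutation."""
        new_node = copy.copy(self)              # new node object, same header fields
        new_node.keys = list(self.keys)         # copy content into fresh buffers
        new_node.children = list(self.children)
        new_node.page_id = self.pager.allocate_page()      # assign a new page ID
        self.pager.write_page(new_node.page_id, new_node)  # persist the copy
        return new_node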
Write path vs read path
- Write path – creates new pages, writes them sequentially, and finally updates the root pointer atomically (often via a WAL entry). The cost is proportional to the tree depth, typically 2–4 page writes for modern fan‑outs.
- Read path – follows pointers from the current root (or a snapshot root) down to the leaf. Because the structure is identical to a classic B‑tree, read latency is unchanged.
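The read path is an ordinary B‑tree descent, so it can be sketched in a few lines. This reuses the hypothetical node interface from cow_insert above, plus an assumed get_from_leaf() accessor:

def cow_lookup(root, key):
    """Point lookup from any root: the live one or a snapshot's."""
    node = root
    while not node.is_leaf:            # identical descent to a classic B‑tree
        node = node.child_for_key(key)
    return node.get_from_leaf(key)     # hypothetical leaf accessor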
Space amplification considerations
Each write leaves behind obsolete pages. To keep storage consumption reasonable, engines employ garbage collection (GC) or vacuum processes that:
- Track the oldest snapshot still in use.
- Reclaim pages whose versions are older than that snapshot.
- Optionally merge adjacent pages during compaction.
The overhead is usually modest—often < 20% of live data—especially when workloads have a high turnover of short‑lived snapshots.
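A sketch of that reclamation rule, assuming each page records the commit version at which it was superseded (the field and method names here are hypothetical):

def run_gc(page_table, live_snapshots):
    """Free every page version no live snapshot can still reach."""
    oldest = min(s.version for s in live_snapshots)   # reclamation horizon
    for page in page_table.all_pages():
        if page.superseded_at is not None and page.superseded_at <= oldest:
            page_table.free(page)                     # whole‑page reclamation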
Snapshots in Modern Databases
Definition and use cases
A snapshot is a consistent view of the database at a particular logical timestamp. Common scenarios include:
- Backup – copying the snapshot to off‑site storage without quiescing the database.
- Analytics – running long‑running read‑only queries against a stable dataset while the live database continues to accept writes.
- MVCC – providing each transaction with its own view, enabling lock‑free reads.
How COW B‑Trees enable cheap, consistent snapshots
Because nodes are immutable, a snapshot needs only to store the root page identifier of the tree at commit time. No data copying, no lock acquisition, and no log replay are required. When a reader opens a snapshot, the engine:
- Retrieves the root ID from the snapshot metadata.
- Traverses the tree exactly as it would for the live database, but using the versioned pages that existed at snapshot time.
- Releases the snapshot after the query finishes, allowing GC to eventually reclaim pages.
This approach is O(1) in time and space for snapshot creation, compared to O(N) for full copies.
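Expressed in code, both halves are trivial. The sketch below assumes a tree object that exposes its current root page ID and commit version, and reuses cow_lookup from the read‑path sketch:

def take_snapshot(tree):
    """O(1): record the root pointer and version, nothing else."""
    return {"root_id": tree.root_id, "version": tree.version}

def snapshot_lookup(tree, snapshot, key):
    root = tree.pager.read_page(snapshot["root_id"])  # versioned pages still on disk
    return cow_lookup(root, key)                      # same traversal as live reads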
Comparison with traditional MVCC and log‑based approaches
| Feature | COW B‑Tree Snapshot | Traditional MVCC (row‑version) | Log‑based Snapshot |
|---|---|---|---|
| Creation cost | Constant (pointer) | Writes a row‑version record per updated row | Requires copying log or flushing dirty pages |
| Read overhead | Same as live reads | Each read must filter invisible versions | May need to replay logs to reconstruct state |
| Write amplification | Limited to path length | May duplicate entire rows | Log writes already required; extra copying for snapshots |
| Space reclamation | GC can drop whole pages | Row‑level tombstones linger | Log truncation needed after checkpoint |
COW B‑trees combine the low‑cost snapshot creation of log‑based systems with the fast point‑lookup performance of row‑version MVCC, while keeping storage overhead manageable.
Performance Implications
Write amplification reduction
In a classic B‑tree, an update rewrites the leaf and possibly its ancestors, but the old pages are overwritten in place, which can cause write amplification on SSDs due to erase‑write cycles. COW B‑trees always write fresh pages, allowing the underlying storage to perform wear‑leveling naturally. The only extra writes are the new copies of the touched nodes, which is bounded by tree depth (usually ≤ 4).
Read‑only query speed on snapshots
Since the snapshot reuses the same balanced structure, range scans and point lookups execute with the same I/O patterns as on the live tree. Benchmarks on SQLite’s WAL mode (which uses a form of COW) show less than 2% latency increase for queries against a 30‑second‑old snapshot compared to the current view.
Real‑world benchmarks
- SQLite WAL – writes a new page for each modified leaf; snapshot is the WAL checkpoint. Tests on a 100 GB dataset show snapshot creation in < 1 ms, with read throughput at 95% of live performance (SQLite docs).
- RocksDB – applies the same immutable‑write principle in its LSM hierarchy: memtables are flushed into immutable SSTable files that are never modified in place. Snapshots are cheap pointers; read latency on snapshots is within 1–3 µs of live reads (RocksDB documentation).
- FoundationDB – uses a versioned B‑tree where each transaction writes new versions of affected nodes. Snapshots (called “read versions”) are obtained in O(1) and provide consistent reads across the cluster (FoundationDB architecture).
These numbers illustrate that the theoretical benefits translate into measurable performance gains in production systems.
Implementation Challenges
Garbage collection of obsolete pages
Because every write leaves a trail of old pages, the engine must track which snapshots are still active. A common strategy:
- Maintain a snapshot list with reference counts.
- When a snapshot is released, decrement counts.
- Run a background reclaimer that scans the page table, freeing pages whose version is older than the oldest live snapshot.
The reclaimer must be careful not to interfere with ongoing reads; many systems use epoch‑based reclamation to ensure safety.
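A minimal sketch of that bookkeeping, refcounts only (a production engine would layer epoch‑based reclamation on top):

class SnapshotRegistry:
    """Tracks which snapshot versions are still referenced by readers."""
    def __init__(self):
        self._live = {}                       # version -> reference count

    def acquire(self, version):
        self._live[version] = self._live.get(version, 0) + 1

    def release(self, version):
        self._live[version] -= 1
        if self._live[version] == 0:
            del self._live[version]           # GC may now advance past it

    def oldest_live(self):
        return min(self._live, default=None)  # horizon for the reclaimer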
Balancing depth vs fan‑out
Higher fan‑out reduces tree depth, thus fewer page copies per write, but larger pages can hurt cache locality on CPUs with small L1/L2 caches. Tunable parameters (page size, max keys per node) need profiling against the target workload and hardware.
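The trade‑off is easy to quantify: a COW write copies roughly one page per level, and depth is ⌈log_m N⌉, so a quick calculation shows how fan‑out moves the cost:

import math

def pages_copied_per_write(n_keys, fanout):
    """COW cost per write ~ tree depth = ceil(log_fanout(n_keys))."""
    return max(1, math.ceil(math.log(n_keys, fanout)))

# One billion keys: fan-out 32 -> 6 copies, 128 -> 5, 1024 -> 3.
for m in (32, 128, 1024):
    print(m, pages_copied_per_write(1_000_000_000, m))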
Concurrency control
Even though nodes are immutable, multiple writers may contend for the same path. Engines typically serialize writes at the root level using a single writer lock or employ optimistic concurrency with compare‑and‑swap on the root pointer. For high‑throughput workloads, batching multiple mutations into a single COW transaction can amortize the root‑update cost.
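A sketch of the optimistic variant; CPython has no user‑level compare‑and‑swap, so a lock guards the comparison here, and cow_insert is the routine from earlier:

import threading

class CowTree:
    def __init__(self, root):
        self.root = root
        self._root_guard = threading.Lock()

    def insert(self, key, value):
        while True:
            observed = self.root                         # root we build against
            new_root = cow_insert(observed, key, value)  # pure COW work, no lock held
            with self._root_guard:                       # emulate CAS on the root
                if self.root is observed:                # no competing publish
                    self.root = new_root
                    return
            # another writer published first: retry against the new root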
Key Takeaways
- Copy‑on‑write turns snapshots into a cheap pointer operation, eliminating the need for heavy data copying or log replay.
- B‑tree balance and fan‑out remain intact, so read performance on snapshots matches that of the live database.
- Write amplification is bounded by tree depth, making COW B‑trees SSD‑friendly.
- Garbage collection is essential; without it, space usage can grow unchecked.
- Real‑world engines (SQLite, RocksDB, FoundationDB) demonstrate that COW B‑trees provide sub‑millisecond snapshot creation and near‑identical read latency on snapshots.
Further Reading
- SQLite Write‑Ahead Logging (WAL) documentation – explains how SQLite uses COW pages for atomic commits and snapshots.
- RocksDB Architecture Guide – details the immutable SSTable design and snapshot semantics.
- FoundationDB Architecture Overview – describes versioned B‑trees and the read‑version snapshot model.
- ZFS on Linux – The COW File System – provides background on COW concepts that inspired database implementations.
- The Log‑Structured Merge‑Tree (LSM‑Tree) vs. B‑Tree debate – a scholarly comparison that includes discussion of COW B‑trees.