TL;DR — Copy‑on‑Write (COW) B‑trees duplicate only the modified path of a tree, leaving the rest untouched. This small write amplification lets a database capture a consistent view in a single atomic step, without pausing queries or flushing the whole store.
Databases have long needed a way to capture a point‑in‑time view of their data without blocking ongoing reads and writes. Traditional approaches—full file copies, lock‑step checkpoints, or write‑ahead logs—either stall traffic or consume massive I/O. Copy‑on‑Write B‑trees solve this by marrying two classic ideas: the balanced search structure of B‑trees and the immutable‑by‑default semantics of COW. The result is an “instant snapshot”: a read‑only view that appears the moment the write begins, with virtually no extra cost.
What Is a B‑Tree?
A B‑tree is a self‑balancing multi‑way search tree designed for block‑oriented storage (disk or SSD). Its key properties:
- Node degree – each node holds between t‑1 and 2t‑1 keys (where t is the minimum degree; the root alone may hold fewer).
- Height balance – all leaf nodes reside at the same depth, guaranteeing O(log N) lookup, insert, and delete.
- Cache‑friendly layout – a node’s size typically matches the storage page size (e.g., 4 KB), minimizing I/O per operation.
Because a B‑tree groups many keys per node, a single disk read can resolve large portions of a query. This makes B‑trees the default index structure in relational databases (PostgreSQL, MySQL), key‑value stores (RocksDB, LevelDB), and even file systems (Btrfs).
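To make the structure concrete, here is a minimal sketch of a B‑tree node and lookup. The names (`Node`, `search`) are illustrative, not taken from any particular engine, and the payload layout is simplified:

```python
from bisect import bisect_left

class Node:
    """A B-tree node: sorted keys plus, for internal nodes, child pointers."""
    def __init__(self, keys, values, children=None):
        self.keys = keys          # sorted list of keys
        self.values = values      # payloads aligned with keys
        self.children = children  # None for leaves; len(keys) + 1 for internal

def search(node, key):
    """Descend from the root; each node visited costs one page read."""
    while True:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        if node.children is None:   # reached a leaf without a match
            return None
        node = node.children[i]     # follow the child covering `key`
```

With page‑sized nodes holding hundreds of keys each, the `while` loop runs only a handful of times even for billions of entries.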
Copy‑on‑Write Fundamentals
Copy‑on‑Write (COW) is a memory‑management technique that treats data as immutable until a modification is required. The classic COW workflow:
```
read  -> return the existing page
write -> allocate a new page
         copy the original content
         apply the delta
         update the pointer atomically
```
The crucial benefit is that readers never see a partially updated page. When a writer finishes, it swaps a single pointer (or version tag) to make the new page visible. This atomic pointer swap is the heart of instant snapshots: the old version remains reachable for any reader that started before the swap.
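The workflow above can be sketched with a toy page store. All names here (`Store`, `read`, `write`) are illustrative; a real engine would operate on disk pages rather than dicts:

```python
import threading

class Store:
    """Toy COW page store: published pages are immutable; a single
    'current' pointer selects the visible version."""
    def __init__(self, page):
        self.pages = [page]            # every version ever written
        self.current = 0               # index of the visible page
        self._lock = threading.Lock()  # serializes writers only

    def read(self):
        # Readers just dereference the current pointer; no lock is needed
        # because published pages are never mutated in place.
        return self.pages[self.current]

    def write(self, delta):
        with self._lock:                        # one writer at a time
            new_page = dict(self.read())        # copy the original content
            new_page.update(delta)              # apply the delta to the copy
            self.pages.append(new_page)
            self.current = len(self.pages) - 1  # atomic pointer swap
```

A reader that grabbed a page before a `write` keeps a fully consistent view, because the writer only ever touches a fresh copy.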
COW is the backbone of many modern systems:
- ZFS and Btrfs file systems use COW to implement filesystem‑level snapshots.
- LMDB (Lightning Memory‑Mapped Database) stores the entire database in a single memory‑mapped region, relying on COW for transaction isolation.
- Git stores objects as immutable blobs, creating new objects only when content changes.
Merging COW with B‑Trees
When you combine COW with a B‑tree, you get a persistent data structure: each mutation yields a new root node while preserving the previous version. The algorithm works like this:
```python
def cow_insert(root, key, value):
    # 1. Clone the root; the rest of the path is cloned on the way down
    new_root = clone_node(root)
    node = new_root
    stack = []  # cloned ancestors of `node`, root first

    # 2. Descend the tree, cloning each child before stepping into it
    while not node.is_leaf():
        i = node.search_position(key)
        cloned_child = clone_node(node.children[i])
        node.children[i] = cloned_child
        stack.append(node)
        node = cloned_child

    # 3. Insert into the leaf (which is already a fresh copy)
    node.insert_into_leaf(key, value)

    # 4. Rebalance if needed, climbing back up the stack
    while stack and node.needs_split():
        parent = stack.pop()
        left, right, median = node.split()
        parent.replace_child(node, left)    # left half keeps node's slot
        parent.insert_child(median, right)  # median and right go beside it
        node = parent
    if node.needs_split():                  # the root itself overflowed
        left, right, median = node.split()
        new_root = make_root(median, left, right)  # tree grows one level
    return new_root
```
Key observations:
- Only the nodes on the search path are cloned; the rest of the tree stays shared between versions.
- The root pointer is the only global state that changes, making the snapshot operation a single atomic write.
- Because each node is page‑aligned, the clone operation is essentially a page copy—fast on modern storage devices.
When a transaction commits, the database writes the new root pointer to a metadata page (often called the “superblock”). All readers that started before this write continue to follow the old root, while new readers see the new root instantly. No lock is required beyond the atomic pointer store.
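The commit sequence can be demonstrated with a toy persistent tree. For brevity this uses an unbalanced binary tree in place of page‑sized B‑tree nodes, and all names (`Database`, `commit`) are illustrative:

```python
class Node:
    """Immutable node: every update creates new nodes along the search path."""
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def insert(node, key, value):
    """Path copying: return a new root, sharing all untouched subtrees."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)  # overwrite existing key

class Database:
    """The root pointer is the only mutable global state."""
    def __init__(self):
        self.root = None

    def commit(self, new_root):
        self.root = new_root  # the single atomic store that publishes a snapshot

db = Database()
root = None
for k, v in [(2, "b"), (1, "a"), (3, "c")]:
    root = insert(root, k, v)
db.commit(root)

snapshot = db.root                     # a reader pins the current root
db.commit(insert(db.root, 0, "zero"))  # a writer publishes a new version
# The untouched right subtree (key 3) is shared between both versions,
# while the pinned snapshot still lacks key 0.
```

Note that `db.root.right is snapshot.right`: the new version physically shares the subtree that the update never touched.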
Instant Snapshots in Practice
PostgreSQL’s MVCC vs. COW B‑Tree Snapshots
PostgreSQL implements Multi‑Version Concurrency Control (MVCC) by storing tuple versions inside heap pages, not by cloning the index structure. While MVCC gives each transaction a consistent view, creating a true point‑in‑time snapshot still requires a heavyweight checkpoint or a base backup.
Contrast this with RocksDB (which grew out of LevelDB). RocksDB is built on a Log‑Structured Merge‑tree (LSM‑tree), not a COW B‑tree, but it relies on the same copy‑on‑write principle: its on‑disk SST files are immutable, and a MANIFEST records which files make up the current version. When RocksDB compacts data, it writes new SST files and records the change in the manifest. The manifest update plays the role of the instant snapshot: any process that reads the manifest after the write sees the new version, while snapshots and iterators opened against the old version keep seeing the older state.
LMDB’s Zero‑Copy Snapshots
LMDB stores the entire B‑tree in a memory‑mapped file. Readers use the root pointer stored in the current meta page. When a writer commits, it writes the new root pointer into the older of LMDB's two alternating meta pages, stamped with a new transaction ID; readers then pick the meta page with the newest valid transaction ID. Because the data pages themselves are never overwritten (they’re COW), readers continue to access the old pages without any copy. The snapshot appears instantly, and the writer incurs only the cost of copying the modified pages (typically a few kilobytes).
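The dual meta‑page scheme can be sketched as follows. This is a deliberate simplification (real LMDB also validates and fsyncs each slot), and the class and field names are illustrative:

```python
class MetaPages:
    """Two alternating meta slots; the slot with the highest committed
    transaction ID is the live one (a simplified sketch of LMDB's scheme)."""
    def __init__(self, root):
        self.slots = [{"txn": 0, "root": root}, {"txn": -1, "root": None}]

    def current(self):
        # Readers pick the slot with the newest committed transaction.
        return max(self.slots, key=lambda m: m["txn"])

    def commit(self, new_root):
        cur = self.current()
        other = self.slots[1 - self.slots.index(cur)]
        # Overwrite the *older* slot; if we crash mid-write, the newer
        # slot is untouched and recovery simply falls back to it.
        other["txn"] = cur["txn"] + 1
        other["root"] = new_root
```

Because a commit only ever overwrites the slot that is already stale, a torn write can never destroy the last good root.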
Example: Creating a Consistent Backup
```sh
# Step 1: mdb_copy opens a read transaction, pinning the current root
mdb_copy /var/lib/mydb data.backup
# Step 2: data.backup now holds a point‑in‑time snapshot
```
The mdb_copy utility simply copies the data while the database may still be serving reads and writes. Because committed pages are never overwritten in place, every page reachable from the pinned root stays byte‑for‑byte stable for the duration of the copy.
Trade‑offs and Performance Considerations
| Aspect | Benefits of COW B‑Tree | Potential Downsides |
|---|---|---|
| Write Amplification | Only pages on the write path are duplicated; typical factor ≈ log N per update. | For bulk inserts, duplicated pages can accumulate, leading to higher storage usage if compaction is delayed. |
| Read Latency | Reads follow a stable tree; no lock contention. | When many versions coexist, caches hold pages from several versions at once, lowering hit rates and slightly increasing lookup cost. |
| Space Overhead | Versions share unchanged pages, so overhead is proportional to changed data. | Long‑running transactions can prevent reclamation of older pages, causing “version bloat”. |
| Garbage Collection | Simple reference‑counting or epoch‑based reclamation can free pages once no transaction holds them. | Implementing safe reclamation across threads/processes adds complexity (e.g., hazard pointers). |
| Crash Recovery | The atomic meta‑page swap is crash‑safe; on restart, the latest valid meta page is used. | Requires that meta‑page writes themselves be atomic and durable (e.g., two alternating meta pages, each synced before reuse); otherwise a power loss could corrupt the root pointer. |
Mitigating Version Bloat
Most production systems employ periodic checkpointing or compaction:
- Checkpoint – Write a new meta page that discards all but the latest root, then truncate older log segments.
- Compaction – Merge adjacent pages, rewrite them into fresh pages, and update pointers. This is similar to LSM tree compaction but works on B‑tree pages.
Both operations can be run in the background, preserving the instant‑snapshot property for client transactions.
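The epoch‑based reclamation mentioned in the trade‑off table can be sketched like this. It is an illustrative single‑process model; production systems must also survive crashes and track cross‑process readers:

```python
class EpochReclaimer:
    """Pages retired in epoch E are freed only once no reader that entered
    at epoch <= E remains active (an illustrative sketch)."""
    def __init__(self):
        self.epoch = 0
        self.active = {}   # reader id -> epoch pinned at entry
        self.retired = []  # (epoch, page) pairs awaiting reclamation

    def reader_enter(self, rid):
        self.active[rid] = self.epoch

    def reader_exit(self, rid):
        del self.active[rid]

    def retire(self, page):
        # A writer replaced `page`; remember when, so it can be freed later.
        self.retired.append((self.epoch, page))
        self.epoch += 1

    def reclaim(self):
        # Free only pages retired before the oldest active reader's epoch.
        oldest = min(self.active.values(), default=self.epoch + 1)
        freed = [p for e, p in self.retired if e < oldest]
        self.retired = [(e, p) for e, p in self.retired if e >= oldest]
        return freed
```

The sketch also shows the "version bloat" failure mode directly: as long as one reader stays registered, every page retired since it entered remains unreclaimable.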
Key Takeaways
- COW B‑trees duplicate only the modified path, making a snapshot as cheap as writing a single root pointer.
- Readers never block writers because they follow immutable pages, while writers work on fresh copies.
- Instant snapshots enable zero‑downtime backups, point‑in‑time recovery, and multi‑version reads without costly checkpoints.
- Performance trade‑offs include modest write amplification and potential version bloat, mitigated by background compaction and epoch‑based reclamation.
- Real‑world databases like RocksDB, LMDB, and ZFS already leverage this pattern, proving its scalability from embedded devices to petabyte‑scale storage clusters.