TL;DR — Copy‑on‑write (CoW) B‑trees let you take point‑in‑time snapshots without blocking writes. By sharing unchanged nodes and only copying the path to modified data, you get O(log N) write overhead and O(1) snapshot cost, a pattern used by RocksDB, FoundationDB, and PostgreSQL’s MVCC.
Database snapshots are a cornerstone of modern data pipelines: they power point‑in‑time recovery, analytical off‑loading, and zero‑downtime schema migrations. Yet achieving low‑latency, low‑overhead snapshots on a write‑heavy workload is non‑trivial. This post unpacks the copy‑on‑write B‑tree—a data structure that makes snapshots cheap and safe—by walking through its core concepts, production‑grade architecture, and concrete patterns you can adopt today.
Background: Why Snapshots Matter in Production
- Disaster recovery – A snapshot taken before a risky deployment can be rolled back instantly.
- Analytics off‑loading – Long‑running analytical queries run against a stable view while the primary workload continues.
- Feature flag rollouts – Snapshots let you expose a new schema version to a subset of traffic without affecting the rest.
Traditional approaches (full database copy, logical dump, or lock‑step checkpointing) either stall writes or consume massive I/O. Copy‑on‑write B‑trees sidestep these problems by treating the index itself as the snapshot source.
Copy‑On‑Write B‑Tree Fundamentals
A B‑tree stores keys in sorted order across a hierarchy of pages (nodes). In a CoW variant:
- Immutable nodes – Once a page is written, it is never modified in place.
- Path copying – When inserting or deleting a key, only the nodes on the root‑to‑leaf path are duplicated; sibling sub‑trees are shared.
- Version pointers – Each snapshot holds a pointer to the root page that represents the logical view at snapshot time.
Example Walkthrough
Initial tree (root R0)
R0
/ \
L1 L2
Insert key K into leaf L1:
1. Allocate new leaf L1'
2. Copy contents of L1 into L1' and add K
3. Allocate new internal node R1 that points to L1' and unchanged L2
4. Snapshot S0 keeps pointer to R0; live view now uses R1
Only three pages were written, regardless of the total tree size. This constant‑time snapshot cost is the key performance win.
Architecture Overview
1. Storage Layer Integration
| Component | Responsibility | Example Implementations |
|---|---|---|
| Page Cache | Holds immutable pages in memory; reference‑counts them to know when they can be evicted. | RocksDB’s BlockCache, FoundationDB’s page cache |
| Write‑Ahead Log (WAL) | Persists new pages before they become visible; guarantees durability. | PostgreSQL WAL, Kafka log for change streams |
| Version Manager | Generates monotonically increasing snapshot IDs and maps them to root pointers. | SnapshotManager in RocksDB, VersionedTree in FoundationDB |
| Compaction Engine | Merges obsolete pages, rewrites them to reclaim space, and updates the version graph. | RocksDB’s CompactionJob, FoundationDB’s DataDistribution |
2. Data Flow Diagram
+-------------------+ +-------------------+ +-------------------+
| Client Writes | ----> | Write Path Copy | ----> | WAL Flush |
+-------------------+ +-------------------+ +-------------------+
|
v
+-------------------+
| New Root Pointer |
+-------------------+
|
v
+-------------------+
| Version Manager |
+-------------------+
|
v
+-------------------+
| Snapshot IDs |
+-------------------+
The diagram shows that every write creates a new root pointer; the version manager records it, and snapshots simply store that pointer.
3. Concurrency Model
- Read‑only transactions – Use the snapshot’s root pointer; they never block writers.
- Read‑write transactions – Acquire a write intent lock on the leaf pages they will modify; the lock is held only for the duration of the path copy, not for the whole transaction.
- Garbage collection – A background thread walks the version graph, decrements reference counts, and frees pages whose count reaches zero.
Patterns in Production
RocksDB (Log‑Structured Merge Tree with CoW B‑tree)
RocksDB builds a Log‑Structured Merge Tree (LSM) on top of a CoW B‑tree for its memtable and SSTable indexes. Snapshots are implemented by storing the current SequenceNumber and treating all entries ≤ that number as visible. The underlying B‑tree pages are immutable once flushed to disk, matching the CoW contract. See the official docs for details: RocksDB snapshots.
Pattern highlights
- Sequence‑based versioning – A single monotonically increasing number replaces explicit root pointers.
- Zero‑copy reads – Since SSTables are immutable, a read can map the file directly into memory.
- Compaction as GC – Periodic compaction rewrites overlapping SSTables, discarding pages that no longer belong to any live snapshot.
FoundationDB (Ordered Key‑Value Store)
FoundationDB stores data in a distributed B‑tree where each storage server maintains a local CoW B‑tree segment. Snapshots are exposed via read version numbers that map to root pointers on each server. The architecture is described in the FoundationDB Architecture whitepaper: FoundationDB documentation.
Pattern highlights
- Deterministic transaction ordering – Every transaction sees the same snapshot based on its read version.
- Fine‑grained version graph – Each server tracks a local version DAG, enabling per‑shard garbage collection.
- Atomic mutation operators –
add,bit_and, etc., are implemented as tiny CoW leaf updates, avoiding full page rewrites.
PostgreSQL MVCC (Multi‑Version Concurrency Control)
PostgreSQL’s heap files are not B‑trees, but its B‑tree indexes are built on a CoW principle: each index page is rewritten when a tuple becomes invisible. Snapshots correspond to transaction IDs (xmin). See the storage layout docs: PostgreSQL page layout.
Pattern highlights
- Transaction‑ID based visibility – Similar to sequence numbers, but tied to transaction lifecycle.
- HOT (Heap‑Only Tuple) optimization – Updates that stay on the same page avoid index changes, reducing CoW overhead.
- VACUUM as GC – Periodic vacuum reclaims dead tuples and index pages, analogous to compaction.
Performance Considerations
Write Amplification
- Path length – For a B‑tree of order m and N keys, depth ≈ logₘ N. Each write touches
depthpages. Choosing a higher fan‑out reduces depth but increases page size and cache pressure. - Split vs. Merge – Splits double the number of pages written for that operation; merges can be deferred to compaction to keep write latency low.
Read Amplification
- Cache locality – Since many snapshots share most pages, a well‑tuned LRU cache yields high hit rates. Reference counting helps the cache retain hot pages longer.
- Prefetching – When a query scans a range, prefetch the next leaf pages based on the immutable layout; no lock contention.
Space Overhead
- Snapshot churn – Frequent short‑lived snapshots increase the number of live root pointers, inflating reference counts. A common mitigation is to group snapshots (e.g., keep one per minute) and discard intermediate IDs.
- Compaction frequency – Aggressive compaction reduces space but consumes I/O; tune based on write throughput and latency SLAs.
Example Benchmark (RocksDB‑style)
| Workload | Avg Write Latency | Avg Read Latency | Snapshot Size Overhead |
|---|---|---|---|
| 10 M keys, 100 k ops/s, 5‑second snapshot interval | 1.2 ms | 0.8 ms | ~0.3 % of total DB size |
| 100 M keys, 1 M ops/s, 30‑second snapshot interval | 2.5 ms | 1.1 ms | ~1.2 % of total DB size |
These numbers illustrate that CoW B‑trees keep snapshot overhead near constant while scaling write throughput.
Implementation Tips (Code Samples)
Below is a minimal Python‑style illustration of a CoW B‑tree node and a write operation. In production you would replace the in‑memory structures with a page cache backed by a WAL and persistent storage.
from __future__ import annotations
from typing import List, Optional, Tuple
import uuid
# Simple immutable B‑tree node
class Node:
def __init__(self, keys: List[int], children: List["Node"], leaf: bool):
self.keys = tuple(keys) # immutable tuple
self.children = tuple(children) # immutable tuple
self.leaf = leaf
self.id = uuid.uuid4() # unique page identifier
def copy(self, keys: Optional[List[int]] = None,
children: Optional[List["Node"]] = None) -> "Node":
"""Create a new node with updated keys/children."""
return Node(
keys if keys is not None else list(self.keys),
children if children is not None else list(self.children),
self.leaf
)
# Version manager holds root pointers
class VersionManager:
def __init__(self, root: Node):
self.versions = {0: root}
self.current_version = 0
def snapshot(self) -> int:
self.current_version += 1
self.versions[self.current_version] = self.versions[self.current_version - 1]
return self.current_version
def set_root(self, new_root: Node):
self.versions[self.current_version] = new_root
# Insert routine that performs path copying
def insert(root: Node, key: int, order: int = 4) -> Node:
# Simplified split‑on‑full algorithm
if len(root.keys) == 2 * order - 1:
# Split root
left = Node(list(root.keys[:order-1]), list(root.children[:order]), root.leaf)
right = Node(list(root.keys[order:]), list(root.children[order+1:]), root.leaf)
new_root = Node([root.keys[order-1]], [left, right], leaf=False)
return insert_nonfull(new_root, key, order)
else:
return insert_nonfull(root, key, order)
def insert_nonfull(node: Node, key: int, order: int) -> Node:
if node.leaf:
# Insert key into leaf copy
new_keys = list(node.keys) + [key]
new_keys.sort()
return node.copy(keys=new_keys)
else:
# Find child to descend
idx = 0
while idx < len(node.keys) and key > node.keys[idx]:
idx += 1
child = node.children[idx]
# Recursively insert, getting a new child node
new_child = insert(child, key, order)
# Replace child in a new parent copy
new_children = list(node.children)
new_children[idx] = new_child
return node.copy(children=new_children)
# Demo
if __name__ == "__main__":
leaf = Node([], [], leaf=True)
vm = VersionManager(leaf)
root = vm.versions[0]
for k in [10, 20, 5, 15]:
root = insert(root, k)
vm.set_root(root)
snap_id = vm.snapshot()
print(f"Snapshot {snap_id} root id: {vm.versions[snap_id].id}")
Key takeaways from the sample
- Each
insertreturns a brand‑new root; the old root remains usable for any snapshot that references it. - The
VersionManagertracks root pointers without any mutable shared state. - In a real system you would write each new node to a WAL, assign a disk page number, and update reference counts.
Key Takeaways
- Copy‑on‑write B‑trees give O(1) snapshot creation by sharing unchanged nodes and only copying the write path.
- Production systems (RocksDB, FoundationDB, PostgreSQL) use variants of this pattern, often coupling it with sequence‑based versioning and background compaction.
- Write amplification stays bounded to the tree depth; tuning fan‑out and page size balances latency vs. cache pressure.
- Garbage collection is essential; reference‑counted pages and periodic compaction reclaim space without impacting live snapshots.
- Implementations should separate concerns: page cache, WAL, version manager, and compaction engine each have clear responsibilities and can be scaled independently.