TL;DR — Copy‑on‑write (COW) allows a database to keep a single in‑memory representation of data that is read by many transactions. When a transaction modifies a page, only that page is duplicated, so the overall memory footprint stays low while snapshots and concurrency stay fast.
Databases have to juggle three competing goals: low memory consumption, high read‑write concurrency, and the ability to take consistent snapshots for backups or analytical queries. Traditional approaches either copy whole data structures eagerly—wasting RAM—or lock large portions of memory, throttling throughput. Copy‑on‑write (COW) offers a middle ground: share pages read‑only across transactions and materialize a private copy only when a write is needed. This article explains the mechanics of COW, walks through real‑world implementations, and shows how you can measure and tune its impact on memory usage.
Understanding Copy‑On‑Write
The basic idea
At its core, COW is a lazy duplication strategy. Imagine a database page (a fixed‑size block, typically 8 KB) that holds rows or index entries. When a transaction reads the page, it receives a reference to the shared version. The page is marked read‑only in the process’s page table, so the operating system can safely let many threads map the same physical memory.
When a transaction tries to modify the page, the write is trapped: either the CPU raises a page fault because the mapping is read‑only, or the engine's buffer manager notices a shared‑page flag in user space. The memory manager then allocates a fresh page, copies the original contents, and redirects the writing transaction's page reference (or page‑table entry) to the new copy. From this point forward, the transaction works on its private copy while all other readers continue to see the untouched original.
This pattern is sketched below in runnable Python; `shared_pages` is the global directory of read‑only pages, and each transaction carries its own private copies:

```python
# Shared, read-only pages keyed by page ID; each transaction keeps
# private copies (tx.private_pages) of the pages it has modified.
shared_pages = {}

def read_page(tx, page_id):
    # A writer sees its own private copy; everyone else shares the original.
    return tx.private_pages.get(page_id, shared_pages[page_id])

def write_page(tx, page_id, modifications):
    if page_id not in tx.private_pages:
        # First write to this page: duplicate it lazily (the COW step).
        tx.private_pages[page_id] = bytearray(shared_pages[page_id])
    apply_modifications(tx.private_pages[page_id], modifications)

def apply_modifications(page, modifications):
    # Each modification is an (offset, byte_value) pair.
    for offset, value in modifications:
        page[offset] = value
```
The key point is that only the pages that actually change are copied, while the rest stay shared. In a workload where writes touch a small fraction of the dataset, memory savings can be dramatic.
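To make the sharing concrete, here is a hypothetical usage of the sketch above (`Tx` is a stand‑in for a transaction handle, invented for illustration):

```python
class Tx:
    """Hypothetical transaction handle holding private page copies."""
    def __init__(self):
        self.private_pages = {}

shared_pages[0] = bytes(8192)        # one shared, read-only 8 KB page

t1, t2 = Tx(), Tx()
write_page(t1, 0, [(0, 0xFF)])       # t1's first write COWs page 0

assert read_page(t1, 0)[0] == 0xFF   # t1 sees its private copy
assert read_page(t2, 0)[0] == 0x00   # t2 still shares the original
```

Only one 8 KB copy was made, no matter how many other transactions keep reading page 0.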
Historical context
COW is not new; it dates back to early operating systems, most famously Unix's fork(), where a child process initially shares its parent's address space and pages are copied only when either process writes to them. In databases, the concept became mainstream with Multi‑Version Concurrency Control (MVCC) in systems like PostgreSQL and Oracle: MVCC keeps multiple versions of each row, and the storage engine shares unchanged pages across versions instead of rewriting them. Modern key‑value stores (RocksDB, LMDB) and in‑memory systems also rely on COW: Redis leans on fork()'s COW semantics for background snapshots, and VoltDB uses a copy‑on‑write mode for consistent snapshots.
How COW Reduces Memory Footprint
Shared pages vs. eager copies
Consider a table with 10 million rows, each row occupying 100 bytes, for roughly 1 GB of data. If the database stored a full copy for every writing transaction, memory usage would grow in proportion to the number of concurrent writers. With COW, the baseline memory is roughly the size of the dataset plus a small overhead for page tables and metadata.
| Scenario (1 GB dataset) | Approx. memory |
|---|---|
| Eager copy per writer (5 writers) | 5 × 1 GB ≈ 5 GB |
| COW with 5 writers, 2 % of pages dirty overall | 1 GB + 0.02 × 1 GB ≈ 1.02 GB |
| COW with 20 writers, 1 % of pages dirty overall | 1 GB + 0.01 × 1 GB ≈ 1.01 GB |
The table shows that even with many writers, the additional memory stays bounded by the fraction of dirty pages, not by the number of writers.
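The arithmetic is easy to script. A back‑of‑the‑envelope helper, assuming the dirty fraction is measured across all writers combined (overlapping writes counted once, as in the table):

```python
GB = 1024 ** 3

def cow_memory_bytes(dataset_bytes, total_dirty_fraction):
    # Baseline dataset plus one private copy per dirty page.
    return dataset_bytes * (1 + total_dirty_fraction)

print(cow_memory_bytes(1 * GB, 0.02) / GB)  # 1.02, matching the table
print(cow_memory_bytes(1 * GB, 0.01) / GB)  # 1.01
```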
Page‑level granularity
Most modern databases allocate memory in fixed‑size pages (8 KB or 16 KB). COW operates at this granularity, which balances two competing concerns:
- Fine‑grained sharing – Small pages mean that a write affecting a single row only copies a tiny block.
- Management overhead – Too small a page size would increase page‑table entries and fragmentation.
Empirical tuning results for page‑based engines generally point to 8 KB pages as a good trade‑off for OLTP workloads, with memory reductions of up to roughly 70 % compared with eager copying.
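The copy‑cost side of that trade‑off is easy to see with a toy calculation, reusing the 100‑byte row from the earlier example and assuming each update dirties exactly one page:

```python
ROW_BYTES = 100
for page_kb in (4, 8, 16, 64):
    page_bytes = page_kb * 1024
    # Bytes duplicated by a single-row update at this page size.
    print(f"{page_kb:>2} KB page: {page_bytes} bytes copied "
          f"(~{page_bytes // ROW_BYTES}x the row size)")
```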
Snapshot isolation without extra buffers
When a database needs a consistent snapshot—for backup, analytics, or replication—it can simply keep a reference to the current shared pages. New writers will COW those pages, leaving the snapshot untouched. No extra buffer pool or copy‑on‑snapshot phase is required.
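Continuing the Python sketch from earlier, a snapshot can be as cheap as copying the page directory itself; `take_snapshot` is a hypothetical helper, not any engine's API:

```python
def take_snapshot():
    # Copy only the page *directory* (page ID -> page reference), not
    # the pages. When a writer later commits, it installs its COW'd
    # pages under the same IDs; this directory keeps referencing the
    # old page objects, so the snapshot never changes.
    return dict(shared_pages)
```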
For example, PostgreSQL's HOT (Heap‑Only Tuple) optimization stores an updated row version in the same page when space allows; when it cannot, the new version lands on another page while the original tuple stays in place for transactions holding older snapshots, the row‑level analogue of COW. The documentation explains this mechanism in detail: PostgreSQL MVCC Internals.
Implementations in Popular Databases
PostgreSQL
PostgreSQL's storage engine combines MVCC with COW‑style sharing. Each row version is stored as a tuple stamped with transaction IDs. The visibility map tracks which pages contain only all‑visible tuples, allowing the vacuum process to skip them without copying. When a transaction updates a row, PostgreSQL writes the new tuple version to a different page if the current page has no free space or if the update touches an indexed column (which breaks the HOT optimization). The original tuple remains visible to any transaction that started before the update, effectively a row‑level COW operation.
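A hedged illustration of how this behaves at the SQL level, using the third‑party psycopg2 driver and an `accounts` table that is assumed to exist: a REPEATABLE READ transaction keeps reading the tuple version that was current when its snapshot was taken.

```python
import psycopg2  # pip install psycopg2-binary

reader = psycopg2.connect(dbname="mydb")
writer = psycopg2.connect(dbname="mydb")
writer.autocommit = True
reader.set_session(isolation_level="REPEATABLE READ")

with reader.cursor() as cur:
    cur.execute("SELECT balance FROM accounts WHERE id = 1")
    before = cur.fetchone()  # the reader's snapshot is taken here

    # Another session updates the row: PostgreSQL writes a new tuple
    # version and leaves the old one in place for our snapshot.
    with writer.cursor() as wcur:
        wcur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 1")

    cur.execute("SELECT balance FROM accounts WHERE id = 1")
    assert cur.fetchone() == before  # still the pre-update version
reader.rollback()
```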
Key source: PostgreSQL Documentation – Concurrency Control.
SQLite (Write‑Ahead Logging)
SQLite's WAL mode keeps the main database file stable while writers append modified pages to a separate -wal file. Readers see a consistent snapshot built from the database file plus whatever WAL frames existed when their transaction began. When a checkpoint occurs, SQLite copies only the modified pages from the WAL back into the database file, a coarse‑grained form of COW that avoids rewriting untouched pages.
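The snapshot behavior is easy to observe from Python's standard‑library sqlite3 module; a minimal sketch, assuming write access to the current directory (isolation_level=None gives autocommit so transactions can be managed explicitly):

```python
import sqlite3

writer = sqlite3.connect("demo.db", isolation_level=None)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE IF NOT EXISTS t (k INTEGER PRIMARY KEY, v TEXT)")
writer.execute("INSERT OR REPLACE INTO t VALUES (1, 'old')")

reader = sqlite3.connect("demo.db", isolation_level=None)
reader.execute("BEGIN")  # the read snapshot starts at the first SELECT
old = reader.execute("SELECT v FROM t WHERE k = 1").fetchone()

writer.execute("UPDATE t SET v = 'new' WHERE k = 1")  # appended to the WAL

# The reader's snapshot still resolves to the pre-update page image.
assert reader.execute("SELECT v FROM t WHERE k = 1").fetchone() == old
reader.execute("COMMIT")
```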
Reference: SQLite WAL Mode.
RocksDB
RocksDB, a high‑performance key‑value store derived from LevelDB, applies the same principle at the memtable and SST‑file levels. Writes first land in a mutable memtable; once full, the memtable is frozen (made immutable) and shared read‑only across snapshots while it is flushed to disk. New writes go to a fresh memtable, leaving the immutable one untouched for readers: copy‑on‑write at memtable granularity.
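A toy Python rendition of the pattern (deliberately simplified, not RocksDB's actual API): writes land in a mutable dict, a flush freezes it, and reads consult tables newest‑to‑oldest.

```python
class ToyStore:
    def __init__(self):
        self.mutable = {}    # active memtable: the only thing written
        self.immutable = []  # frozen memtables, newest first

    def put(self, key, value):
        self.mutable[key] = value

    def flush(self):
        # Freeze the active memtable; from now on it is shared
        # read-only by every reader and snapshot referencing it.
        self.immutable.insert(0, self.mutable)
        self.mutable = {}

    def get(self, key):
        for table in (self.mutable, *self.immutable):
            if key in table:
                return table[key]
        return None

    def snapshot(self):
        # Pin the frozen tables; new writes go to a fresh mutable
        # memtable and can never disturb what the snapshot sees.
        self.flush()
        return list(self.immutable)
```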
Reference: RocksDB Architecture.
LMDB
LMDB (Lightning Memory‑Mapped Database) maps the entire database file into the process address space. It uses a single‑writer, multi‑reader model where the writer performs COW on the B‑tree nodes it modifies. Readers continue to see the old version of the tree until they finish their transaction, after which the old pages are reclaimed.
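With the third‑party py-lmdb binding the effect is directly observable; a sketch assuming `pip install lmdb`:

```python
import lmdb

env = lmdb.open("demo-lmdb", map_size=10 * 1024 * 1024)
with env.begin(write=True) as txn:
    txn.put(b"k", b"old")

read_txn = env.begin()          # pins the current B-tree root
with env.begin(write=True) as txn:
    txn.put(b"k", b"new")       # COWs the nodes on the path to "k"

assert read_txn.get(b"k") == b"old"  # the reader's tree is unchanged
read_txn.abort()                # releasing it lets old pages be reused
```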
Reference: LMDB Documentation.
Trade‑offs and Best Practices
When COW may not help
- Write‑heavy workloads with high page churn – If most pages are dirty in each transaction, the benefit shrinks because almost every page gets duplicated.
- Very large pages – Copying an 8 KB page is cheap, but copying a 1 MB page can dominate latency and memory.
- Limited address space – On 32‑bit systems, the number of simultaneously mapped pages may hit the virtual‑address ceiling, forcing the engine to evict pages more aggressively.
Configuring page size and thresholds
Most engines expose settings to tune COW behavior:
| Engine | Setting | Typical Values | Effect |
|---|---|---|---|
| PostgreSQL | block_size | 8 KB (default) | Sets heap/index page granularity; it is fixed at compile time, so finer COW granularity requires a custom build. |
| RocksDB | write_buffer_size | 64 MB – 256 MB | Larger buffers delay flushing, reducing the number of immutable memtables that must be COW‑shared. |
| SQLite | wal_autocheckpoint | 1000 pages | Controls how often dirty pages are checkpointed back into the main file. |
| LMDB | MDB_WRITEMAP flag | true/false | Writes go directly through a writable memory map instead of separate write buffers; usually faster and lighter on memory, at the cost of less protection against stray writes. |
A common rule of thumb is to start with the engine’s defaults, then monitor dirty‑page ratio (dirty pages / total pages). If the ratio stays below 5 %, you are likely in the sweet spot for COW.
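For example, SQLite's checkpoint interval can be adjusted per connection; a small sketch with the standard‑library sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect("demo.db")
conn.execute("PRAGMA journal_mode=WAL")
# Checkpoint after ~4000 dirty WAL pages instead of the default 1000:
# a larger WAL (more copied pages on disk) in exchange for fewer pauses.
conn.execute("PRAGMA wal_autocheckpoint=4000")
```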
Monitoring and debugging
- Page‑fault counters – On Linux, `perf stat -e page-faults` reveals how many COW events occur.
- Memory usage metrics – Tools like `htop`, `pmap`, or engine‑specific stats (`pg_stat_activity`, `rocksdb.stats`) show the resident set size (RSS) versus virtual memory size (VMS).
- Snapshot latency – Measure the time to start a consistent snapshot; with COW it should be near‑constant regardless of dataset size.
Example: Measuring COW impact in PostgreSQL
```bash
# Enable pg_stat_statements for detailed statistics
psql -c "CREATE EXTENSION pg_stat_statements;"

# Baseline memory usage: total RSS of all postgres processes, in KB
# (the [p]ostgres pattern keeps grep from matching itself)
ps aux | grep '[p]ostgres' | awk '{sum += $6} END {print sum}'

# Run a mixed read/write benchmark (e.g., pgbench)
pgbench -c 10 -j 2 -T 60 mydb

# After the benchmark, compare buffer-cache hits with block reads
psql -c "SELECT sum(blks_hit) AS blks_hit, sum(blks_read) AS blks_read FROM pg_stat_database;"
```

A high share of blks_hit relative to blks_read means most reads were served from already‑resident shared buffers, an indirect sign that readers are hitting shared pages rather than forcing fresh copies; write pressure can be gauged separately via pg_stat_bgwriter.
Measuring COW Benefits
Quantitative approach
- Baseline – Record RSS and VMS before starting the workload.
- Load – Run a representative mix of reads and writes for a fixed duration (e.g., 30 minutes).
- Collect – Capture RSS/VMS again, along with page‑fault counts.
- Calculate (a harness sketch follows this list) –
Memory growth = Post‑load RSS − Baseline RSS. With COW this should stay near dirty‑fraction × dataset size, far below the writers × dataset size an eager‑copy scheme would need.
COW efficiency = (number of page faults) / (number of write operations).
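A hypothetical harness tying the four steps together; psutil is a third‑party package, and the two counter callbacks are stand‑ins for whatever your engine exposes:

```python
import psutil  # pip install psutil

def rss_bytes(pid):
    return psutil.Process(pid).memory_info().rss

def measure_cow(pid, run_workload, count_writes, count_page_faults):
    baseline = rss_bytes(pid)               # 1. baseline
    faults0, writes0 = count_page_faults(), count_writes()
    run_workload()                          # 2. load
    growth = rss_bytes(pid) - baseline      # 3. collect
    faults = count_page_faults() - faults0  # 4. calculate
    writes = count_writes() - writes0
    return growth, faults / max(writes, 1)  # (memory growth, COW efficiency)
```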
Interpreting results
| COW Efficiency | Interpretation |
|---|---|
| > 0.8 | Nearly every write caused a copy; the workload is write‑heavy and COW offers little memory gain. |
| 0.2 – 0.8 | Moderate write activity; COW typically reduces memory by ~30 %–50 %. |
| < 0.2 | Writes are sparse; COW achieves > 70 % memory savings. |
If your efficiency is high, consider batching writes (so each page is copied once per transaction rather than once per statement) or increasing the page size to cut the number of faults, keeping in mind that each copy then moves more bytes; alternatively, evaluate whether a locking‑based isolation scheme that avoids copies altogether suits a write‑dominated workload better.
Key Takeaways
- Copy‑on‑write shares read‑only pages across transactions, duplicating only the pages that are actually modified.
- Memory overhead scales with the dirty‑page ratio, not with the number of concurrent writers, making COW ideal for read‑heavy workloads.
- Snapshots become cheap: a snapshot is just a reference to the current shared pages; new writes COW those pages automatically.
- Real‑world engines (PostgreSQL, SQLite, RocksDB, LMDB) already embed COW, but tuning page size, buffer thresholds, and checkpoint intervals can amplify benefits.
- Monitoring page‑fault rates and dirty‑page ratios lets you quantify COW’s impact and decide when to adjust configuration.
Further Reading
- PostgreSQL MVCC Internals – In‑depth explanation of PostgreSQL’s versioning and COW mechanisms.
- SQLite Write‑Ahead Logging (WAL) – Official documentation on SQLite’s snapshot‑friendly WAL mode.
- RocksDB Architecture Overview – Describes how RocksDB uses immutable memtables and COW for high‑throughput workloads.
- LMDB Documentation – Copy‑On‑Write B‑Tree – Details on LMDB’s single‑writer, multi‑reader model and page‑level COW.
- Operating Systems: Three Easy Pieces – Copy‑On‑Write – Chapter from the OSTEP textbook covering the fundamentals of COW in memory management.