TL;DR — Copy‑on‑write lets many workers read the same pages without copying them. When a write occurs, the kernel copies only the affected pages, keeping overall memory use low even under heavy parallel load.
In modern server‑side software, scaling to hundreds or thousands of concurrent tasks often means fighting memory pressure. Traditional approaches duplicate data structures for each worker, quickly exhausting RAM and causing costly page faults. Copy‑on‑write (COW) offers a middle ground: share read‑only pages across workers and clone only the fragments that really change. This article unpacks the mechanics, shows concrete code, and explains why COW is a cornerstone of memory‑efficient high‑concurrency design.
How Copy‑on‑Write Works
Memory Sharing Basics
At its core, COW is a lazy‑copy strategy. When a process duplicates itself (for example via fork) or several processes map the same region, the operating system maps the same physical pages into multiple virtual address spaces instead of copying them. All such mappings are marked read‑only, and the kernel tracks a reference count for each shared page.
If a worker attempts to write to a read‑only page, the CPU raises a page fault. The kernel intercepts this fault, allocates a fresh physical page, copies the original contents, updates the faulting process’s page table to point at the new page and marks it writable, then resumes execution. The original page stays untouched for the other workers.
This mechanism underpins the classic fork sequence used by Unix‑like systems, sketched here in C:
/* Fork a process – parent and child initially share the same memory pages. */
pid_t pid = fork();
/* The kernel marks writable pages read-only in both address spaces and
   increments their reference counters – no actual copying occurs here. */
Only when either process writes does the kernel perform a copy, keeping the overhead proportional to actual modifications rather than the size of the whole address space.
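The behavior is easy to observe from user space. Here is a minimal, self‑contained C sketch (Linux assumed; the 64 MiB buffer and 4 KiB page size are illustrative):

#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t len = 64 * 1024 * 1024;   /* 64 MiB ≈ 16,000 pages of 4 KiB */
    char *buf = malloc(len);
    memset(buf, 'A', len);           /* touch every page once */

    pid_t pid = fork();              /* no pages are copied here */
    if (pid == 0) {
        char first = buf[0];         /* reads keep all pages shared */
        buf[0] = 'B';                /* this write faults and copies exactly one page */
        _exit(first == 'A' ? 0 : 1);
    }
    waitpid(pid, NULL, 0);
    free(buf);
    return 0;
}

After the fork, inspecting /proc/<pid>/smaps shows the child’s resident memory almost entirely shared with the parent; only the single written page becomes private.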
Write Forking Mechanism
The fault‑handler logic can be expressed in pseudo‑C:
void handle_cow_fault(void *addr) {
    // 1. Locate the page table entry (PTE) for the faulting address
    pte_t *pte = walk_page_table(current->mm, addr);
    unsigned long old_pfn = pte->pfn;   // remember the shared page before remapping

    // 2. Allocate a new physical page
    void *new_page = alloc_page();

    // 3. Copy the contents from the shared page
    memcpy(new_page, phys_to_virt(old_pfn << PAGE_SHIFT), PAGE_SIZE);

    // 4. Update the PTE to point to the new page and make it writable
    pte->pfn = virt_to_phys(new_page) >> PAGE_SHIFT;
    pte->flags = PTE_PRESENT | PTE_WRITABLE | PTE_USER;

    // 5. Drop the refcount of the original page; the kernel frees it at zero
    put_page(old_pfn);
}
The kernel’s implementation is highly optimized (see the Linux source for do_page_fault), but the algorithmic essence remains simple: copy only on demand.
Benefits in High Concurrency
When hundreds of goroutines, actors, or threads operate on a shared dataset, COW delivers three measurable advantages.
Reduced Page Faults and Cache Pollution
Because most workers only read, the number of page faults stays near zero after the initial fork. Fewer faults mean less kernel‑mode traffic and fewer disruptions to CPU caches. Studies such as the Linux COW benchmark (see the kernel documentation) report up to a 70 % reduction in major faults under read‑heavy workloads.
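You can measure this on your own workload: getrusage(2) reports per‑process minor‑ and major‑fault counters. A small C helper (the label argument is just for readable output):

#include <stdio.h>
#include <sys/resource.h>

/* Print the calling process's fault counters; call once before and once
   after a workload to see how many pages were actually faulted in. */
void print_faults(const char *label) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%s: minor faults=%ld, major faults=%ld\n",
           label, ru.ru_minflt, ru.ru_majflt);
}

Comparing the counters before and after a COW fork makes the “copy only on write” behavior directly visible.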
Lower Garbage‑Collection Pressure
Languages with tracing garbage collectors (e.g., Go, Java) allocate objects on the heap. When many workers duplicate large structures, the GC must scan and compact a larger heap, increasing pause times. COW lets the runtime allocate a single immutable backbone and create new objects only for the changed parts, shrinking the live set. The Go runtime blog notes that using sync.Pool together with COW “can halve the GC overhead for read‑dominant services” (Go blog).
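The shared‑backbone idea works without a GC, too. Below is a minimal refcounted COW buffer in C (a sketch with illustrative names, not a real runtime API) that clones data only when a writer is not the sole owner:

#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Immutable backbone shared by all workers, guarded by a refcount. */
typedef struct {
    atomic_int refs;
    size_t len;
    char data[];              /* flexible array member */
} shared_buf;

shared_buf *buf_new(const char *src, size_t len) {
    shared_buf *b = malloc(sizeof *b + len);
    atomic_init(&b->refs, 1);
    b->len = len;
    memcpy(b->data, src, len);
    return b;
}

shared_buf *buf_retain(shared_buf *b) { atomic_fetch_add(&b->refs, 1); return b; }

void buf_release(shared_buf *b) {
    if (atomic_fetch_sub(&b->refs, 1) == 1) free(b);
}

/* Copy-on-write: mutate in place when we are the sole owner,
   otherwise clone first (the same logic as Rust's Rc::make_mut). */
shared_buf *buf_set_byte(shared_buf *b, size_t i, char c) {
    if (atomic_load(&b->refs) == 1) {
        b->data[i] = c;
        return b;
    }
    shared_buf *own = buf_new(b->data, b->len);
    own->data[i] = c;
    buf_release(b);
    return own;
}

Readers wrap access in buf_retain/buf_release; only the first writer pays for a copy, and the backbone stays shared for everyone else.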
Faster Scaling and Predictable Latency
Since the memory footprint grows linearly with writes, not with workers, you can add more concurrency without a proportional RAM increase. This predictability is crucial for auto‑scaling in cloud environments where cost is tied to memory usage. As described in the paper “Scalable In‑Memory Analytics with Copy‑on‑Write” (VLDB 2023), a COW‑backed column store achieved 3× higher throughput at the same RAM budget compared to a naïve copy‑per‑thread design.
Implementation Patterns
Immutable Data Structures
Many functional languages (e.g., Clojure, Haskell) treat data as immutable by default, effectively using COW under the hood. In Rust, the Cow enum lets you store either a borrowed reference or an owned copy, switching lazily when mutation is required:
use std::borrow::Cow;

fn process<'a>(input: &'a str) -> Cow<'a, str> {
    if input.contains(' ') {
        // Need to modify – allocate a new owned String
        Cow::Owned(input.replace(' ', "_"))
    } else {
        // No change – keep the borrowed slice
        Cow::Borrowed(input)
    }
}
The type makes the cost explicit: an owned copy exists only on the Cow::Owned path. Callers holding a Cow can also call to_mut(), which clones the borrowed data only at the moment mutation is requested, mirroring kernel‑level COW semantics.
Fork‑Join Parallelism
In languages that support process forking (e.g., Python’s multiprocessing), you can exploit OS‑level COW to share large read‑only objects across workers:
import multiprocessing as mp

# Large read-only data (e.g., a big NumPy array)
shared_data = load_heavy_dataset()

def worker(task_id):
    # The child process sees the same physical pages without copying
    result = compute(task_id, shared_data)
    return result

if __name__ == '__main__':
    with mp.Pool() as pool:
        results = pool.map(worker, range(100))
The fork start method (the default on Linux; macOS defaults to spawn since Python 3.8) triggers COW, allowing dozens of processes to operate on the same dataset without exhausting RAM.
Database Snapshots
Modern databases apply the same idea to point‑in‑time snapshots. When a transaction starts, it captures a snapshot of the current database state; subsequent writes create new row or page versions rather than overwriting in place, leaving the snapshot’s view unchanged. PostgreSQL does this at the row level via MVCC (Multi‑Version Concurrency Control), and SQLite’s WAL mode keeps old pages readable for ongoing readers; in both cases the whole database file is never duplicated. The PostgreSQL docs explain this in depth: PostgreSQL MVCC.
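To make the mechanism concrete, here is a toy page‑map sketch in C (illustrative only; it is not how PostgreSQL or SQLite lay out storage, and a real engine would refcount pages rather than leak the old versions):

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define N_PAGES   1024

/* A "database" is just a table of page pointers. */
typedef struct {
    char *pages[N_PAGES];
} pagemap_t;

/* Taking a snapshot copies only the pointer table, never the pages. */
pagemap_t *take_snapshot(const pagemap_t *live) {
    pagemap_t *snap = malloc(sizeof *snap);
    memcpy(snap->pages, live->pages, sizeof snap->pages);
    return snap;
}

/* A write clones just the affected page; existing snapshots still
   point at the old version and are never disturbed. */
void cow_write(pagemap_t *live, int page_no, size_t offset, char byte) {
    char *fresh = malloc(PAGE_SIZE);
    memcpy(fresh, live->pages[page_no], PAGE_SIZE);
    fresh[offset] = byte;
    live->pages[page_no] = fresh;   /* old page survives in snapshots */
}

A snapshot costs one pointer‑table copy regardless of database size, and each transaction pays only for the pages it actually touches.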
Key Takeaways
- COW shares read‑only pages across workers, copying only the pages that are actually modified, which dramatically reduces memory duplication.
- Fewer page faults lead to lower kernel overhead and better CPU cache utilization, especially in read‑heavy workloads.
- Garbage‑collector pressure drops because immutable structures stay shared; only mutated fragments become new heap objects.
- Scalability becomes predictable: memory grows with the number of writes, not with the number of concurrent workers, enabling cost‑effective auto‑scaling.
- Practical patterns include immutable data structures (Cow in Rust), fork‑join parallelism (multiprocessing in Python), and database snapshotting (PostgreSQL MVCC).