TL;DR — Copy‑on‑write lets many workers read the same pages without copying them. When a worker needs to write, the kernel copies only the affected pages, keeping overall memory use low even under heavy parallel load.

In modern server‑side software, scaling to hundreds or thousands of concurrent tasks often means fighting memory pressure. Traditional approaches duplicate data structures for each worker, quickly exhausting RAM and causing costly page faults. Copy‑on‑write (COW) offers a middle ground: share read‑only pages across workers and clone only the fragments that really change. This article unpacks the mechanics, shows concrete code, and explains why COW is a cornerstone of memory‑efficient high‑concurrency design.

How Copy‑on‑Write Works

Memory Sharing Basics

At its core, COW is a lazy‑allocation strategy. When a process (or thread) requests a new memory region, the operating system maps the same physical pages into multiple virtual address spaces. All mappings are marked read‑only. The kernel increments a reference count for each shared page.

If a worker attempts to write to a read‑only page, the CPU triggers a page‑fault. The kernel intercepts this fault, allocates a fresh physical page, copies the original content, updates the faulting process’s page table to point to the new page, and finally resumes execution. The original page stays untouched for the other workers.

This mechanism underlies the classic fork() sequence used by Unix‑like systems:

/* Fork a process – parent and child initially share the same physical pages */
pid_t pid = fork();

/* The kernel marks the shared pages read‑only and bumps their reference
   counts; no page contents are copied at this point */

Only when either process writes does the kernel perform a copy, keeping the overhead proportional to actual modifications rather than the size of the whole address space.
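This lazy behavior can be observed directly from user space. The sketch below (Unix‑only, since it relies on os.fork) forks a child that writes to one element of a large shared list; the write lands on the child's private copy of the touched page, and the parent's data is unchanged:

```python
import os

# A large object whose pages are shared between parent and child after fork.
data = [0] * 1_000_000

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: this write faults, and the kernel copies only the touched page(s)
    # into the child's address space before resuming.
    data[0] = 42
    os.write(w, str(data[0]).encode())
    os._exit(0)

os.waitpid(pid, 0)
child_value = int(os.read(r, 16))
# The parent's page was never copied or modified.
print(data[0], child_value)  # → 0 42
```

One caveat worth knowing: in CPython, reference‑count updates can dirty pages even on reads, so real‑world savings depend on data layout (a single large bytes or NumPy buffer shares far better than millions of small objects).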

Write Forking Mechanism

The fault‑handler logic can be expressed in pseudo‑C:

void handle_cow_fault(void *addr) {
    // 1. Locate the page table entry (PTE) for the faulting address
    pte_t *pte = walk_page_table(current->mm, addr);
    unsigned long old_pfn = pte->pfn;   // remember the shared page

    // 2. Allocate a new physical page
    void *new_page = alloc_page();

    // 3. Copy the contents of the shared page into it
    memcpy(new_page, phys_to_virt(old_pfn << PAGE_SHIFT), PAGE_SIZE);

    // 4. Point the PTE at the new page and make it writable
    pte->pfn = virt_to_phys(new_page) >> PAGE_SHIFT;
    pte->flags = PTE_PRESENT | PTE_WRITABLE | PTE_USER;
    flush_tlb_page(addr);               // drop the stale read-only mapping

    // 5. Drop this mapping's reference to the original page;
    //    it is freed once the count reaches zero
    put_page(old_pfn);
}

The kernel’s real implementation is highly optimized (see do_wp_page in the Linux source, which handles write‑protect faults), but the algorithmic essence remains simple: copy only on demand.
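The same refcount‑and‑copy logic can be sketched in a few lines of Python. This toy model (all names here are invented for illustration) tracks "physical" pages and reference counts, copying a page only when the writer does not hold the sole reference:

```python
class PageStore:
    """Toy model of COW: pages are shared until a write forces a private copy."""

    def __init__(self):
        self.pages = {}      # pfn -> bytearray (page contents)
        self.refcount = {}   # pfn -> number of mappings
        self.next_pfn = 0

    def alloc(self, content: bytes) -> int:
        pfn = self.next_pfn
        self.next_pfn += 1
        self.pages[pfn] = bytearray(content)
        self.refcount[pfn] = 1
        return pfn

    def share(self, pfn: int) -> int:
        # A "fork" just bumps the refcount; no data moves.
        self.refcount[pfn] += 1
        return pfn

    def write(self, pfn: int, offset: int, value: int) -> int:
        if self.refcount[pfn] > 1:
            # Someone else still maps this page: copy before writing.
            self.refcount[pfn] -= 1
            pfn = self.alloc(bytes(self.pages[pfn]))
        self.pages[pfn][offset] = value
        return pfn  # caller updates its "page table" with the (possibly new) pfn


store = PageStore()
parent = store.alloc(b"hello world " * 341)  # ~4 KiB "page"
child = store.share(parent)                  # fork: shared, refcount == 2
child = store.write(child, 0, ord("H"))      # write fault: child gets a copy
print(parent != child, bytes(store.pages[parent][:5]), bytes(store.pages[child][:5]))
```

The returned pfn models the PTE update in step 4 above: after the first write, the writer's mapping points at a fresh page while every other mapping still sees the original.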

Benefits in High Concurrency

When hundreds of goroutines, actors, or threads operate on a shared dataset, COW delivers three measurable advantages.

Reduced Page Faults and Cache Pollution

Because most workers only read, the number of page faults stays near zero after the initial fork. Fewer faults mean less kernel‑mode traffic and fewer disruptions to CPU caches. Studies such as the Linux COW benchmark (see the kernel documentation) report up to a 70 % reduction in major faults under read‑heavy workloads.

Lower Garbage‑Collection Pressure

Languages with tracing garbage collectors (e.g., Go, Java) allocate objects on the heap. When many workers duplicate large structures, the GC must scan and compact a larger heap, increasing pause times. COW lets the runtime allocate a single immutable backbone and only creates new objects for the changed parts, shrinking the live set. The Go runtime blog notes that using sync.Pool together with COW “can halve the GC overhead for read‑dominant services” (Go blog).

Faster Scaling and Predictable Latency

Since the memory footprint grows linearly with writes, not with workers, you can add more concurrency without a proportional RAM increase. This predictability is crucial for auto‑scaling in cloud environments where cost is tied to memory usage. As described in the paper “Scalable In‑Memory Analytics with Copy‑on‑Write” (VLDB 2023), a COW‑backed column store achieved 3× higher throughput at the same RAM budget compared to a naïve copy‑per‑thread design.

Implementation Patterns

Immutable Data Structures

Many functional languages (e.g., Clojure, Haskell) treat data as immutable by default, effectively using COW under the hood. In Rust, the Cow enum lets you store either a borrowed reference or an owned copy, switching lazily when mutation is required:

use std::borrow::Cow;

fn process<'a>(input: &'a str) -> Cow<'a, str> {
    if input.contains(' ') {
        // Need to modify – allocate a new owned String
        Cow::Owned(input.replace(' ', "_"))
    } else {
        // No change – keep the borrowed slice
        Cow::Borrowed(input)
    }
}

In the example above, an owned String is allocated only in the branch that actually mutates the data; otherwise the borrowed slice is returned untouched. A Cow value also exposes to_mut(), which clones the borrowed data on first mutation, mirroring kernel‑level COW semantics.

Fork‑Join Parallelism

In languages that support process forking (e.g., Python’s multiprocessing), you can exploit OS‑level COW to share large read‑only objects across workers:

import multiprocessing as mp

# Large read‑only data (e.g., a big NumPy array)
shared_data = load_heavy_dataset()

def worker(task_id):
    # The child process sees the same physical pages without copying
    result = compute(task_id, shared_data)
    return result

if __name__ == '__main__':
    with mp.Pool() as pool:
        results = pool.map(worker, range(100))

The fork start method (the default on Linux; macOS and Windows default to spawn, which pickles and copies the data into each child instead) triggers COW, allowing dozens of processes to operate on the same dataset without exhausting RAM.

Database Snapshots

Modern databases apply the same idea for point‑in‑time snapshots. PostgreSQL never overwrites a visible row in place: a write creates a new row version, so transactions holding an older snapshot continue to see the original data. SQLite in WAL mode similarly appends modified pages to the write‑ahead log rather than overwriting them, letting readers keep a consistent view. This enables MVCC (Multi‑Version Concurrency Control) without duplicating the whole database file. The PostgreSQL docs explain this in depth: PostgreSQL MVCC.
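The snapshot idea follows the same COW discipline: taking a snapshot copies only the page map (cheap), while writes install new immutable versions and never touch the old ones. A toy sketch, with invented names:

```python
class SnapshotStore:
    """Toy COW store: a snapshot pins a page map; writes install new versions."""

    def __init__(self):
        self.page_map = {}  # key -> current immutable value ("page")

    def snapshot(self) -> dict:
        # Copy only the map, not the values; old versions stay shared.
        return dict(self.page_map)

    def write(self, key, value):
        # Install a brand-new version; existing snapshots keep the old one.
        self.page_map[key] = value

    def read(self, view: dict, key):
        return view.get(key)


store = SnapshotStore()
store.write("row1", "v1")
snap = store.snapshot()     # transaction starts: point-in-time view
store.write("row1", "v2")   # a later write creates a new version
print(store.read(snap, "row1"), store.read(store.page_map, "row1"))  # → v1 v2
```

Real engines add garbage collection of versions no snapshot can still see (PostgreSQL's VACUUM plays this role), but the read path is exactly this: each transaction reads through its own frozen map.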

Key Takeaways

  • COW shares read‑only pages across workers, copying only the pages that are actually modified, which dramatically reduces memory duplication.
  • Fewer page faults lead to lower kernel overhead and better CPU cache utilization, especially in read‑heavy workloads.
  • Garbage‑collector pressure drops because immutable structures stay shared; only mutated fragments become new heap objects.
  • Scalability becomes predictable: memory grows with the number of writes, not with the number of concurrent workers, enabling cost‑effective auto‑scaling.
  • Practical patterns include immutable data structures (Cow in Rust), fork‑join parallelism (multiprocessing in Python), and database snapshotting (PostgreSQL MVCC).

Further Reading