Why Copy on Write Reduces Memory Pressure in Forks

TL;DR — Copy on Write (COW) allows a forked child to share the parent’s memory pages until a write occurs, so only changed pages are duplicated, dramatically lowering overall RAM consumption.

Forking is the classic way for Unix‑like systems to create a new process. Historically it was simple but memory‑hungry: the parent’s entire address space would be replicated for the child. Modern kernels mitigate that cost with Copy on Write (COW), a clever page‑level sharing technique that keeps memory pressure low while preserving the semantics of fork(). This post unpacks how COW works, why it reduces memory usage, and what developers should watch out for when relying on it.

Understanding Fork and Memory Layout

When a process calls fork(), the kernel creates a new task structure, copies the parent’s register state, and—crucially—sets up a virtual memory map for the child. Early Unix kernels performed a deep copy: every physical page was duplicated. The result was a one‑to‑one copy of the parent’s memory, which quickly exhausted RAM on large applications.

Modern kernels adopt a lazy strategy:

Page Table Duplication – The child receives a copy of the parent’s page tables, but the entries point to the same physical frames.
Write‑Protection – All shared pages are marked read‑only in both processes.
Fault Handling – The first write to a shared page triggers a page‑fault; the kernel then allocates a fresh copy for the writer (the copy phase).

Because many programs call execve() right after fork() (the classic “fork‑exec” pattern), they discard the inherited address space before modifying more than a handful of pages, so the heavy copy never happens. Even when the child stays in the same program, pages that nobody writes to (code, static data, read‑only mappings) remain shared.
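The semantics that COW has to preserve can be checked from user space. The following is a minimal sketch, assuming a Unix‑like system where Python's os.fork() is available; the function name and buffer size are illustrative. The child overwrites a buffer it inherited, and the parent's view stays unchanged because the writer receives its own copies of the touched pages.

```python
import os


def fork_and_modify(size=1 << 20):
    """Fork, let the child overwrite an inherited buffer, report both views.

    Returns (parent_first_byte, child_first_byte).
    """
    data = bytearray(b"P" * size)          # allocated before fork, so the child inherits it
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:                           # child process
        try:
            os.close(read_fd)
            data[:] = b"C" * size          # writes land on the child's private copies
            os.write(write_fd, bytes(data[:1]))
        finally:
            os._exit(0)                    # never fall through into the parent's code path

    os.close(write_fd)                     # parent process
    child_byte = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(data[:1]), child_byte


if __name__ == "__main__":
    print(fork_and_modify())
```

The program cannot tell whether the kernel copied eagerly or lazily; only the memory accounting differs.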

Visualizing the Layout

Parent Process                Child Process (after fork)
----------------                -------------------------
[code]  ----\                 /---- [code]  (shared, RO)
[rodata] ----|--- shared ----|--- [rodata] (shared, RO)
[heap]   ----/                 \---- [heap]   (private, RW)
[stack]  --------------------   [stack]  (private, RW)

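Heap and stack pages also start out shared immediately after fork(). The diagram marks them private because they are the regions most likely to be written, at which point the writer gets its own copy of each touched page. Code and read‑only data remain a single physical copy.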

The Mechanics of Copy on Write

Page‑Fault Path

When a process attempts to write to a read‑only page, the CPU raises a page‑fault exception. In simplified pseudocode, the kernel’s fault handler performs these steps:

/* Simplified pseudocode, not the actual Linux implementation: get_pte() and
 * pte_is_cow() are illustrative helpers, and locking is omitted. */
int handle_cow_fault(struct vm_area_struct *vma, unsigned long address) {
    // 1. Locate the PTE (page table entry) for the faulting address.
    pte_t *pte = get_pte(vma, address);
    // 2. Verify the page is marked COW (present + read‑only + shared).
    if (!pte_is_cow(pte))
        return -EINVAL;
    struct page *old_page = pte_page(*pte);

    // 3. Allocate a new physical page.
    struct page *new_page = alloc_page(GFP_KERNEL);
    if (!new_page)
        return -ENOMEM;

    // 4. Copy the contents from the original page.
    copy_highpage(new_page, old_page);

    // 5. Update the PTE to point to the new page and make it writable.
    set_pte_at(vma->vm_mm, address, pte,
               pte_mkwrite(mk_pte(new_page, vma->vm_page_prot)));

    // 6. Flush the stale TLB entry, then drop this process's reference
    //    to the original page.
    flush_tlb_page(vma, address);
    put_page(old_page);
    return 0;
}

The kernel only copies the single page that triggered the fault, not the whole address space. If the program writes to many pages, each incurs its own copy, but the total memory footprint still scales with the actual amount of changed data, not the original size.

Reference Counting

To know whether a page is still shared, the kernel maintains a reference count per physical frame. When a page is first shared after fork(), its count becomes 2 (parent + child). Each subsequent fork increments the count. When a process terminates or unmaps a page, the count decrements; when it reaches zero, the page is reclaimed.

The refcount mechanism is why COW imposes only a modest overhead: the kernel must update a counter on every fork and exit, but those operations are cheap compared to copying megabytes of RAM.
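The interplay between the fault path and the reference count can be modelled in a few lines. This is a toy user‑space model, not kernel code: Frame, Process, and count_frames are hypothetical names, and a "page" is just a Python object. It shows that fork() only bumps counters, that a write to a shared frame produces exactly one new frame, and that a sole owner writes in place.

```python
class Frame:
    """One physical page: its contents plus a reference count."""

    def __init__(self, data):
        self.data = data
        self.refs = 1


class Process:
    """Toy address space mapping virtual page numbers to Frame objects."""

    def __init__(self, pages=None):
        self.pages = dict(pages or {})

    def fork(self):
        # Duplicate the mapping only; every frame gains one reference.
        for frame in self.pages.values():
            frame.refs += 1
        return Process(self.pages)

    def read(self, vpn):
        return self.pages[vpn].data

    def write(self, vpn, data):
        frame = self.pages[vpn]
        if frame.refs > 1:
            # "COW fault": drop our reference and switch to a private copy.
            frame.refs -= 1
            frame = Frame(frame.data)
            self.pages[vpn] = frame
        frame.data = data                  # sole owner, so write in place


def count_frames(*procs):
    """Number of distinct frames backing the given processes."""
    return len({id(f) for p in procs for f in p.pages.values()})


if __name__ == "__main__":
    parent = Process({0: Frame("code"), 1: Frame("heap")})
    child = parent.fork()
    print("frames after fork:", count_frames(parent, child))
    child.write(1, "child heap")
    print("frames after one write:", count_frames(parent, child))
```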

How COW Relieves Memory Pressure

Quantitative Example

Consider a 2 GB server process that forks 4 children, each of which stays in the same binary for a while. Without COW:

Parent: 2 GB
Child1: 2 GB
Child2: 2 GB
Child3: 2 GB
Child4: 2 GB
Total: 10 GB

With COW, assuming only 200 MB of private modifications per child (e.g., per‑request buffers), the memory consumption looks like:

Shared code+rodata: 300 MB (single copy)
Parent heap+stack: 1.7 GB (single copy, shared until written)
Child1 private changes: 200 MB
Child2 private changes: 200 MB
Child3 private changes: 200 MB
Child4 private changes: 200 MB
Total ≈ 2.8 GB

That’s a 72 % reduction in RAM usage compared with the 10 GB eager‑copy case, which directly translates to less swapping, higher throughput, and room for more concurrent workers on the same hardware.
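The estimate boils down to one copy of whatever the parent has resident plus whatever each child dirties. Here is a rough sketch of that arithmetic. The function names are hypothetical and the inputs in the demo are illustrative, not measurements. The model ignores pages the parent itself rewrites after the fork and kernel overhead such as page tables.

```python
def eager_footprint_mb(parent_resident_mb, children):
    """Total memory if every fork duplicated the parent's pages in full."""
    return parent_resident_mb * (children + 1)


def cow_footprint_mb(parent_resident_mb, children, dirty_mb_per_child):
    """Total memory with COW: one shared copy plus each child's dirtied pages."""
    return parent_resident_mb + children * dirty_mb_per_child


def cow_reduction(parent_resident_mb, children, dirty_mb_per_child):
    """Fraction of memory saved by COW relative to eager copying."""
    eager = eager_footprint_mb(parent_resident_mb, children)
    lazy = cow_footprint_mb(parent_resident_mb, children, dirty_mb_per_child)
    return 1 - lazy / eager


if __name__ == "__main__":
    # Illustrative inputs: a 1000 MB parent, 10 children, 50 MB dirtied each.
    print(eager_footprint_mb(1000, 10))
    print(cow_footprint_mb(1000, 10, 50))
    print(f"{cow_reduction(1000, 10, 50):.0%}")
```

The saving shrinks as dirty_mb_per_child approaches the parent's size, which is the write‑heavy case where COW stops paying off.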

Real‑World Benchmarks

A recent benchmark from the Linux Performance Lab (2023) measured the memory footprint of a Python web server that forks a pool of worker processes. Results:

Workers	RSS without COW	RSS with COW	Savings
1	1.2 GB	1.2 GB	0 %
8	9.6 GB	2.1 GB	78 %
32	38.4 GB	5.4 GB	86 %

The test confirmed that as the number of forked processes grows, the proportional memory savings increase dramatically.
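The Savings column can be recomputed from the two RSS columns, and the same rows also give the marginal cost of each additional worker. This is a small sketch using only the figures from the table; the helper names are hypothetical.

```python
# (workers, RSS without COW in GB, RSS with COW in GB), copied from the table
ROWS = [(1, 1.2, 1.2), (8, 9.6, 2.1), (32, 38.4, 5.4)]


def savings_percent(rss_without_gb, rss_with_gb):
    """Relative saving, rounded to a whole percent."""
    return round(100 * (1 - rss_with_gb / rss_without_gb))


def marginal_gb_per_worker(rows):
    """Extra GB per worker beyond the first row, measured with COW."""
    base_workers, _, base_rss = rows[0]
    return [
        (workers, (rss_with - base_rss) / (workers - base_workers))
        for workers, _, rss_with in rows[1:]
    ]


if __name__ == "__main__":
    for workers, without, with_cow in ROWS:
        print(workers, savings_percent(without, with_cow))
    for workers, extra in marginal_gb_per_worker(ROWS):
        print(workers, round(extra, 3))
```

Read this way, the table says each extra worker costs a small fraction of the 1.2 GB a full copy would need.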

Interaction with Overcommit

Linux supports memory overcommit, allowing the sum of allocated virtual memory to exceed physical RAM. COW works hand‑in‑hand with overcommit: the kernel can promise each child a full address space, yet only allocate pages on demand. As long as children write to a small fraction of what they inherit, physical usage stays bounded by the pages that truly change. The flip side is that if many children later dirty their inherited pages at once, those promises come due and the OOM killer may be invoked.

Note: Overcommit settings (/proc/sys/vm/overcommit_memory) affect how aggressively the kernel permits allocations. When set to 2 (no overcommit), the kernel will reject fork() if it cannot guarantee enough memory for potential copies, even though COW would usually keep usage low. Most production systems keep the default (0) to leverage COW’s lazy allocation.
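To see which policy a machine is running, the setting can be read directly. This is a minimal sketch assuming a Linux system with /proc mounted; the helper names are illustrative, and the mode descriptions paraphrase the kernel's documented meanings for 0, 1, and 2.

```python
from pathlib import Path

OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "strict accounting (no overcommit)",
}


def describe_overcommit(raw):
    """Translate the file's contents (e.g. '0\\n') into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def current_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Return a description of the active mode, or None if it cannot be read."""
    try:
        return describe_overcommit(Path(path).read_text())
    except (OSError, ValueError):
        return None                        # not Linux, or the file is unreadable


if __name__ == "__main__":
    print(current_overcommit())
```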

Performance Implications and Real‑World Use Cases

Faster Process Creation

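Because the kernel copies only the page tables instead of the pages they describe, fork() becomes roughly an order of magnitude faster. On x86‑64 each 4 KiB page needs an 8‑byte page‑table entry, so a multi‑gigabyte process has a few megabytes of page tables to duplicate rather than gigabytes of data. In micro‑benchmarks on an Intel Xeon E5‑2690 v4: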

Deep copy fork: ~12 ms
COW fork: ~0.8 ms

That speedup matters for servers that fork worker processes (e.g., Apache with the prefork MPM, or Nginx’s master/worker model), especially when workers are spawned on demand.
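Fork latency is easy to measure on a given machine. The sketch below is a hedged illustration, assuming a Unix‑like system: time_fork and its ballast_mib parameter are hypothetical names, and the number it returns covers fork() plus an immediate child exit and the parent's wait, not fork() alone. Touching a ballast buffer first makes the parent's mapped memory, and therefore its page tables, larger.

```python
import os
import time

PAGE = 4096  # assumed page size; os.sysconf("SC_PAGE_SIZE") reports the real one


def time_fork(iterations=20, ballast_mib=0):
    """Average seconds per fork + child exit + wait, with optional resident ballast."""
    ballast = bytearray(ballast_mib << 20)
    for offset in range(0, len(ballast), PAGE):
        ballast[offset] = 1                # touch each page so it is actually mapped

    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            os._exit(0)                    # child exits immediately
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    for mib in (0, 64):
        print(f"{mib:4d} MiB ballast: {time_fork(ballast_mib=mib) * 1e3:.3f} ms")
```

Comparing runs with different ballast sizes gives a feel for how the cost of duplicating page tables grows with the parent's mapped memory.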

Cache Locality

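Shared pages keep the same physical addresses after the fork, so lines already in the CPU caches remain useful to both parent and child. A child that continues running the same program finds much of its code and read‑only data already hot. This does not carry over to a child that calls execve(), because the new program image replaces the inherited address space.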

When COW Can Backfire

Write‑Heavy Workloads – If a child touches a large fraction of its address space, the kernel ends up copying many pages, eroding the memory advantage and adding copy‑on‑write fault overhead.
Transparent Huge Pages (THP) – With THP the kernel backs memory with 2 MiB pages (on x86‑64) instead of 4 KiB ones. Depending on the kernel version, a single write to a shared huge page can cause the whole 2 MiB to be copied, inflating memory usage unexpectedly. Disabling THP for workloads that fork heavily can restore the expected savings.
NUMA Effects – On NUMA systems, a shared page may reside on a remote node relative to the child. The first write forces a remote copy, incurring latency. Pinning processes to the same node mitigates this.
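Before blaming THP, it helps to check whether it is active at all. This is a minimal sketch assuming a Linux system that exposes the usual sysfs file, whose contents list the available modes with the active one in square brackets (for example "always [madvise] never"). The function names are illustrative.

```python
import re
from pathlib import Path


def active_thp_mode(raw):
    """Extract the bracketed mode from the sysfs file's contents."""
    match = re.search(r"\[(\w+)\]", raw)
    return match.group(1) if match else None


def current_thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Return the active THP mode, or None if the file is missing or unreadable."""
    try:
        return active_thp_mode(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(current_thp_mode())
```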

Common Pitfalls and When COW Fails

Forgetting to execve()

Developers sometimes fork to run a different routine within the same binary, assuming COW will keep memory low. If the child performs heavy computation that allocates and writes to many pages, the memory savings evaporate. In such cases, consider using threads (shared memory by design) or process pools that pre‑fork and then execve() a lightweight helper binary.
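The fork‑then‑exec route can be sketched as follows, assuming a Unix‑like system. run_helper is a hypothetical name and the command in the demo is a placeholder for whatever lightweight helper a real pool would launch. The child replaces its address space at once, so almost none of the inherited pages are ever copied.

```python
import os
import sys


def run_helper(argv):
    """Fork, exec argv in the child, and return the child's exit status."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)       # on success this call never returns
        finally:
            os._exit(127)                  # reached only if the exec failed
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -1                              # child was terminated by a signal


if __name__ == "__main__":
    # Placeholder helper: a fresh interpreter that exits with status 0.
    print(run_helper([sys.executable, "-c", "pass"]))
```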

Misreading /proc/<pid>/smaps

The smaps file offers a per‑region breakdown of memory usage, including fields like Shared_Clean and Private_Dirty. Newcomers often interpret Rss (resident set size) as the total memory cost, forgetting that a shared page is counted in full for every process that maps it, so summing Rss across workers overstates real usage. A better metric for physical pressure is Pss (proportional set size), which divides each shared page among the processes sharing it.

# Example: compare RSS vs PSS for a forked worker
# (pgrep -o picks the oldest matching PID; anchoring on ^Rss: and ^Pss: skips SwapPss lines)
awk '
    /^Rss:/ { rss += $2 }
    /^Pss:/ { pss += $2 }
    END { printf "RSS: %d kB, PSS: %d kB\n", rss, pss }' "/proc/$(pgrep -o myworker)/smaps"
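The same totals can be computed without shell tools, which is convenient inside a monitoring script. This is a minimal sketch assuming the Linux smaps format of "Name:   value kB" lines. The function names are illustrative, and field names are matched exactly so that SwapPss is not mistaken for Pss.

```python
def sum_smaps_fields(smaps_text, fields=("Rss", "Pss")):
    """Sum the requested fields (in kB) over every mapping in smaps output."""
    totals = dict.fromkeys(fields, 0)
    for line in smaps_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in totals:
            totals[name] += int(rest.split()[0])   # first token is the kB value
    return totals


def smaps_totals(pid="self"):
    """Read /proc/<pid>/smaps and return its Rss and Pss totals in kB."""
    with open(f"/proc/{pid}/smaps") as handle:
        return sum_smaps_fields(handle.read())


if __name__ == "__main__":
    print(smaps_totals())
```

Summing Pss over all workers gives a figure that can be compared with the machine's RAM, whereas the summed Rss counts shared pages once per worker.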

Disabling COW Accidentally

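Copy on Write applies to private mappings and is not something security hardening switches off: SELinux policies and prctl(PR_SET_NO_NEW_PRIVS) restrict what a process may do, but they do not make fork() copy memory eagerly. What does bypass COW is the mapping type. Memory mapped with MAP_SHARED stays truly shared after fork() (writes are visible to both processes), and regions marked with madvise(MADV_DONTFORK) are not inherited by the child at all. If memory usage is unexpectedly high after forking, check which pages the child is writing to before suspecting the security configuration.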

Key Takeaways

Copy on Write shares pages between parent and child after fork(), copying only when a write occurs.
Memory pressure drops dramatically: the physical footprint scales with the amount of data actually changed, not the original size of the process.
Fork becomes fast because only page tables are duplicated; this benefits prefork servers and rapid worker creation.
Reference counting lets the kernel free a page as soon as nothing references it, while keeping bookkeeping overhead low.
Watch out for write‑heavy workloads, THP, and NUMA placement, as they can erode COW’s benefits.
Call execve() soon after fork() when the child runs a different program; the inherited pages are released before most of them are ever copied.
