How Copy-on-Write Pages Prevent Fork from Exhausting Memory

TL;DR — When a process calls fork(), the kernel doesn’t duplicate every memory page. Instead, it marks pages as copy‑on‑write, letting parent and child share the same physical pages until one of them writes, at which point only that page is copied. This strategy reduces memory pressure dramatically and prevents the system from running out of RAM during massive forking workloads.

Forking is the classic way Unix‑like systems create a new process. The naïve mental model—“make a complete snapshot of the parent’s address space”—suggests an enormous memory cost that would quickly exhaust RAM on any non‑trivial program. The reality is far more clever: modern kernels use copy‑on‑write (COW) pages to share memory between the parent and child until a write occurs. Understanding how COW works, why it’s safe, and where its limits lie is essential for systems programmers, DevOps engineers, and anyone who worries about memory consumption in containerized or high‑concurrency environments.

The Fork System Call and Memory Semantics

Traditional Copy Semantics

Historically, the fork() primitive was defined to create a duplicate of the calling process. Early operating systems without virtual memory indeed copied the entire address space, a process that required linear time and memory proportional to the program size. The cost quickly became prohibitive as programs grew into megabytes and later gigabytes.

Why Naïve Copy Is Impractical

If a 1 GB application were forked on a system with 8 GB of RAM, a straightforward copy would consume another gigabyte instantly, regardless of whether the child ever modifies its data. Multiply this by dozens of workers in a web server, and you face immediate out‑of‑memory (OOM) conditions. Modern workloads—think micro‑service containers spawning child processes for request handling—rely on the kernel’s ability to keep memory usage modest.

Copy‑On‑Write Fundamentals

Page Tables and Reference Counting

Virtual memory abstracts physical RAM into fixed‑size pages (usually 4 KB). Each process has its own page table mapping virtual addresses to physical frames. When fork() is invoked, the kernel copies the page table entries rather than the underlying frames. It also increments a reference count on each physical page, indicating that more than one address space now points to it.
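To get a feel for why copying page table entries is so much cheaper than copying frames, here is a back‑of‑the‑envelope sketch in Python. It assumes 8‑byte leaf entries (as on x86‑64) and ignores the upper levels of the page table tree. The name pte_bytes is made up for this illustration.

```python
# Rough cost comparison: copying page-table entries vs. copying page contents.
# Assumptions (not from the article): 8-byte leaf entries, leaf level only.

PAGE_SIZE = 4 * 1024   # 4 KiB pages, as in the text
PTE_SIZE = 8           # bytes per leaf page-table entry (x86-64 assumption)


def pte_bytes(mapped_bytes: int) -> int:
    """Bytes of leaf page-table entries needed to map `mapped_bytes`."""
    pages = -(-mapped_bytes // PAGE_SIZE)   # ceiling division
    return pages * PTE_SIZE


if __name__ == "__main__":
    one_gib = 1 << 30
    print("pages mapped:    ", one_gib // PAGE_SIZE)
    print("PTE bytes copied:", pte_bytes(one_gib))
    print("data / PTE ratio:", one_gib // pte_bytes(one_gib))
```

Under these assumptions a 1 GiB mapping needs 262,144 entries, about 2 MiB of page table data, against 1 GiB of page contents.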

Triggering a COW Fault

After fork(), the page table entries for private writable pages are marked read‑only in both parent and child. If either process attempts to write to such a shared page, the processor raises a page fault because the entry is not writable. The kernel’s fault handler then:

Allocates a fresh physical page.
Copies the contents of the original page into the new one.
Updates the faulting process’s page table entry to point to the new page with write permissions.
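Decrements the reference count on the original page, which is freed once no address space references it any longer.

This sequence ensures that only the pages that actually change are duplicated, keeping total memory consumption close to the original program size plus the pages written after the fork. If the faulting process is already the only one still referencing the page, the kernel can skip the copy and simply restore write permission.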
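The fault‑handling steps above can be mimicked with a toy user‑space model. This is only a sketch, not kernel code: Frame, AddressSpace, and their methods are names invented for illustration, and a Python dict stands in for the page table.

```python
# Toy user-space model of copy-on-write. Not kernel code; all names are illustrative.

class Frame:
    """A physical page frame with a reference count."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refcount = 1


class AddressSpace:
    """Maps virtual page numbers to [frame, writable] entries."""

    def __init__(self):
        self.table = {}

    def map_page(self, vpn: int, data: bytes) -> None:
        self.table[vpn] = [Frame(data), True]

    def fork(self) -> "AddressSpace":
        child = AddressSpace()
        for vpn, entry in self.table.items():
            frame = entry[0]
            frame.refcount += 1               # one more address space points here
            entry[1] = False                  # write-protect the parent's entry
            child.table[vpn] = [frame, False] # child shares the same frame
        return child

    def read(self, vpn: int) -> bytes:
        return bytes(self.table[vpn][0].data)

    def write(self, vpn: int, data: bytes) -> None:
        entry = self.table[vpn]
        if not entry[1]:                      # simulated page fault: entry is read-only
            self._cow_fault(entry)
        entry[0].data[:len(data)] = data

    def _cow_fault(self, entry) -> None:
        old = entry[0]
        if old.refcount > 1:
            new = Frame(old.data)             # allocate a fresh frame and copy contents
            old.refcount -= 1                 # drop this address space's reference
            entry[0] = new                    # repoint the entry at the private copy
        entry[1] = True                       # entry is now writable (also the sole-owner case)


if __name__ == "__main__":
    parent = AddressSpace()
    parent.map_page(0, b"original")
    child = parent.fork()
    child.write(0, b"modified")
    print(parent.read(0), child.read(0))
```

Only the page that is written gets a new frame; every other entry in the child keeps pointing at the parent’s frame.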

/* Minimal example showing fork with COW semantics */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    static char buffer[4096] = "Initial data in a shared page.\n";
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: the write faults, and the kernel gives the child a private copy */
        strcpy(buffer, "Child has written to its own copy.\n");
        printf("Child says: %s", buffer);
    } else if (pid > 0) {
        /* Parent: wait for the child, then confirm our data is unchanged */
        waitpid(pid, NULL, 0);
        printf("Parent sees: %s", buffer);
    } else {
        perror("fork");
        return 1;
    }
    return 0;
}

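The static buffer resides in the program’s initialized data segment. The buffer itself is not guaranteed to be page‑aligned, so it may straddle two pages. After the fork, both processes map the same physical pages; the child’s write causes the kernel to copy only the page or pages being written, and the parent still sees the original contents.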
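The same behaviour can be observed from Python, the language used for the benchmark later in this article. The sketch below is an assumption‑laden illustration rather than a measurement tool: it uses a private anonymous mmap region as the shared memory, and fork_and_write is a hypothetical helper name.

```python
import mmap
import os


def fork_and_write(size=mmap.PAGESIZE):
    """Return (bytes the child saw after writing, bytes the parent sees afterwards)."""
    # Private anonymous mapping: the kind of memory fork() shares copy-on-write.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    buf[:7] = b"initial"

    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands in the child's own copy of the page.
        os.close(read_end)
        buf[:7] = b"changed"
        os.write(write_end, bytes(buf[:7]))
        os._exit(0)

    # Parent: collect the child's view, then read our own copy.
    os.close(write_end)
    child_view = os.read(read_end, 7)
    os.close(read_end)
    os.waitpid(pid, 0)
    parent_view = bytes(buf[:7])
    buf.close()
    return child_view, parent_view


if __name__ == "__main__":
    print(fork_and_write())
```

The pipe is used only to report the child’s view back to the parent; the mapped region itself is never shared for writing.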

Kernel Implementation Details

Linux’s `mm_struct` and `vm_area_struct`

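In Linux, each process’s memory layout is described by an mm_struct. It tracks a set of vm_area_struct (VMA) objects, each representing a contiguous virtual region with the same permissions. During fork(), the kernel calls copy_mm(), which (for an ordinary fork that does not share the address space):

Allocates a new mm_struct for the child and duplicates the parent’s VMAs into it.
Copies the page table entries for each VMA and takes an additional reference on every underlying struct page.
Write‑protects the page table entries of private writable mappings (VM_MAYWRITE set, VM_SHARED clear) in both parent and child. The VMA flags themselves stay unchanged; only the page table entries become read‑only.

Reference counting lives in struct page. The counter field was named _count in older kernels and _refcount in newer ones, and it is updated with atomic operations.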

Handling Write Faults (`do_page_fault`)

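When a write fault occurs, the architecture‑specific entry point (named do_page_fault() on several architectures) looks up the VMA covering the faulting address and hands the fault to the generic handle_mm_fault(). If the mapping is copy‑on‑write (VM_MAYWRITE set and VM_SHARED clear) and the page is present but write‑protected, the write is resolved by do_wp_page(). The related do_cow_fault() covers a different case: the first write to a page of a private file‑backed mapping that has not been faulted in yet. Either way, the work matches the steps outlined earlier: allocate, copy, update page tables, and adjust reference counts.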
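The test the kernel applies to decide whether a mapping is copy‑on‑write is a simple bit check: the mapping must be private, and it must be allowed to become writable. The Python sketch below mirrors that predicate. The flag values are the ones found in include/linux/mm.h at the time of writing and should be treated as an assumption for any particular kernel version.

```python
# Flag values as found in the kernel's include/linux/mm.h (assumed, version-dependent).
VM_WRITE = 0x02
VM_SHARED = 0x08
VM_MAYWRITE = 0x20


def is_cow_mapping(vm_flags: int) -> bool:
    """Private mapping that may become writable: VM_MAYWRITE set, VM_SHARED clear."""
    return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE


if __name__ == "__main__":
    private_writable = VM_WRITE | VM_MAYWRITE
    shared_writable = VM_WRITE | VM_MAYWRITE | VM_SHARED
    print(is_cow_mapping(private_writable), is_cow_mapping(shared_writable))
```

A shared mapping is never copy‑on‑write: writes to it are meant to be visible to every process that maps it.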

For deeper insight, the Linux kernel documentation describes this flow in the Memory Management section: see the official Linux Kernel Documentation.

Performance and Memory Savings

Real‑World Benchmarks

A practical way to observe COW in action is to compare memory usage before and after a fork() on a large process. Consider a Python one‑liner that allocates a 500 MB bytearray:

$ python3 -c "a = bytearray(500_000_000); import time, os; time.sleep(1); os.fork(); time.sleep(5)"

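While the script runs, inspect /proc/<pid>/smaps for both parent and child. Each process still reports the full buffer in its Rss (Resident Set Size), because Rss counts shared pages in every process that maps them, but physical memory use does not double. Assuming the buffer’s pages were touched before the fork, they show up under Shared_Dirty (Shared_Clean covers unmodified file‑backed pages such as program text), Pss for the buffer is roughly half of Rss in each process, and Private_Dirty for that region stays near zero until one of the processes writes to it.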
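Reading smaps by eye gets tedious for large processes, so a small summing script helps. The following is a minimal sketch; sum_smaps and smaps_totals are hypothetical helper names, and the parser assumes the usual "Field:   value kB" layout of smaps lines.

```python
FIELDS = ("Rss", "Pss", "Shared_Clean", "Shared_Dirty",
          "Private_Clean", "Private_Dirty")


def sum_smaps(text: str) -> dict:
    """Sum selected per-mapping fields (in kB) across an smaps dump."""
    totals = dict.fromkeys(FIELDS, 0)
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in totals:
            totals[key] += int(rest.split()[0])
    return totals


def smaps_totals(pid) -> dict:
    """Read /proc/<pid>/smaps and return the summed fields."""
    with open(f"/proc/{pid}/smaps") as f:
        return sum_smaps(f.read())


if __name__ == "__main__":
    try:
        for name, kb in smaps_totals("self").items():
            print(f"{name:14} {kb} kB")
    except OSError as exc:
        print("smaps not available:", exc)
```

Running smaps_totals() against the parent and child PIDs before and after one of them writes to the buffer shows pages moving from the Shared_* fields to Private_Dirty.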

A study by Red Hat measured that for a typical web server spawning 100 workers, COW reduced total memory consumption by ≈ 85 % compared with a full copy model. The exact numbers vary with workload, but the trend is consistent: the more read‑only the code and data, the greater the savings.
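How the savings depend on write behaviour can be illustrated with a back‑of‑the‑envelope model. This is not the methodology behind the figure quoted above, just a simplified sketch: it assumes one parent plus a number of identical workers, each of which dirties the same fraction of its pages, and cow_savings is a name invented here.

```python
def cow_savings(workers: int, dirty_fraction: float) -> float:
    """Fraction of memory saved by COW relative to eagerly copying every page.

    Model: a parent of size S plus `workers` children, each dirtying
    `dirty_fraction` of its pages. S cancels out of the ratio.
    """
    full_copy = workers + 1                    # (workers + 1) * S
    with_cow = 1 + workers * dirty_fraction    # S + workers * dirty_fraction * S
    return 1 - with_cow / full_copy


if __name__ == "__main__":
    for d in (0.0, 0.1, 0.5, 1.0):
        print(f"dirty fraction {d:.1f}: savings {cow_savings(100, d):.1%}")
```

In this model the savings fall linearly as the dirty fraction grows and vanish entirely when every worker rewrites all of its pages.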

Edge Cases and Limitations

Write‑intensive Workloads: If each child immediately writes to large data structures, COW degenerates to full copying, eroding the benefit.
Transparent Huge Pages (THP): When the kernel uses 2 MB huge pages, a single write can, depending on the kernel version, cause a whole huge page to be copied instead of one 4 KB page, potentially inflating memory usage and fault latency. Administrators may disable THP for latency‑sensitive forking workloads.
Memory Overcommit: Linux’s overcommit policy (/proc/sys/vm/overcommit_memory) interacts with COW. With aggressive overcommit, the kernel may allow many forks even if the potential memory usage exceeds RAM, relying on the expectation that not all pages will become dirty. With strict accounting (mode 2), fork() of a large process can instead fail with ENOMEM even though few pages would ever be copied.
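Both settings can be checked from user space before tuning a forking workload. The sketch below parses the two control files; selected_thp_mode and describe_overcommit are hypothetical helper names, and the THP path used in the demo (/sys/kernel/mm/transparent_hugepage/enabled) is the usual location but may be absent on some systems.

```python
def selected_thp_mode(text: str) -> str:
    """Return the bracketed choice from a line such as 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no selected mode in {text!r}")


OVERCOMMIT_MODES = {
    0: "heuristic overcommit",
    1: "always overcommit",
    2: "strict accounting (no overcommit)",
}


def describe_overcommit(text: str) -> str:
    """Translate the contents of /proc/sys/vm/overcommit_memory."""
    return OVERCOMMIT_MODES[int(text.strip())]


if __name__ == "__main__":
    checks = (
        ("/sys/kernel/mm/transparent_hugepage/enabled", selected_thp_mode),
        ("/proc/sys/vm/overcommit_memory", describe_overcommit),
    )
    for path, parse in checks:
        try:
            with open(path) as f:
                print(path, "->", parse(f.read()))
        except (OSError, ValueError, KeyError) as exc:
            print(path, "unavailable:", exc)
```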

Interaction with Modern Features

`vfork`, `posix_spawn`, and Threading

vfork() creates a child that borrows the parent’s address space. The parent is suspended until the child calls execve() or _exit(), and the child must not return from the function that called vfork(). Because the child runs in the same address space, COW is not involved: any write would affect the parent directly. Consequently, vfork() is safe only when the child does not modify memory.

posix_spawn() is a higher‑level API that combines process creation and program execution in one call, which lets the C library avoid a full fork(). On Linux, glibc (since version 2.24) and musl implement it with clone() using CLONE_VM and CLONE_VFORK, so the child borrows the parent’s address space and no pages need to be marked COW. Programs that call fork() directly, which is still the common pattern for worker pools, continue to rely on COW.
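Python exposes this call as os.posix_spawn(), which makes it easy to try. The sketch below is illustrative only: spawn_and_wait is a made‑up helper, and whether the call actually avoids fork() depends on the C library underneath the interpreter.

```python
import os
import sys


def spawn_and_wait(argv):
    """Start argv[0] via posix_spawn and return its exit code."""
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)    # killed by a signal


if __name__ == "__main__":
    code = spawn_and_wait([sys.executable, "-c", "print('hello from the child')"])
    print("child exit code:", code)
```

Unlike a fork() followed by exec, there is no window in which the child runs the parent’s code, so there is nothing for the child to dirty.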

Threaded programs complicate the picture because all threads share the same memory space. Forking a multithreaded process copies only the calling thread; the other threads disappear in the child. The kernel still marks pages COW, but developers must be careful with locks and async‑signal‑unsafe code after a fork.

Memory Cgroups and Overcommit

Container runtimes (Docker, Podman) place processes into cgroups, which enforce memory limits. COW works transparently with cgroups: the kernel charges each shared page to a single cgroup, so a parent and child in the same container do not pay twice for pages they share. However, if a child process in a container writes to many pages, the container may hit its memory quota faster than anticipated. The cgroup filesystem exposes a memory.stat file whose fields (rss, cache, and swap on cgroup v1; anon and file on cgroup v2) help track COW‑related pressure.
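A few lines of Python are enough to pull the relevant counters out of memory.stat. This is a minimal sketch: parse_memory_stat and anonymous_bytes are hypothetical names, and the location of the file depends on the cgroup version and on how the container runtime lays out its hierarchy, so the path is left to the caller.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup memory.stat file into ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def anonymous_bytes(stats: dict) -> int:
    """Anonymous memory: 'anon' on cgroup v2, 'rss' on cgroup v1."""
    if "anon" in stats:
        return stats["anon"]
    return stats["rss"]


if __name__ == "__main__":
    # Inline sample in cgroup v2 style; real values come from the memory.stat file.
    sample = "anon 104857600\nfile 52428800\n"
    stats = parse_memory_stat(sample)
    print("anonymous bytes:", anonymous_bytes(stats))
```

Pages copied by COW faults are anonymous memory, so a steadily rising anonymous counter after a burst of forks is a sign that children are dirtying pages they initially shared.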

Key Takeaways

Copy‑on‑write shares pages between parent and child after fork(), copying only when a write occurs.
Reference counting on physical pages prevents premature reclamation and enables safe sharing.
Memory savings are substantial for read‑heavy workloads; the benchmark cited above reports a reduction of roughly 85 % compared with naïve copying.
Edge cases such as write‑intensive children, huge pages, and aggressive overcommit can diminish benefits.
vfork() and posix_spawn() can sidestep COW by borrowing the parent’s address space, and container cgroups account for the pages that are shared; for plain fork(), the underlying mechanism remains the kernel’s page‑fault handler.
