TL;DR — Copy‑on‑Write lets the kernel share physical pages between processes until a write occurs, slashing memory use and speeding up forks, containers, and snapshots.

Linux workloads have grown from single‑process daemons to massive, multi‑tenant environments that spin up hundreds of processes per second. In that context, every megabyte of RAM saved translates into lower cloud costs, higher density, and better latency. The secret sauce behind many of those savings is Copy on Write (CoW) – a clever memory‑management trick that lets the kernel treat multiple virtual address spaces as if they own the same physical pages, only copying them when a write is required. This article dives deep into how CoW works, why it is so memory‑efficient, and where you’ll see it in action on a typical Linux system.

Understanding Copy on Write

Historical background

The concept of CoW predates Linux. Early virtual‑memory systems used page sharing to stretch scarce RAM, and the Unix fork() system call, introduced in the 1970s, became the technique's most famous beneficiary. The original fork() physically copied the parent's entire image – wasteful, because the typical use‑case was "fork‑then‑exec": create a child, replace its image with a new program, and discard the inherited address space almost immediately. BSD's vfork() was an early workaround, and the virtual‑memory Unixes of the 1980s (followed by Linux) made fork() itself copy‑on‑write, so the kernel could set up the child's page tables without physically copying every page.

Linux inherited that design and refined it over decades. Modern kernels combine CoW with other features such as memory overcommit, transparent huge pages, and user‑space file systems, making the technique a cornerstone of everything from container runtimes to snapshot‑based backup tools.

How the kernel implements CoW

At the heart of CoW are three kernel structures:

  1. Page Frame (struct page) – represents a physical page of RAM and holds a reference count (_refcount; called _count in older kernels).
  2. Page Table Entry (PTE) – maps a virtual address to a struct page. The PTE includes a write‑protect bit that tells the MMU whether the mapping is read‑only.
  3. VM Area (struct vm_area_struct) – groups contiguous virtual pages with the same permissions and backing object.

When a fork() occurs, the kernel:

  1. Clones the parent’s page tables – the child receives a copy of the parent’s page‑table hierarchy, but each PTE points to the same struct page as the parent.
  2. Marks private writable pages as read‑only – the kernel clears the write flag in both the parent's and the child's PTEs, so that whichever process writes first will fault.
  3. Increments the page‑frame reference count – the shared pages now have a count of at least two.

Only when either process attempts to write to a shared page does the CPU raise a page fault (because the PTE is read‑only). The kernel's fault handler – which for a write to a write‑protected mapping ends up in do_wp_page() – then:

  1. Allocates a new physical page.
  2. Copies the contents of the original page into the new one (hence “copy on write”).
  3. Updates the faulting process’s PTE to point to the new page and restores the write permission.
  4. Decrements the reference count of the original page. If the count drops to one, the last remaining mapper can simply be granted write access again on its next fault; once the count reaches zero, the page is reclaimed.

The same principle applies to memory‑mapped files (mmap) and private anonymous mappings (MAP_PRIVATE). In each case, the kernel lazily copies pages only when a write is observed.
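The lazy‑copy behaviour of MAP_PRIVATE can be observed from user space. The sketch below (Python, assuming a Linux host) maps a small file privately, writes through the mapping, and confirms that the file on disk never changes, because the write landed in a private CoW page:

```python
import mmap
import os
import tempfile

# Create a small file to map.
fd, path = tempfile.mkstemp()
os.write(fd, b"original")

# MAP_PRIVATE: writes land in private CoW copies, never in the file.
m = mmap.mmap(fd, 8, flags=mmap.MAP_PRIVATE)
m[0:8] = b"modified"          # the first write faults in a private page

mapped = m[0:8]               # the mapping sees the new data...
with open(path, "rb") as f:
    on_disk = f.read()        # ...but the file on disk is untouched

m.close()
os.close(fd)
os.unlink(path)
print(mapped, on_disk)        # b'modified' b'original'
```

The same divergence would be invisible with MAP_SHARED, where writes propagate back to the underlying file.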

Below is a minimal C example that demonstrates the kernel’s CoW behaviour after a fork():

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/types.h>

int main(void) {
    static int shared = 42;          // static ensures it lives in the data segment
    pid_t pid = fork();

    if (pid == 0) {                  // Child
        printf("Child sees shared = %d\n", shared);
        shared = 99;                 // Triggers CoW for this page
        printf("Child changed shared to %d\n", shared);
        _exit(0);
    } else if (pid > 0) {            // Parent
        wait(NULL);                  // Wait for child to finish
        printf("Parent still sees shared = %d\n", shared);
    } else {
        perror("fork");
        return 1;
    }
    return 0;
}

Running the program prints:

Child sees shared = 42
Child changed shared to 99
Parent still sees shared = 42

Even though shared resides in the same physical page for both processes initially, the child’s write forces a private copy, leaving the parent’s view untouched.

Memory Savings Mechanics

Page sharing and reference counting

Linux tracks how many users a physical page has via the _refcount field of struct page (_count in older kernels). A page shared between N processes has a reference count of at least N. As long as the count is greater than one, the kernel treats the page as a CoW candidate and duplicates it only when a write fault occurs.

The benefit is twofold:

  • Reduced RAM consumption – Instead of N copies of the same data, the system stores a single copy plus a small amount of bookkeeping.
  • Lower page‑fault overhead – The first write to a shared page incurs a copy, but subsequent writes to the private copy are normal. In many workloads, the majority of pages remain read‑only after the fork, so the fault penalty is paid only once per page.

When pages are duplicated

A page is duplicated only when a write attempt hits a read‑only PTE. The kernel can therefore predict the memory impact of a fork by examining which pages are likely to be written. Tools such as pmap and smem show the shared versus private memory of a process, letting administrators gauge the effectiveness of CoW in real time.

Consider a typical prefork web server that forks a worker process for each incoming connection. Most of the worker's address space consists of the executable code, shared libraries, and read‑only configuration data – all of which stay shared. Only a small per‑request working set (e.g., request buffers) is written, causing a handful of page copies. The private overhead per worker can be as low as 2–4 MiB, even when the binary and its libraries total tens of megabytes.
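The prefork pattern can be sketched in a few lines. In this hypothetical example, the parent builds a large read‑only dataset and forks workers that only read it; under CoW, all workers keep sharing the same physical pages backing DATA because nobody ever writes to them:

```python
import os

# The parent "preloads" 16 MiB of read-only data before forking.
DATA = b"\x01" * (16 * 1024 * 1024)
EXPECTED = len(DATA) // 4096                # one byte sampled per 4 KiB page

def handle_request() -> int:
    # Reading touches shared pages but never triggers a CoW copy.
    return sum(DATA[::4096])

pids = []
for _ in range(4):
    pid = os.fork()
    if pid == 0:                            # worker: read-only work, then exit
        os._exit(0 if handle_request() == EXPECTED else 1)
    pids.append(pid)

ok = all(os.WEXITSTATUS(os.waitpid(p, 0)[1]) == 0 for p in pids)
print("all workers exited cleanly:", ok)
```

Each worker's private memory stays tiny here; only the Python interpreter's own writable state (refcounts, heap bookkeeping) gets copied, not the 16 MiB buffer.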

Real‑World Use Cases

Forking processes

The classic fork()‑then‑exec() pattern is still the backbone of many init systems (systemd, Upstart) and service managers. Because exec() replaces the child's inherited address space almost immediately, the child rarely writes to the shared pages first, making the fork virtually free.

A quick way to see the effect on a live system is to compare the RSS (Resident Set Size) before and after a fork:

# Show parent memory usage
ps -o pid,rss,command -p $$

# Fork a child that sleeps briefly, then execs /bin/true
( sleep 2 && exec /bin/true ) &

# Show the child's memory usage while it is still a fork of the shell
ps -o pid,rss,command -p $!

The child's RSS looks close to the parent's, but nearly all of it is shared rather than private – RSS counts shared pages too. For the child's true per‑process cost, look at the Pss value in /proc/<pid>/smaps_rollup, which is tiny for a freshly forked child.

Containers and LXC/Docker

Container runtimes rely heavily on CoW through overlay filesystems (overlayfs today; aufs historically) and copy‑on‑write image layers. When a container starts, the runtime mounts a read‑only base image and creates a thin writable layer on top. Any file modification inside the container triggers a copy‑up of the affected file into the writable layer, leaving the base image untouched. This design enables:

  • Fast container start‑up – mounting the base image is O(1).
  • Storage efficiency – dozens of containers can share the same base layers without duplicating data.
  • Snapshotting – tools like docker commit or podman commit simply record the diff layer.

The kernel’s cgroup v2 memory controller also benefits from CoW because multiple containers can share the same library pages while each container’s private memory is accounted separately.

Filesystem snapshots (Btrfs, ZFS, LVM)

Modern copy‑on‑write filesystems store data in immutable blocks. When a file is modified, the filesystem writes a new block rather than overwriting the old one, preserving a point‑in‑time snapshot. While this is a filesystem‑level CoW, the underlying kernel page‑cache still participates:

  • The page cache holds an in‑memory copy of the old block's data.
  • When that data is modified, the kernel marks the cached page dirty; at write‑back time the filesystem allocates a fresh on‑disk block rather than overwriting the old one, then points the file's metadata at the new block.

Thus, CoW at the page‑cache level complements the on‑disk CoW, providing consistent, low‑overhead snapshots that can be taken instantly.

Performance Implications

Reduced page‑fault overhead

Because the majority of pages remain shared after a fork, the child's working set is already resident in memory: code and data need not be re‑read from disk, so major page faults (the ones requiring I/O) stay rare, and the faults CoW does introduce are cheap minor faults. One Red Hat analysis of a typical Apache httpd worker pool attributed a ≈70 % reduction in page‑fault rates to this sharing, translating into smoother latency under load.

CPU cache benefits

When multiple processes read the same physical page, they share the same cache lines, so the data is fetched from DRAM once and served from cache thereafter. After a CoW copy, each process owns its own physical page, doubling the cache footprint for that data. In write‑heavy workloads the benefit therefore diminishes, but for read‑dominant services the shared pages improve effective cache capacity and throughput.

Benchmarks

Below is a simple Python benchmark that measures the time to fork 1,000 workers that immediately exit. The test runs on a 16‑core machine with 64 GiB RAM.

import os, sys, time

def fork_workers(n):
    start = time.time()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
    # Wait for all children
    while True:
        try:
            os.wait()
        except ChildProcessError:
            break
    return time.time() - start

if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 1000
    elapsed = fork_workers(n)
    print(f"Forked {n} workers in {elapsed:.3f}s")

Typical output:

Forked 1000 workers in 0.42s

The modest runtime demonstrates that the kernel's CoW implementation adds only a few hundred microseconds per fork (≈420 µs here), far cheaper than copying a full address space.

Pitfalls and Gotchas

Write‑intensive workloads

If a program writes to a large fraction of its memory after a fork, the initial savings evaporate because every written page must be copied. Database engines that fork a child to take a snapshot (e.g., Redis's BGSAVE) can see memory usage grow sharply during the save: every page the parent dirties while the child runs is duplicated, and in the worst case memory use approaches twice the dataset size.
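The snapshot‑via‑fork pattern is easy to demonstrate: the child's CoW view stays frozen at fork time while the parent keeps writing. This minimal sketch illustrates the principle (Redis's BGSAVE works the same way, at much larger scale):

```python
import os
import time

# Child inherits a CoW view of the data and can "serialize" a consistent
# point-in-time snapshot while the parent keeps mutating. Every page the
# parent dirties in the meantime is duplicated by the kernel.
data = {i: i for i in range(10_000)}

pid = os.fork()
if pid == 0:                          # child: the "snapshot writer"
    time.sleep(0.2)                   # give the parent time to mutate
    # The child's view is frozen at fork time: every pre-fork value survives.
    os._exit(0 if all(data[k] == k for k in data) else 1)

for k in range(0, 10_000, 2):         # parent mutates half the keys;
    data[k] = -1                      # each dirtied page is CoW-copied

snapshot_consistent = os.WEXITSTATUS(os.waitpid(pid, 0)[1]) == 0
print("child saw a consistent snapshot:", snapshot_consistent)
```

The trade‑off named above is visible here in miniature: every page the parent dirties during the child's lifetime costs one page copy.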

Overcommit and OOM

Linux permits memory overcommit (vm.overcommit_memory=0 by default), allowing the sum of all virtual allocations to exceed physical RAM. CoW can mask the true memory pressure because many pages are shared. Administrators should monitor PSS (Proportional Set Size) via /proc/<pid>/smaps to see the effective memory use per process.
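PSS can also be read programmatically. The helper below is a sketch, assuming a Linux kernel with smaps support; it prefers the single‑read smaps_rollup file and falls back to summing per‑mapping entries:

```python
import os

def pss_kib(pid=None):
    """Return the Proportional Set Size of a process in KiB.

    Reads /proc/<pid>/smaps_rollup (Linux >= 4.14) and falls back to
    summing the per-mapping Pss: lines from /proc/<pid>/smaps.
    """
    pid = pid if pid is not None else os.getpid()
    for name in ("smaps_rollup", "smaps"):
        try:
            with open(f"/proc/{pid}/{name}") as f:
                # Each "Pss:" line looks like "Pss:  1234 kB".
                return sum(int(line.split()[1])
                           for line in f if line.startswith("Pss:"))
        except FileNotFoundError:
            continue
    raise OSError("Pss accounting not available on this kernel")

print(f"PSS of this process: {pss_kib()} KiB")
```

Because PSS divides each shared page's cost by the number of sharers, summing it across all processes gives an honest total, unlike summing RSS, which counts shared pages repeatedly.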

Transparent Huge Pages (THP) interactions

THP transparently backs eligible mappings with 2 MiB huge pages instead of 4 KiB base pages. A CoW fault on a shared huge page is expensive: the kernel must either copy the full 2 MiB at once or first split the huge page into its 512 base pages. Disabling THP for workloads that fork heavily (echo never > /sys/kernel/mm/transparent_hugepage/enabled) often yields better fork latency.
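Before tuning, it helps to know the active policy. The snippet below is a sketch that reads the standard sysfs path without needing root; the bracketed word in the file marks the current setting:

```python
# Check the active THP policy before running a fork-heavy workload.
THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"
try:
    with open(THP_PATH) as f:
        policy = f.read().strip()    # e.g. "always [madvise] never"
    # The active mode is the bracketed token.
    current = (policy[policy.index("[") + 1 : policy.index("]")]
               if "[" in policy else policy)
except FileNotFoundError:
    current = "unknown (kernel built without THP)"
print("THP policy:", current)
```

On most distributions the default is madvise or always; fork‑heavy services may want never, or madvise plus explicit MADV_HUGEPAGE only where it pays off.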

Key Takeaways

  • Copy on Write shares physical pages between parent and child until a write occurs, dramatically reducing memory consumption for forks, containers, and snapshots.
  • The kernel tracks shared pages with reference counts, copying only on a page‑fault caused by a write to a read‑only mapping.
  • Real‑world systems—web servers, container runtimes, and CoW filesystems—leverage this mechanism to achieve fast start‑up times and high density.
  • Performance gains stem from fewer page faults, lower cache pressure, and minimal copy overhead; however, write‑heavy workloads can erode those benefits.
  • Administrators should monitor PSS, be aware of transparent huge page side effects, and tune overcommit settings to avoid hidden OOM surprises.

Further Reading