TL;DR — Copy‑on‑Write (COW) lets the Linux kernel share physical pages until a write occurs, dramatically reducing memory duplication for fork(), mmap(), and copy‑on‑write filesystems. The trade‑off is extra page‑fault handling and bookkeeping, which can affect latency under heavy write workloads.

Linux’s memory subsystem is a masterclass in engineering trade‑offs. The copy‑on‑write (COW) technique, introduced decades ago, remains a cornerstone of how the kernel conserves RAM while still providing the illusion of isolated address spaces. In this article we peel back the layers of the kernel’s COW implementation, examine the data structures that make it possible, and assess the performance impact on modern workloads.

Fundamentals of Copy‑on‑Write

Historical Context

The original Unix fork() system call duplicated a process's address space outright: early implementations copied every page, which was slow and wasteful on modest hardware. Copy-on-write fork, popularized by later virtual-memory Unix systems in the 1980s, removed that cost by letting the parent and child share the same physical pages marked read-only. Only when one side attempted a write would the kernel allocate a private copy.

Linux adopted the same model, extending it to other subsystems such as mmap(MAP_PRIVATE), shared memory (shmem), and several copy‑on‑write filesystems (e.g., Btrfs, OverlayFS). The core idea is unchanged: share until dirty.

How COW Works in the Kernel

At the heart of COW are three kernel structures:

  1. struct page – represents a physical page frame, including its reference count (_refcount in modern kernels; formerly _count) and flags such as PG_private and PG_dirty.
  2. struct vm_area_struct (VMA) – describes a contiguous virtual address range with its own permissions and flags (VM_WRITE, VM_MAYWRITE, VM_SHARED, VM_MAYSHARE).
  3. Page tables – the per‑process hierarchy (pgd, p4d, pud, pmd, pte) that maps virtual addresses to physical page frames.

When a process calls fork(), the kernel performs, in outline (a simplified sketch; the real copy_process() in kernel/fork.c takes different arguments and does far more):

static struct task_struct *copy_process(/* clone flags, stack, TLS, ... */)
{
    // 1. Duplicate the task_struct.
    // 2. Duplicate the mm_struct (address space) via copy_mm().
    // 3. Copy each VMA; private mappings remain private, and their
    //    anonymous pages become shared copy-on-write.
    // 4. Write-protect the PTEs of writable private pages on both
    //    sides via pte_wrprotect(), and take a reference on each
    //    shared page.
}

The crucial step is step 4: every writable private page in the parent's page tables has its hardware write bit cleared (on x86 this is the _PAGE_RW bit, cleared by pte_wrprotect()). The kernel also increments the struct page reference count, so both processes point to the same physical frame.

When either process later writes to such a page, the hardware raises a page fault because the PTE is write‑protected. The fault handler (entered via handle_mm_fault()) recognizes a write to a write‑protected private mapping and ends up in do_wp_page(), which breaks COW along these lines (a heavily simplified sketch of the kernel's wp_page_copy(), omitting locking, reverse‑mapping, and TLB‑flush details):

static vm_fault_t wp_page_copy(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    struct page *old_page = vmf->page;
    struct page *new_page;
    pte_t entry;

    // Allocate a fresh page, preferring the faulting task's node.
    new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
    if (!new_page)
        return VM_FAULT_OOM;

    // Copy the old contents into the new page.
    copy_user_highpage(new_page, old_page, vmf->address, vma);

    // Build a writable, dirty PTE for the new page and install it.
    entry = mk_pte(new_page, vma->vm_page_prot);
    entry = pte_mkwrite(pte_mkdirty(entry));
    set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

    // Drop this mapping's reference to the old page; other
    // processes that still map it are unaffected.
    put_page(old_page);
    return 0;
}

The net effect: the writer receives a private copy, while the other process continues to see the original untouched data.
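
This behavior is easy to observe from user space. The sketch below (an illustrative demo, not kernel code) forks after filling a private mapping; the child's write triggers a COW fault and lands in a private copy, so the parent still reads the original:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Private anonymous mapping: COW-eligible after fork(). */
    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); exit(1); }

    strcpy(buf, "original");

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: this write faults and receives a private copy. */
        strcpy(buf, "modified by child");
        printf("child sees:  %s\n", buf);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    /* Parent still sees the untouched original data. */
    printf("parent sees: %s\n", buf);
    return 0;
}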

Interaction with Page Tables and VMAs

Page‑Table Granularity

Linux's page tables on x86‑64 (four levels, or five with LA57) map normal pages at 4 KiB granularity, but also support huge pages (2 MiB, 1 GiB) at the PMD and PUD levels. COW operates on whatever mapping size is actually installed. When a write hits a COW‑protected huge page, the kernel must decide whether to split it into base pages or copy it as a whole huge page; the choice depends on hints such as MADV_HUGEPAGE and on current memory pressure.
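
User space can request huge pages for a range with madvise(); the sketch below is illustrative, and whether the kernel actually uses a huge page (and how it later breaks COW on it) depends on the system‑wide THP mode and memory pressure:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_2MIB (2UL * 1024 * 1024)

int main(void)
{
    /* Map a large anonymous region. The kernel can only back
     * 2 MiB-aligned, 2 MiB-sized chunks of it with huge pages. */
    size_t len = 8 * HUGE_2MIB;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Hint that this range should use transparent huge pages. */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    /* Touch the region so frames (possibly huge) are allocated. */
    for (size_t i = 0; i < len; i += 4096)
        p[i] = 1;

    /* AnonHugePages in /proc/self/smaps shows whether THP was used. */
    printf("mapped %zu MiB; check AnonHugePages in /proc/self/smaps\n",
           len >> 20);
    munmap(p, len);
    return 0;
}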

VMA Flags and COW Semantics

A VMA’s vm_flags control whether a region can be COW‑eligible:

Flag          Meaning                                      COW Interaction
VM_SHARED     Mapping is shared across processes           No COW; writes are visible to all mappers
VM_MAYSHARE   Mapping may be made shared later             COW allowed if not currently shared
VM_WRITE      Process may write (subject to page tables)   Required for a COW fault to be triggered
VM_MAYWRITE   Process might write in the future            Enables lazy COW on fork

When a VMA is created with MAP_PRIVATE, the kernel leaves VM_SHARED clear and sets VM_MAYWRITE (where the backing file permits), making the region COW‑eligible. Conversely, MAP_SHARED keeps the pages shared among all participants, bypassing COW entirely.
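
The semantic difference is easy to verify: after fork(), a child's write through a MAP_SHARED mapping is visible to the parent, while the same write through MAP_PRIVATE triggers COW and stays private. A minimal demo (illustrative, not from any kernel test suite):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static void probe(int flags, const char *name)
{
    int *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                  flags | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    *p = 0;

    if (fork() == 0) {
        *p = 42;          /* COW fault iff mapping is MAP_PRIVATE */
        _exit(0);
    }
    wait(NULL);
    printf("%-12s parent sees %d after child wrote 42\n", name, *p);
    munmap(p, 4096);
}

int main(void)
{
    probe(MAP_SHARED,  "MAP_SHARED:");  /* prints 42 */
    probe(MAP_PRIVATE, "MAP_PRIVATE:"); /* prints 0  */
    return 0;
}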

Benefits for Memory Efficiency

Fork‑Heavy Workloads

Consider a web server that spawns a new process per request (e.g., classic CGI). Without COW, each fork() would duplicate the entire code segment, data segment, and libraries, inflating memory usage dramatically. With COW, the parent and child share all read‑only pages (text, rodata) and even the read‑write pages until the child modifies its environment.

A benchmark on a 16 GiB system shows that spawning 10 000 fork() processes consumes roughly 200 MiB of RSS when COW is active, versus >4 GiB without it (source: Linux Kernel Documentation).

Private mmap Regions

mmap(MAP_PRIVATE|MAP_ANONYMOUS) creates a zero‑filled region that is copy‑on‑write from the start. The kernel lazily allocates physical pages only when a write occurs, allowing massive address spaces (e.g., 1 TiB) to exist virtually without backing RAM. This is essential for languages that rely on demand‑paged heaps.
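
A short experiment makes this laziness visible: reserve a large region, touch a handful of pages, and compare VmSize against VmRSS. The sketch below is illustrative; the 64 GiB figure may need adjusting depending on available RAM and the vm.overcommit_memory setting:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Reserve 64 GiB of virtual address space. MAP_NORESERVE
     * skips swap accounting; no physical pages are allocated yet. */
    size_t len = 64UL << 30;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Touch 16 pages; only these receive physical frames. */
    for (size_t i = 0; i < 16; i++)
        p[i << 12] = 1;

    /* VmSize will show ~64 GiB, VmRSS only a few dozen KiB. */
    char cmd[64];
    snprintf(cmd, sizeof(cmd),
             "grep -E 'VmSize|VmRSS' /proc/%d/status", getpid());
    system(cmd);
    munmap(p, len);
    return 0;
}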

Copy‑on‑Write Filesystems

Filesystems like Btrfs use COW at the block level to implement snapshots. When a file is modified, only the changed blocks are written to a new location, while the unchanged blocks remain shared between the snapshot and the live filesystem. This provides instant, space‑efficient snapshots and aids crash consistency, since live data is never overwritten in place.

Pitfalls and Edge Cases

Page‑Fault Overhead

Each COW write takes the full page‑fault path: a hardware exception, fault validation, page allocation, copy, and PTE update. In write‑intensive workloads (e.g., in‑memory databases that fork for checkpointing), this overhead can dominate latency. Profiling tools such as perf can show COW faults accounting for a large share of CPU time, sometimes more than 30 %, under heavy fork‑and‑write patterns.
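
A crude way to put a number on the per‑fault cost is to time first writes to freshly forked COW pages; the microbenchmark below is an illustrative sketch, and results vary widely with hardware and kernel version:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NPAGES 4096
#define PAGESZ 4096

int main(void)
{
    char *buf = mmap(NULL, (size_t)NPAGES * PAGESZ,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Populate the pages so real frames are shared via COW. */
    memset(buf, 1, (size_t)NPAGES * PAGESZ);

    if (fork() == 0) {
        /* Child: every first write below takes a COW fault. */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NPAGES; i++)
            buf[(size_t)i * PAGESZ] = 2;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                + (t1.tv_nsec - t0.tv_nsec);
        printf("%ld ns/page over %d COW faults\n", ns / NPAGES, NPAGES);
        _exit(0);
    }
    wait(NULL);
    return 0;
}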

Reference‑Count Contention

struct page's _refcount is an atomic counter. When many processes share a page, every COW break performs atomic reference‑count updates on both the old page and its replacement. On NUMA systems, this can cause cross‑node cache‑line bouncing on hot pages, degrading scalability.

Transparent Huge Pages (THP) Interaction

THP aims to reduce TLB pressure by using 2 MiB pages. However, a COW fault on a huge page is expensive either way: the kernel must either copy the full 2 MiB up front or split the huge page into 4 KiB pages and copy only the faulting one, giving up the TLB benefit for that range. Which path is taken depends on memory availability, so the first write after fork() to THP‑backed memory carries a noticeably higher cost.

Memory Overcommit

Linux's overcommit model allows allocating more virtual memory than physical RAM. COW can mask the true memory pressure, because many processes appear to have large RSS values while actually sharing pages. Administrators should monitor Committed_AS and CommitLimit in /proc/meminfo to avoid out‑of‑memory (OOM) surprises.
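
Both counters live in /proc/meminfo; a minimal watcher just parses them out, as in this sketch:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Print the overcommit accounting fields from /proc/meminfo. */
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[128];
    while (fgets(line, sizeof(line), f))
        if (!strncmp(line, "CommitLimit", 11) ||
            !strncmp(line, "Committed_AS", 12))
            fputs(line, stdout);
    fclose(f);
    return 0;
}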

Real‑World Use Cases

Fork‑Based Checkpointing

In‑memory databases such as Redis use fork() for background persistence (BGSAVE): the child serializes a consistent snapshot of memory to disk while the parent continues serving queries. COW keeps the snapshot consistent without halting the parent; only the pages the parent dirties during the save are copied, so most of the state stays shared.

Container Runtimes

Docker and other container runtimes employ copy‑on‑write layered filesystems (e.g., OverlayFS). Each container's root filesystem is built as a stack of read‑only image layers plus a writable upper layer. When a file from a lower layer is modified, OverlayFS copies the entire file up to the writable layer (a "copy‑up") before applying the change, while the read‑only image layers remain shared across all containers.

Virtual Machines and KVM

Guest RAM under KVM is typically a private anonymous mapping in the host, so untouched guest pages share the kernel's zero page until first write, and kernel same‑page merging (KSM) can further deduplicate identical pages across VMs. KVM can also use userfaultfd to populate guest memory lazily, for example during post‑copy live migration. Together these mechanisms reduce host RAM consumption for VMs that allocate large amounts of memory but touch only a fraction.

Performance Considerations

Measuring COW Impact

A typical way to quantify COW cost is to compare plain fork() against clone() with the CLONE_VM flag, which shares the address space outright and thus skips page‑table duplication and later COW faults. Using time:

$ time ./fork_benchmark
real    0m0.012s
user    0m0.010s
sys     0m0.002s

Forcing an eager copy instead (for example, by having the child immediately dirty every page after fork()) can increase the total time by an order of magnitude.
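
For reference, a fork_benchmark of the kind timed above can be as small as the sketch below; the buffer size and iteration count are illustrative choices, not taken from any standard benchmark:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERATIONS 1000

int main(void)
{
    /* Allocate memory so fork() has page tables worth copying. */
    size_t len = 64UL << 20;             /* 64 MiB */
    char *buf = malloc(len);
    if (!buf) { perror("malloc"); exit(1); }
    for (size_t i = 0; i < len; i += 4096)
        buf[i] = 1;                      /* fault the pages in */

    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = fork();              /* COW: no page copies here */
        if (pid == 0)
            _exit(0);                    /* child exits immediately */
        waitpid(pid, NULL, 0);
    }
    free(buf);
    return 0;
}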

Tuning Parameters

Linux exposes several sysctl knobs that affect COW behavior:

  • vm.overcommit_memory – controls whether allocation can exceed RAM+swap.
  • vm.swappiness – influences how aggressively anonymous pages (including COW‑shared ones) are swapped out under memory pressure.
  • vm.min_free_kbytes – reserves free memory to avoid OOM during massive COW faults.

Adjusting vm.max_map_count can also be relevant for applications that create many private mappings; exceeding the default (65530) causes mmap() to fail with ENOMEM as the process runs out of VMAs.

Mitigation Strategies

  1. Pre‑fault pages – If the child is expected to write heavily, touch (write to) its hot pages immediately after fork(), or map them with MAP_POPULATE, so COW copies happen up front rather than on the latency‑critical path (see the sketch after this list). Note that mlockall(MCL_CURRENT|MCL_FUTURE) pins pages in RAM but does not prevent COW faults.
  2. Copy‑on‑Write Friendly Data Structures – Immutable data structures reduce the number of writes after a fork, preserving shared pages.
  3. NUMA‑aware allocation – Allocate pages on the same node as the process that will modify them to avoid cross‑node page‑fault traffic.
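
As a concrete example of the first strategy, the hypothetical prefault_writable() helper below dirties one byte per page so that all COW copies happen before the latency‑sensitive phase begins:

#include <stddef.h>
#include <unistd.h>

/* Walk a buffer and dirty one byte per page, forcing every COW
 * fault now instead of during latency-critical work. The volatile
 * access keeps the compiler from optimizing the writes away. */
static void prefault_writable(char *buf, size_t len)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < len; off += (size_t)pagesz) {
        volatile char *p = &buf[off];
        *p = *p;   /* read-modify-write dirties the page */
    }
}

/* Typical use, immediately after fork() in the child:
 *
 *     if (fork() == 0) {
 *         prefault_writable(hot_buffer, hot_len);
 *         run_latency_sensitive_work();
 *     }
 */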

Future Directions

The kernel community is exploring cheaper write tracking, for example using hardware features such as Memory Protection Keys (pkeys) to switch access rights without rewriting page tables. Note, however, that pkeys still operate at page granularity; true sub‑page COW would require further hardware support. Either approach could reduce fault overhead for workloads that modify only small portions of a page.

Another avenue is eBPF‑based COW monitoring, allowing administrators to attach probes (e.g., kprobes) to do_wp_page() and collect per‑process statistics with minimal overhead. Early prototypes show promise for dynamic throttling of fork‑heavy services during memory pressure.

Finally, integration with persistent memory (PMEM) may change the COW landscape. Since PMEM provides byte‑addressable durability, kernels might adopt copy‑on‑write logging at the block level rather than relying on page‑fault semantics, blending traditional filesystem COW with virtual memory.

Key Takeaways

  • COW is the linchpin of Linux’s memory efficiency, enabling cheap fork() and private mmap without duplicating pages up front.
  • Page‑fault handling is the cost: each write to a shared page triggers allocation, copy, and PTE update, which can dominate latency in write‑heavy scenarios.
  • Data structures matter: immutable or read‑only designs preserve shared pages longer, reducing the number of COW faults.
  • System tuning: parameters like vm.overcommit_memory and vm.swappiness shape how aggressively the kernel allows COW to consume virtual memory.
  • Future hardware features (pkeys, PMEM) could refine COW granularity and reduce fault overhead, but the fundamental principle of “share until dirty” will likely remain.

Further Reading