Deep Dive into Copy-on-Write Semantics: Memory Management and Performance in Modern Linux Kernels

TL;DR — Copy‑on‑Write (COW) lets the Linux kernel share pages between processes until a write occurs, dramatically reducing memory pressure. Understanding the VFS, page‑cache, and fork‑path internals helps you tune performance and avoid common pitfalls in production workloads.

Copy‑on‑Write is one of the quiet workhorses of the Linux kernel. Every time a process forks, a container spawns, or a snapshot is taken, COW decides whether pages stay shared or become private. For engineers building high‑throughput services, databases, or container orchestration platforms, the hidden cost of a “copy” can be the difference between a smooth rollout and a latency spike. This article walks through the kernel’s COW implementation, the data structures that make it possible, and concrete patterns you can apply today.

The Problem COW Solves

Memory pressure in multi‑process workloads

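When a parent process creates a child with fork(), naïvely copying the entire address space would duplicate every resident page, roughly doubling the memory the two processes consume and making fork() time proportional to the parent's resident set size (RSS). In server environments where dozens of workers fork from a common binary, that approach is untenable.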

The COW principle

COW postpones the actual copy until a write fault occurs. During fork(), the kernel write-protects the private, writable pages in both the parent's and the child's page tables, takes an extra reference on each underlying struct page, and lets both processes read the same physical memory. Only when one of them attempts to write does the kernel allocate a new page, copy the contents, and update that process's PTE (page-table entry) to point to the private copy.
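The visible effect of this rule can be checked from user space. The following is a minimal sketch, not kernel code: `cow_fork_demo()` is a hypothetical helper name, and the value 42 is arbitrary. The child overwrites a variable it inherited, and the parent then confirms that its own copy is untouched.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the parent's view of `value` after the child
 * has overwritten its own copy, or -1 if fork() or waitpid() fails. */
int cow_fork_demo(void)
{
    volatile int value = 42;   /* sits on a page that fork() shares */
    int status;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        value = 7;             /* lands in the child's private copy of the page */
        _exit(value == 7 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value;              /* the child's write never reached the parent */
}
```

Called from a small main(), the function should return 42 on Linux, because the child's store went to a page that was copied for the child alone.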

Where COW Lives in the Kernel

VFS and the page cache

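The Virtual File System (VFS) abstracts files, while the page cache stores file data in memory. When a file is mapped with mmap(MAP_PRIVATE), the kernel creates a copy-on-write mapping: reads are served from the page cache, and the first write to a page gives the process a private anonymous copy that is never written back to the file. fork() relies on the same mechanism: the child receives a copy of the parent's mm_struct and page tables, with private writable pages write-protected on both sides.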
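A short experiment makes the MAP_PRIVATE behavior concrete. This is a sketch under assumptions: the helper name and the /tmp scratch path are made up for illustration. It writes through a private mapping and then re-reads the file with pread() to check that the write stayed in the process's private copy.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a write through a MAP_PRIVATE mapping
 * left the file unchanged, 0 if it leaked through, -1 on error. */
int private_mapping_leaves_file_intact(void)
{
    char path[] = "/tmp/cow_demo_XXXXXX";   /* assumed scratch location */
    const char original[] = "original";
    char readback[sizeof(original)] = {0};
    char *map;
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file disappears once fd is closed */

    if (write(fd, original, sizeof(original)) != (ssize_t)sizeof(original))
        goto out;

    map = mmap(NULL, sizeof(original), PROT_READ | PROT_WRITE,
               MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    memcpy(map, "modified", sizeof("modified"));   /* forces a private copy */

    /* The file itself must still hold the original bytes. */
    if (pread(fd, readback, sizeof(readback), 0) == (ssize_t)sizeof(readback))
        result = (strcmp(readback, original) == 0 &&
                  strcmp(map, "modified") == 0);

    munmap(map, sizeof(original));
out:
    close(fd);
    return result;
}
```

With MAP_SHARED in place of MAP_PRIVATE the same write would reach the page cache and eventually the file, so the function would report 0.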

Key structures:

Structure	Role
`struct vm_area_struct` (VMA)	Describes a contiguous virtual memory region; `vm_flags` distinguishes shared mappings (`VM_SHARED` set) from private ones (`VM_SHARED` clear; there is no `VM_PRIVATE` flag).
`struct page`	Represents a physical page frame; contains `_mapcount` (how many page-table entries map it) and `_refcount` (overall reference count).
`struct mm_struct`	Holds the complete address space of a process: its set of VMAs and a pointer (`pgd`) to the top of the page-table tree.

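When a write fault hits a write-protected COW page, the kernel's write-protect fault path takes over. In mainline kernels this work is done by do_wp_page() and wp_page_copy() in mm/memory.c; handle_cow_fault() is used here as an illustrative name for that path. It performs:

Page lookup – vm_normal_page() resolves the faulting PTE to its struct page.
Copy decision – If the faulting process is the only remaining user of an anonymous page, the kernel simply makes the existing PTE writable again; if the page is still shared, a copy is needed.
Copy – The kernel allocates a new page, copies the old contents into it, and points the faulting process's PTE at the new page with write permission.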

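/*
 * Illustrative pseudocode, not a verbatim excerpt: handle_cow_fault() is a
 * made-up name. In mainline, do_wp_page() and wp_page_copy() in mm/memory.c
 * do this work. Locking, rmap, memcg accounting, and TLB flushing are
 * omitted, and helper names and signatures vary between kernel versions.
 */
static vm_fault_t handle_cow_fault(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    struct page *old_page = vmf->page;
    struct page *new_page;
    pte_t entry;

    /* Allocate a fresh user page according to the VMA's memory policy */
    new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
    if (!new_page)
        return VM_FAULT_OOM;

    /* Copy contents (copy_page() takes kernel addresses, not struct page) */
    copy_user_highpage(new_page, old_page, vmf->address, vma);

    /* Build a writable, dirty PTE for the new page and install it */
    entry = pte_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)));
    set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

    return 0;   /* fault handled */
}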
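The reuse-or-copy decision in the steps above can also be modeled in plain user-space C. This is a toy under stated assumptions: `struct toy_page`, `toy_share()`, and `toy_write()` are invented for illustration, and a single `refs` counter stands in for the kernel's more elaborate mapcount and refcount bookkeeping.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-in for struct page: one buffer plus a count of mappers. */
struct toy_page {
    int refs;
    unsigned char data[TOY_PAGE_SIZE];
};

/* Share a page with another "process", as fork() does for each page. */
struct toy_page *toy_share(struct toy_page *page)
{
    page->refs++;
    return page;
}

/* Handle a write through *slot: reuse the page when this mapper is the only
 * one, otherwise copy it and drop one reference on the original.
 * Returns 0 on success, -1 on a bad offset or failed allocation. */
int toy_write(struct toy_page **slot, size_t offset, unsigned char byte)
{
    struct toy_page *old = *slot;

    if (offset >= TOY_PAGE_SIZE)
        return -1;

    if (old->refs > 1) {
        struct toy_page *copy = malloc(sizeof(*copy));
        if (!copy)
            return -1;
        memcpy(copy->data, old->data, TOY_PAGE_SIZE);
        copy->refs = 1;
        old->refs--;           /* the writer no longer maps the old page */
        *slot = copy;
    }

    (*slot)->data[offset] = byte;
    return 0;
}
```

After `toy_share()`, the first `toy_write()` through either slot allocates a copy; a later write through the remaining slot finds `refs == 1` and modifies the page in place, which mirrors the kernel's reuse path.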

Fork path in practice

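fork() and clone() enter the kernel through kernel_clone() (named do_fork() in older kernels), which creates a new task_struct and duplicates the parent's address space via copy_mm(). For each VMA, copy_page_range() copies the page-table entries and write-protects private writable pages in both parent and child; this is the step that sets up COW. The address space is released with mmput() when the child exits. A small test program shows the effect from user space: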

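# Example: measuring peak RSS before and after fork
$ cat > fork_test.c <<'EOF'
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>
int main(void) {
    struct rusage ru;
    /* On Linux, ru_maxrss is the peak RSS of this process in kB */
    getrusage(RUSAGE_SELF, &ru);
    printf("Peak RSS before fork: %ld kB\n", ru.ru_maxrss);
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        // child sleeps
        sleep(5);
        return 0;
    }
    sleep(1);
    getrusage(RUSAGE_SELF, &ru);
    printf("Peak RSS after fork: %ld kB\n", ru.ru_maxrss);
    wait(NULL);  /* reap the child so it does not linger as a zombie */
    return 0;
}
EOF
$ gcc -O2 fork_test.c -o fork_test
$ ./fork_test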

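Running the binary on a typical Ubuntu 22.04 system shows only a few kilobytes of change in the parent's peak RSS. Note that ru_maxrss is the peak RSS of the calling process, so this confirms only that fork() did not duplicate memory into the parent. To verify that the child shares pages rather than owning copies, compare the Shared_* and Private_* fields in /proc/<pid>/smaps for the child.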
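A more direct way to see COW at work is to count the minor faults a child takes when it writes to inherited pages. The sketch below assumes Linux and a private anonymous mapping small enough that huge pages are not involved; `cow_faults_in_child()` is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of minor faults a forked child
 * takes while writing one byte to each of `npages` inherited pages,
 * or -1 on error. */
long cow_faults_in_child(size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    long delta = -1;
    int fds[2];

    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset((void *)buf, 1, len);         /* fault every page in before forking */

    if (pipe(fds) < 0) {
        munmap((void *)buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        struct rusage before, after;
        getrusage(RUSAGE_SELF, &before);
        for (size_t i = 0; i < npages; i++)
            buf[i * page] = 2;           /* first write to each shared page */
        getrusage(RUSAGE_SELF, &after);
        long n = after.ru_minflt - before.ru_minflt;
        _exit(write(fds[1], &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : 1);
    }

    close(fds[1]);
    if (pid > 0) {
        if (read(fds[0], &delta, sizeof(delta)) != (ssize_t)sizeof(delta))
            delta = -1;
        waitpid(pid, NULL, 0);
    }
    close(fds[0]);
    munmap((void *)buf, len);
    return delta;
}
```

For `cow_faults_in_child(64)` one would expect roughly one minor fault per page, since each write hits a write-protected PTE; the exact number can differ slightly because the child's stack pages are shared as well.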

Architecture of COW in Modern Kernels

Page‑Table hierarchy

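On x86-64, Linux uses a 4-level page-table hierarchy, or 5 levels when the CPU and kernel support LA57; arm64 typically uses three or four levels, depending on page size and virtual address width. Leaf entries carry the write permission: x86 clears the _PAGE_RW bit to make a page read-only, while arm64 sets PTE_RDONLY. When a write hits a read-only entry, the hardware raises a page-fault exception, and the architecture's fault entry point (exc_page_fault() on current x86-64, do_page_fault() on arm64 and on older x86 kernels) calls handle_mm_fault(), which routes the fault to the appropriate handler based on the VMA and PTE state.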
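To make the hierarchy concrete, the arithmetic for splitting a virtual address can be written out. The sketch assumes x86-64 4-level paging with 4 KiB pages (a 12-bit page offset and 9 index bits per level); `pt_index()` and `pt_offset()` are illustrative helper names, not kernel functions.

```c
#include <stdint.h>

/* Bit widths for x86-64 4-level paging with 4 KiB pages. */
#define PT_OFFSET_BITS 12
#define PT_INDEX_BITS  9

/* Table index for a given level: 0 = PTE, 1 = PMD, 2 = PUD, 3 = PGD. */
unsigned pt_index(uint64_t vaddr, unsigned level)
{
    unsigned shift = PT_OFFSET_BITS + PT_INDEX_BITS * level;
    return (unsigned)((vaddr >> shift) & ((1u << PT_INDEX_BITS) - 1));
}

/* Byte offset within the 4 KiB page. */
unsigned pt_offset(uint64_t vaddr)
{
    return (unsigned)(vaddr & ((1u << PT_OFFSET_BITS) - 1));
}
```

Each level indexes a table of 512 entries, so one PTE covers 4 KiB, one PMD entry covers 2 MiB, and one PUD entry covers 1 GiB.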

Reference counting and the `mapcount` race

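_mapcount is incremented atomically each time another page-table entry maps the page, for example when fork() copies a parent's page tables. The COW fault path reads these counters to decide whether a page can be reused in place, so the kernel must keep them consistent against concurrent faults, unmaps, and forks. Under heavy forking, pages shared by many processes (such as pages of a common library or a pre-loaded heap) turn their counters into contended cache lines. Kernel Samepage Merging (KSM) is a related subsystem: it merges identical anonymous pages and relies on the same COW path to split them again on write.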

Interaction with Transparent Huge Pages (THP)

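When THP is enabled, a 2 MiB huge page can be COW-shared just like a 4 KiB page. A write fault on a shared huge page is more expensive, though: depending on the kernel version, the kernel either copies the entire 2 MiB page or splits the huge mapping into 4 KiB entries and copies only the faulting page. Production teams often disable THP for latency-sensitive services, or switch /sys/kernel/mm/transparent_hugepage/enabled to madvise so that only regions that opt in receive huge pages. (vm.nr_hugepages sizes the separate hugetlbfs pool and does not affect THP.)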

+--------------------+      +--------------------+
|  Parent PTE (RO)   | ---> |  Shared struct page|
+--------------------+      +--------------------+
        |
        v  (fork)
+--------------------+      +--------------------+
| Child PTE (RO)     | ---> |  Same struct page  |
+--------------------+      +--------------------+

Write fault in child:
1. Allocate new page
2. Copy data
3. Update child PTE to RW, point to new page

Performance Implications

Latency spikes on first write

The first write to a shared page incurs:

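Page allocation (alloc_page) – served by the buddy allocator; it involves NUMA node selection and, under memory pressure, may trigger reclaim or compaction.
Cache line fill – the copy reads the source page into the CPU cache, then writes the destination.
TLB shootdown – if other threads of the process are running on other CPUs, updating the PTE requires inter-processor interrupts to flush their stale TLB entries.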

In latency-critical paths (e.g., request handling in a high-QPS web server), these spikes can be measurable. Benchmarks from the LWN article “Copy-on-Write on the fast path” report a 30-50 µs latency penalty for the first write to a 4 KiB page under load; the figure depends heavily on hardware and memory pressure, so measure on your own systems.
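One way to get a number for your own hardware is to time the same writes twice in a forked child: the first pass pays for the COW faults, the second pass does not. This is a rough sketch rather than a rigorous benchmark; `struct cow_timing` and `measure_cow_write_cost()` are hypothetical names, and the results will vary with load and CPU frequency scaling.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

struct cow_timing {
    uint64_t first_pass_ns;   /* writes that trigger COW faults */
    uint64_t second_pass_ns;  /* same writes, pages already private */
};

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Write one byte per page and return the elapsed time in nanoseconds. */
static uint64_t touch_pages(volatile char *buf, size_t npages, size_t page)
{
    uint64_t start = now_ns();
    for (size_t i = 0; i < npages; i++)
        buf[i * page] = 2;
    return now_ns() - start;
}

/* Returns 0 and fills *out on success, -1 on error. */
int measure_cow_write_cost(size_t npages, struct cow_timing *out)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    int fds[2];
    int rc = -1;

    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset((void *)buf, 1, len);         /* populate pages in the parent */

    if (pipe(fds) < 0) {
        munmap((void *)buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        struct cow_timing t;
        t.first_pass_ns = touch_pages(buf, npages, page);
        t.second_pass_ns = touch_pages(buf, npages, page);
        _exit(write(fds[1], &t, sizeof(t)) == (ssize_t)sizeof(t) ? 0 : 1);
    }

    close(fds[1]);
    if (pid > 0) {
        if (read(fds[0], out, sizeof(*out)) == (ssize_t)sizeof(*out))
            rc = 0;
        waitpid(pid, NULL, 0);
    }
    close(fds[0]);
    munmap((void *)buf, len);
    return rc;
}
```

Dividing `first_pass_ns - second_pass_ns` by the page count gives an approximate per-page COW cost for the machine under test; repeating the run several times helps separate signal from noise.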

Memory fragmentation

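Repeated COW writes can fragment physical memory, especially when large files are mapped with MAP_PRIVATE and modified sparsely: each private copy is a separate anonymous page allocated wherever a free frame is available, while the untouched pages stay in the page cache. Fragmentation makes huge-page allocation harder and compaction more frequent. /proc/buddyinfo shows how fragmented free memory is, and perf top can surface hot COW paths.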

NUMA considerations

On NUMA hardware, allocating the private copy on the same node as the faulting CPU reduces remote memory latency. Under the default memory policy the kernel allocates the new page on the faulting CPU's node, but if the original shared page resides on a different node, the copy still incurs a remote read. Strategies:

Pin processes to the node that owns the majority of their memory (numactl --cpunodebind=<node> --membind=<node>).
Use madvise(MADV_DONTFORK) for memory regions that children should not inherit at all (e.g., per-worker buffers); the range is simply absent in the child, so touching it there faults.
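The effect of MADV_DONTFORK can be demonstrated with a child that deliberately touches the excluded range. This sketch assumes Linux; the helper name is made up, and the child is expected to die with SIGSEGV, which may leave a core file if core dumps are enabled.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a child that touches a MADV_DONTFORK
 * region is killed by SIGSEGV, 0 if it could read the region, -1 on error. */
int dontfork_region_is_absent_in_child(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int status = 0;

    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 1;                               /* make the page resident */

    if (madvise((void *)buf, len, MADV_DONTFORK) != 0) {
        munmap((void *)buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        char c = buf[0];                      /* not mapped in the child */
        _exit(c);
    }

    if (pid < 0 || waitpid(pid, &status, 0) < 0) {
        munmap((void *)buf, len);
        return -1;
    }
    munmap((void *)buf, len);

    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

Because the range is never mapped in the child, it costs no page-table copying at fork time and can never be the subject of a COW fault; the trade-off is that the child must not rely on that memory.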

Interaction with container runtimes

Docker and containerd ultimately start container processes with clone() and exec(). When many containers share the same base image layers, COW enables rapid startup. At the file-system level, overlayfs (Docker's overlay2 driver) implements its own copy-on-write: the first write to a file from a lower layer copies the whole file up into the container's writable layer. When many containers modify large lower-layer files at once, the resulting copy-up storm costs I/O and page-cache memory. Kernel-level page COW still matters alongside this, for example for the private data pages of shared libraries that containers map from a common layer.

Patterns in Production

1. Pre‑fork warm‑up

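Allocate and touch all pages that will be read frequently before forking. This faults the pages in once, in the parent, so that workers find the data already resident instead of each paying the page-in cost on the critical path. Warm-up does not remove COW for pages the workers later modify, so keep mutable per-worker state out of the pre-fork data or allocate it after the fork.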

# Warm up a memory-mapped file before forking
python3 - <<'PY'
import mmap, os
fd = os.open("/tmp/large.bin", os.O_RDONLY)
mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
# Touch one byte in every page so the whole file becomes resident
for i in range(0, len(mm), mmap.PAGESIZE):
    _ = mm[i]
# In a real server, fork the workers here, before closing the mapping
mm.close()
os.close(fd)
PY

2. Use `MAP_PRIVATE` judiciously

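If a workload only reads a file, a read-only mapping never triggers COW with either flag: both MAP_SHARED and MAP_PRIVATE serve reads from the same page-cache pages, even across containers that use the same underlying file. The difference appears on the first write, where MAP_PRIVATE creates a private anonymous copy. Mapping with PROT_READ and MAP_SHARED makes the read-only intent explicit and rules out accidental private copies.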

3. Disable THP for latency‑sensitive services

Add transparent_hugepage=never to the kernel boot line or set /sys/kernel/mm/transparent_hugepage/enabled to never for services where the split‑on‑write cost outweighs the benefits of reduced TLB pressure.
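Before changing the setting, it helps to check which mode is currently active; the sysfs file lists all modes and marks the active one with square brackets, for example `always [madvise] never`. The sketch below reads and parses that line; `thp_parse_mode()` and `thp_current_mode()` are hypothetical helper names.

```c
#include <stdio.h>
#include <string.h>

/* Extracts the bracketed token from a line such as "always [madvise] never".
 * Returns 0 and writes the token to `mode`, or -1 if none is found or the
 * buffer is too small. */
int thp_parse_mode(const char *line, char *mode, size_t mode_len)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;

    if (!close)
        return -1;

    size_t n = (size_t)(close - open - 1);
    if (n + 1 > mode_len)
        return -1;

    memcpy(mode, open + 1, n);
    mode[n] = '\0';
    return 0;
}

/* Reads the system-wide THP mode. Returns -1 if the file is missing,
 * for example on a kernel built without THP support. */
int thp_current_mode(char *mode, size_t mode_len)
{
    char line[128];
    int rc = -1;
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

    if (!f)
        return -1;
    if (fgets(line, sizeof(line), f))
        rc = thp_parse_mode(line, mode, mode_len);
    fclose(f);
    return rc;
}
```

A service could call `thp_current_mode()` at startup and log a warning when the mode is `always`, which is cheaper than discovering the setting during a latency investigation.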

4. Leverage `madvise(MADV_DONTNEED)` after a write‑heavy phase

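When a process finishes a burst of writes to a private mapping, calling madvise(..., MADV_DONTNEED) releases the private pages back to the system immediately instead of waiting for reclaim. The contents are discarded: the next access to the range sees zero-filled pages for anonymous memory, or the original file contents for a file-backed MAP_PRIVATE mapping, so use it only on data that is no longer needed.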

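#include <stddef.h>
#include <sys/mman.h>

// Release private pages after processing; the contents of the range are discarded.
// addr must be page-aligned. Returns 0 on success, -1 with errno set on failure.
int release_private_mmap(void *addr, size_t len) {
    return madvise(addr, len, MADV_DONTNEED);
}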
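The discard semantics are easy to verify on a private anonymous mapping. In this sketch (the helper name is an assumption, and the behavior shown is Linux-specific), a page is filled with 0xAB, dropped with MADV_DONTNEED, and read again.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns the first byte read back after MADV_DONTNEED
 * on a private anonymous page that previously held 0xAB, or -1 on error. */
int byte_after_dontneed(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int value;

    volatile unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset((void *)buf, 0xAB, len);          /* page is now private and dirty */

    if (madvise((void *)buf, len, MADV_DONTNEED) != 0) {
        munmap((void *)buf, len);
        return -1;
    }

    value = buf[0];                          /* faults in a fresh zero page */
    munmap((void *)buf, len);
    return value;
}
```

On Linux the function returns 0, not 0xAB: the mapping stays valid, but the old page is gone and the next access is served by a new zero-filled page.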

5. Monitor COW with `perf` and `ftrace`

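perf record -e page-faults,minor-faults,major-faults captures page-fault activity. COW faults are counted as minor faults, but so are other faults that need no I/O (the first touch of anonymous memory, for instance), so a jump in minor faults right after a fork is a hint rather than an exact COW count. For a closer look, ftrace's function tracer can be pointed at the kernel's write-protect fault handler (do_wp_page() in mainline, the path called handle_cow_fault() in this article), provided the symbol appears in available_filter_functions on your kernel.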
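Per-process memory accounting offers another view: on kernels that provide /proc/<pid>/smaps_rollup, the Private_Dirty field grows as a process accumulates private copies of pages. The sketch below pulls one field out of such a file; `smaps_field_kb()` is a hypothetical helper, and it works on any file that uses the same `Name:   value kB` layout.

```c
#include <stdio.h>
#include <string.h>

/* Returns the value in kB of `field` (for example "Private_Dirty:") from a
 * smaps-style file, or -1 if the file or the field is missing. */
long smaps_field_kb(const char *path, const char *field)
{
    char line[256];
    long kb = -1;
    size_t flen = strlen(field);
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, field, flen) == 0) {
            /* The number follows the field name, padded with spaces. */
            if (sscanf(line + flen, "%ld", &kb) != 1)
                kb = -1;
            break;
        }
    }

    fclose(f);
    return kb;
}
```

Sampling `smaps_field_kb("/proc/self/smaps_rollup", "Private_Dirty:")` in a worker shortly after fork and again after it has served traffic gives a rough measure of how much inherited memory has been copied.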

Key Takeaways

COW reduces RSS dramatically during fork() and mmap(MAP_PRIVATE), but the first write incurs allocation, copy, and TLB shootdown costs.
Kernel data structures (struct page, mapcount, VMA flags) are the backbone of COW; understanding them helps you diagnose memory‑pressure bugs.
Transparent Huge Pages can amplify copy‑on‑write latency; disable THP for latency‑critical services or tune huge‑page allocation.
Production patterns such as pre‑fork warm‑up, selective MAP_SHARED, and explicit madvise calls mitigate COW’s hidden costs.
Observability matters: use perf, slabtop, and ftrace to spot unexpected COW activity before it becomes a performance bottleneck.
