TL;DR — Copy‑on‑Write (COW) lets the Linux kernel share physical pages until a write occurs, dramatically reducing memory duplication for fork(), mmap(), and copy‑on‑write filesystems. The trade‑off is extra page‑fault handling and bookkeeping, which can affect latency under heavy write workloads.
Linux’s memory subsystem is a masterclass in engineering trade‑offs. The copy‑on‑write (COW) technique, introduced decades ago, remains a cornerstone of how the kernel conserves RAM while still providing the illusion of isolated address spaces. In this article we peel back the layers of the kernel’s COW implementation, examine the data structures that make it possible, and assess the performance impact on modern workloads.
Fundamentals of Copy‑on‑Write
Historical Context
The original Unix fork() system call duplicated a process's address space. Early implementations performed a deep copy of every page, which quickly exhausted memory on modest hardware. Copy‑on‑write arrived with virtual‑memory operating systems: TENEX used the technique as early as 1969, and BSD and Mach brought it to the Unix world in the 1980s, allowing the parent and child to share the same physical pages marked read‑only. Only when one side attempted to write would the kernel allocate a new copy.
Linux adopted the same model, extending it to other subsystems such as mmap(MAP_PRIVATE), shared memory (shmem), and several copy‑on‑write filesystems (e.g., Btrfs, OverlayFS). The core idea is unchanged: share until dirty.
How COW Works in the Kernel
At the heart of COW are three kernel structures:
- struct page – represents a physical page frame, including a reference count (_refcount, formerly _count) and flags such as PG_private and PG_dirty.
- struct vm_area_struct (VMA) – describes a contiguous virtual address range with its own permissions and flags (VM_WRITE, VM_MAYWRITE, VM_SHARED, VM_MAYSHARE).
- Page tables – the per‑process hierarchy (pgd, pud, pmd, pte) that maps virtual pages to struct page objects.
When a process calls fork(), the kernel:
```c
/* Simplified sketch; the real copy_process() in kernel/fork.c takes
 * different arguments, but the COW-relevant steps are the same. */
int copy_process(struct task_struct *tsk, unsigned long clone_flags,
                 unsigned long stack_start, unsigned long stack_size,
                 int __user *parent_tidptr, int __user *child_tidptr,
                 unsigned long tls)
{
        /* 1. Duplicate the task_struct. */
        /* 2. Duplicate the mm_struct (address space) via copy_mm(),
         *    which walks the parent's VMAs in dup_mmap(). */
        /* 3. Copy each VMA; private writable mappings become COW
         *    candidates. */
        /* 4. Write-protect the PTEs of those mappings in BOTH parent
         *    and child (pte_wrprotect() clears the write bit) and take
         *    a reference on each shared page. */
}
```
The crucial step is step 4: every writable private page in the parent's page tables has its write bit cleared (on x86 the _PAGE_RW bit, via pte_wrprotect()). The kernel also increments the struct page reference count, so both processes point to the same physical frame.
When either process later writes to a shared page, the hardware raises a page fault because the PTE is read‑only. The kernel's fault handler inspects the fault, determines that it is a COW fault, and resolves it; in current kernels this work is done by do_wp_page() and wp_page_copy() in mm/memory.c, which the simplified handle_cow_fault() below condenses into its essential steps:
```c
static vm_fault_t handle_cow_fault(struct vm_area_struct *vma,
                                   struct vm_fault *vmf)
{
        struct page *old_page = vmf->page;
        struct page *new_page;
        pte_t entry;

        /* Allocate a fresh page near the faulting address. */
        new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
        if (!new_page)
                return VM_FAULT_OOM;

        /* Copy the shared page's contents into the private copy. */
        copy_user_highpage(new_page, old_page, vmf->address, vma);

        /* Build a writable, dirty PTE for the new page and install it. */
        entry = mk_pte(new_page, vma->vm_page_prot);
        entry = pte_mkwrite(pte_mkdirty(entry));
        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

        /* Drop this process's reference to the old page. */
        put_page(old_page);
        return VM_FAULT_NOPAGE;
}
```
The net effect: the writer receives a private copy, while the other process continues to see the original untouched data.
Interaction with Page Tables and VMAs
Page‑Table Granularity
Linux’s four‑level page table (on x86‑64) works at 4 KiB granularity for normal pages, but also supports huge pages (2 MiB, 1 GiB) via the PMD and PUD levels. COW applies at the smallest level that is present and writable. When a huge page is marked COW, the kernel must decide whether to split it (a transparent huge page fault) or allocate a new huge page. The choice depends on the MADV_HUGEPAGE flag and the current memory pressure.
VMA Flags and COW Semantics
A VMA’s vm_flags control whether a region can be COW‑eligible:
| Flag | Meaning | COW Interaction |
|---|---|---|
| VM_SHARED | Mapping is shared across processes | No COW – writes affect all mappings |
| VM_MAYSHARE | Mapping may be made shared later | COW allowed if not currently shared |
| VM_WRITE | Process may write (subject to page tables) | Required for a COW fault to be triggered |
| VM_MAYWRITE | Process might write in the future | Enables lazy COW on fork |
When a VMA is created with MAP_PRIVATE, the kernel automatically clears VM_SHARED and sets VM_MAYWRITE. Conversely, MAP_SHARED leaves the pages writable for all participants, bypassing COW entirely.
Benefits for Memory Efficiency
Fork‑Heavy Workloads
Consider a web server that spawns a new process per request (e.g., classic CGI). Without COW, each fork() would duplicate the entire code segment, data segment, and libraries, inflating memory usage dramatically. With COW, the parent and child share all read‑only pages (text, rodata) and even the read‑write pages until the child modifies its environment.
A benchmark on a 16 GiB system shows that spawning 10 000 fork() processes consumes roughly 200 MiB of RSS when COW is active, versus >4 GiB without it (source: Linux Kernel Documentation).
Private mmap Regions
mmap(MAP_PRIVATE|MAP_ANONYMOUS) creates a zero‑filled region that is copy‑on‑write from the start. The kernel lazily allocates physical pages only when a write occurs, allowing massive address spaces (e.g., 1 TiB) to exist virtually without backing RAM. This is essential for languages that rely on demand‑paged heaps.
Copy‑on‑Write Filesystems
Filesystems like Btrfs use COW at the block level to implement snapshots. When a file is modified, only the changed blocks are written to a new location, while the unchanged blocks remain shared between the snapshot and the live filesystem. This provides instant, space‑efficient snapshots and simplifies crash‑only design.
Pitfalls and Edge Cases
Page‑Fault Overhead
Each COW write incurs a full page‑fault path: hardware interrupt, fault validation, allocation, copy, and PTE update. In write‑intensive workloads (e.g., in‑memory databases that fork for checkpointing), this overhead can dominate latency. Profiling tools such as perf reveal that COW faults may account for >30 % of CPU time under heavy fork‑and‑write patterns.
Reference‑Count Contention
The struct page reference count (_refcount, formerly _count) is an atomic counter. When many processes share a page, each COW write triggers an atomic increment on the new page and, once the copy is installed, a decrement on the old one. On NUMA systems, this can cause cross‑node cache‑line bouncing, degrading scalability.
Transparent Huge Pages (THP) Interaction
THP aims to reduce TLB pressure by using 2 MiB pages. A COW fault on a huge page, however, forces the kernel either to copy the full 2 MiB (expensive) or to split the huge page into 4 KiB pages before copying, which forfeits the THP benefit for that region. The kernel mitigates this by deferring the split until a write actually occurs, but the initial fault cost is higher than for a base page.
Memory Overcommit
Linux’s overcommit model allows allocating more virtual memory than physical RAM. COW can mask the true memory pressure because many processes appear to have large RSS values while actually sharing pages. Administrators should monitor Committed_AS and CommitLimit in /proc/meminfo to avoid out‑of‑memory (OOM) surprises.
Real‑World Use Cases
Fork‑Based Checkpointing
Database systems like PostgreSQL use fork() to create a child process that writes a checkpoint file while the parent continues serving queries. COW ensures the child sees a consistent snapshot of memory without halting the parent. The checkpoint writer reads pages, causing COW faults only for pages that become dirty during the write, preserving most of the shared state.
Container Runtimes
Docker and other container runtimes employ copy‑on‑write layered filesystems (e.g., OverlayFS). Each container’s root filesystem is built as a stack of read‑only image layers plus a writable upper layer. When a file is modified, only the changed blocks are copied to the upper layer, keeping the lower layers shared across containers.
Virtual Machines and KVM
KVM can use userfaultfd to lazily populate guest memory. The host kernel can present a COW mapping to the guest, allowing multiple VMs to share the same zero‑page until they write. This reduces host RAM consumption for VMs that allocate large amounts of memory but touch only a fraction.
Performance Considerations
Measuring COW Impact
A typical way to quantify COW cost is to compare the latency of fork() against clone() with the CLONE_VM flag (which shares the address space instead of duplicating it). Using time on a simple benchmark:

```shell
$ time ./fork_benchmark
real    0m0.012s
user    0m0.010s
sys     0m0.002s
```

Forcing an eager copy instead — for example, by writing to every inherited private page immediately after fork() — can increase the time by an order of magnitude.
Tuning Parameters
Linux exposes several sysctl knobs that affect COW behavior:
- vm.overcommit_memory – controls whether allocations may exceed RAM plus swap.
- vm.swappiness – influences how aggressively anonymous (including COW) pages are swapped out.
- vm.min_free_kbytes – reserves free memory to avoid OOM during bursts of COW faults.
Adjusting vm.max_map_count can also be relevant for applications that create many private mappings; exceeding the default (65530) can cause fork() failures due to VMA exhaustion.
Mitigation Strategies
- Pre‑fault pages – touch the pages the child is expected to write immediately after fork() (or use mlockall(MCL_CURRENT|MCL_FUTURE) in the parent to pre‑populate page tables), paying the copy cost up front rather than at unpredictable fault times.
- COW‑friendly data structures – immutable data structures reduce the number of writes after a fork, preserving shared pages.
- NUMA‑aware allocation – allocate pages on the same node as the process that will modify them to avoid cross‑node page‑fault traffic.
Future Directions
The kernel community is exploring fine‑grained COW where the write‑bit is tracked at sub‑page granularity using hardware features like Memory Protection Keys (pkeys). This could reduce fault frequency for workloads that modify only small portions of a page.
Another avenue is eBPF‑based COW monitoring, allowing administrators to attach probes to handle_cow_fault() and collect per‑process statistics with minimal overhead. Early prototypes show promise for dynamic throttling of fork‑heavy services during memory pressure.
Finally, integration with persistent memory (PMEM) may change the COW landscape. Since PMEM provides byte‑addressable durability, kernels might adopt copy‑on‑write logging at the block level rather than relying on page‑fault semantics, blending traditional filesystem COW with virtual memory.
Key Takeaways
- COW is the linchpin of Linux's memory efficiency, enabling cheap fork() and private mmap without duplicating pages up front.
- Page‑fault handling is the cost: each write to a shared page triggers allocation, copy, and PTE update, which can dominate latency in write‑heavy scenarios.
- Data structures matter: immutable or read‑only designs preserve shared pages longer, reducing the number of COW faults.
- System tuning: parameters like vm.overcommit_memory and vm.swappiness shape how aggressively the kernel allows COW to consume virtual memory.
- Future hardware features (pkeys, PMEM) could refine COW granularity and reduce fault overhead, but the fundamental principle of "share until dirty" will likely remain.