TL;DR — Copy on Write (COW) lets multiple processes or data structures share the same physical memory until a modification occurs, dramatically lowering RAM consumption and speeding up operations like fork. By deferring copies until they are truly needed, COW improves both memory efficiency and overall system performance.

Copy on Write is a deceptively simple idea that underpins many of the performance tricks we take for granted in modern operating systems, databases, and even high‑level programming languages. At its core, COW replaces an eager copy with a lazy one: instead of duplicating data immediately, the system lets the original and the copy point at the same memory pages, marking them read‑only. Only when a write attempt is made does the kernel—or the runtime—clone the affected page, ensuring that each participant sees its own private view. This approach yields three practical benefits: reduced memory usage, faster process creation, and cheap copies of data that is rarely or never modified.

Below we unpack the mechanics, explore real‑world use cases, and discuss the trade‑offs you need to know before relying on COW in your own projects.

What Copy on Write Is

Historical Context

The concept of copy on write first appeared in the early 1970s as a technique for optimizing virtual memory. Early Unix kernels, for example, used COW to implement the fork() system call efficiently. Instead of copying the entire address space of a parent process—a potentially massive operation—Unix would simply duplicate the page tables and set all pages to read‑only. When either process attempted to write, the kernel would intervene and create a private copy of the page. This strategy replaced an operation proportional to the size of the address space with one proportional only to the much smaller page tables, enabling rapid spawning of child processes.

Later, database systems such as PostgreSQL adopted a similar idea in the form of Multi‑Version Concurrency Control (MVCC). MVCC stores several versions of a row, each visible to a specific transaction, without physically copying the underlying data for every transaction. The principle mirrors COW: data is shared until a transaction needs to modify it, at which point a new version is written.

Core Mechanism

At the hardware level, most CPUs support page‑level protection bits that can mark a page as read‑only, read‑write, or no‑access. The operating system’s memory manager can set a page to read‑only and register a page‑fault handler. When a process attempts to write to such a page, the CPU triggers a fault, transferring control to the kernel. The kernel then:

  1. Allocates a fresh physical page.
  2. Copies the contents of the original page into the new one.
  3. Updates the process’s page table to point to the new page with write permission.
  4. Resumes execution, now operating on a private copy.

Because the fault only occurs on the first write, subsequent writes to the same page incur no additional overhead—only the initial copy cost is paid. This lazy strategy is the essence of COW.
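The four steps above can be sketched as a toy model in Python (a simulation, not real kernel code; the page tables are plain dictionaries and the names `alloc_page` and `cow_write` are invented for illustration):

```python
# Toy simulation of the kernel's copy-on-write fault handling.
# "Physical pages" are bytearrays; a page table maps virtual page
# numbers to (physical_page, writable) entries.

def alloc_page(data=None):
    """Step 1: allocate a fresh 4 KiB physical page (optionally copying data)."""
    return bytearray(data) if data is not None else bytearray(4096)

def cow_write(page_table, vpn, offset, value):
    """Write through a page table, handling a COW fault if needed."""
    page, writable = page_table[vpn]
    if not writable:                    # the CPU would raise a page fault here
        page = alloc_page(page)         # steps 1-2: allocate and copy the page
        page_table[vpn] = (page, True)  # step 3: remap with write permission
    page[offset] = value                # step 4: resume, writing a private copy

# "Fork": parent and child map the same page, both read-only.
shared = alloc_page(b"\x00" * 4096)
parent = {0: (shared, False)}
child  = {0: (shared, False)}

cow_write(child, 0, 0, 42)              # child's first write triggers the copy
print(parent[0][0][0], child[0][0][0])  # 0 42: parent's view is unchanged
print(parent[0][0] is child[0][0])      # False: the pages have diverged
```

A second write through `child` would find `writable == True` and skip the copy entirely, matching the "only the initial copy cost is paid" behavior described above.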

How COW Reduces Memory Footprint

Shared Page Tables

When a process forks, the child inherits a copy of the parent’s page tables. In a naïve implementation, each page table entry would refer to a separate physical page. With COW, the kernel simply increments a reference count for each shared page. As long as the reference count remains greater than one, the page stays in memory and is mapped read‑only into both processes. This sharing can be dramatic: a typical server application might allocate hundreds of megabytes of read‑only data (configuration, code, static assets). Forking dozens of workers does not multiply that memory usage.

Lazy Allocation

Beyond process creation, COW shines in data structures that need to copy large buffers. Consider a functional programming language that treats strings as immutable. When you concatenate two strings, the runtime can allocate a new descriptor that points to the existing buffers and only copies the data when a mutation is attempted (e.g., when converting to a mutable byte array). This pattern appears in Rust’s Cow type and Arc::make_mut (clone‑on‑write over shared ownership), and in Python, where immutable bytes objects can be freely aliased and the data is only copied once a mutable view is requested.
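The same idea can be expressed as a small copy‑on‑write buffer wrapper (a sketch; the `CowBuffer` class is invented for illustration, not a standard library API):

```python
class CowBuffer:
    """Shares an immutable bytes buffer until a write forces a private copy."""

    def __init__(self, data: bytes):
        self._shared = data      # shared, immutable backing buffer
        self._private = None     # private copy, created lazily on first write

    def copy(self) -> "CowBuffer":
        """O(1) 'copy': the new object aliases the same shared buffer."""
        return CowBuffer(self._shared)

    def __getitem__(self, i: int) -> int:
        buf = self._private if self._private is not None else self._shared
        return buf[i]

    def __setitem__(self, i: int, value: int) -> None:
        if self._private is None:                 # first write: pay the copy now
            self._private = bytearray(self._shared)
        self._private[i] = value

a = CowBuffer(b"hello")
b = a.copy()             # no bytes are copied yet
b[0] = ord("H")          # b gets a private buffer; a is untouched
print(bytes(a._shared), bytes(b._private))   # b'hello' b'Hello'
```

Note that `copy()` is constant‑time no matter how large the buffer is; the linear cost is deferred to the first mutation, and never paid at all if the copy is only read.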

Example: Python’s bytes Sharing

# Python 3.11 – demonstration of internal sharing (conceptual)
import sys

a = b"Hello, world!"        # immutable bytes object
b = a                        # both variables reference the same buffer
print(sys.getrefcount(a))    # at least 3: a, b, and getrefcount's own argument
# No copy occurs here; both point to the same memory.

When a mutable copy is later requested (e.g. bytearray(b)), Python allocates a new buffer and copies the data into it, leaving a untouched. The copy only happens when mutability is required.
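Continuing the example, the copy becomes visible only at the point where mutability is requested (a small demonstration; object identity stands in for buffer identity):

```python
a = b"Hello, world!"
b = a                       # aliasing: same object, no copy
assert a is b               # both names refer to one immutable buffer

m = bytearray(b)            # mutability requested: Python copies the data now
m[0] = ord("h")             # mutate the private copy

print(a)                    # b'Hello, world!'  (original untouched)
print(bytes(m))             # b'hello, world!'  (mutated private copy)
```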

Performance Gains in Real‑World Systems

Fork in Unix/Linux

The fork() system call is the textbook example of COW’s impact. Without COW, creating a child process would involve copying every page of the parent’s address space—a costly O(N) operation where N is the size of the address space. With COW, the kernel merely copies the page table structures (a few kilobytes) and marks all pages read‑only. The time to fork() thus drops from milliseconds to microseconds, even for memory‑heavy applications.

As a rough illustration, for a process using 1 GB of RAM, fork() with COW can complete in around a millisecond, whereas eagerly copying the full address space could take hundreds of milliseconds on the same hardware. This speedup is why web servers like Apache and Nginx historically used a prefork model: a master process forks worker processes on demand, each starting almost instantly while sharing the same code and static data.

Databases and MVCC

PostgreSQL’s MVCC implementation leverages COW at the storage engine level. Each transaction sees a snapshot of the database as it existed at the start of the transaction. When a row is updated, PostgreSQL does not overwrite the existing tuple; instead, it writes a new version to a different location and stamps the old version with the updating transaction’s ID, so that visibility rules determine which version each transaction sees. The old version remains on disk and in memory, shared by any other transaction that started earlier.

This design offers two major advantages:

  1. Read‑only queries never block writers, because they can safely read the older version without being affected by concurrent updates.
  2. Storage overhead is reclaimed incrementally: old row versions accumulate alongside live rows until vacuuming removes the versions no active transaction can still see.
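The versioning scheme can be sketched with a toy tuple store (a drastic simplification of MVCC; the `xmin`/`xmax` fields follow PostgreSQL's naming, but the store itself is invented):

```python
# Toy MVCC: each row version records the transaction that created it (xmin)
# and the transaction that superseded it (xmax). A reader holding snapshot S
# sees a version iff xmin <= S < xmax. Old versions are never overwritten.

INF = float("inf")

class MvccTable:
    def __init__(self):
        self.versions = {}             # key -> list of [xmin, xmax, value]

    def insert(self, txid, key, value):
        self.versions.setdefault(key, []).append([txid, INF, value])

    def update(self, txid, key, value):
        for v in self.versions[key]:
            if v[1] == INF:            # current version: mark it superseded...
                v[1] = txid
        self.versions[key].append([txid, INF, value])  # ...and append a new one

    def read(self, snapshot, key):
        for xmin, xmax, value in self.versions[key]:
            if xmin <= snapshot < xmax:
                return value
        return None

t = MvccTable()
t.insert(1, "k", "old")
t.update(5, "k", "new")
print(t.read(3, "k"), t.read(7, "k"))  # old new: each reader sees its snapshot
```

The update never blocks the older reader, and the older reader never blocks the update, which is exactly the non‑blocking property described above.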

Container Runtime Optimization

Modern container runtimes such as Docker and LXC use copy‑on‑write layered filesystems (e.g., OverlayFS, AUFS) to share common base images among containers. An image layer containing a Linux distribution’s core files is mounted read‑only and shared across all containers that use it. When a container writes to a file, the filesystem creates a copy of that specific block in the upper writable layer, leaving the lower layer untouched.

This strategy reduces the storage needed for dozens of containers derived from the same base image from tens of gigabytes to just a few gigabytes, while also speeding up container startup because the kernel does not need to duplicate the entire filesystem tree.
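The layering can be modeled with two dictionaries (a toy model of an overlay filesystem; real OverlayFS operates on files and directories and supports whiteouts for deletion, which this sketch omits):

```python
class OverlayFs:
    """Read-only lower layer shared by all containers; writes go to 'upper'."""

    def __init__(self, lower: dict):
        self.lower = lower    # shared base image layer (never modified)
        self.upper = {}       # this container's private writable layer

    def read(self, path):
        # The upper layer shadows the lower one, like an overlay mount.
        return self.upper.get(path, self.lower.get(path))

    def write(self, path, data):
        # Copy-up: the write lands in the upper layer; lower stays pristine.
        self.upper[path] = data

base = {"/etc/os-release": "ID=debian"}    # one shared image layer
c1, c2 = OverlayFs(base), OverlayFs(base)  # two containers, zero duplication

c1.write("/etc/os-release", "ID=custom")   # c1 diverges in its upper layer
print(c1.read("/etc/os-release"))          # ID=custom
print(c2.read("/etc/os-release"))          # ID=debian (shared base untouched)
```

However many containers are created, the base layer exists once; each container pays storage only for the paths it actually modifies.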

Implementation Details and Pitfalls

Reference Counting

The kernel’s page‑fault handler relies on a reference count per physical page to know when a page can be freed. Each time a new process maps a page read‑only, the count increments; each time a process unmaps the page or writes to it (triggering a copy), the count decrements. Incorrect handling of reference counts can lead to memory leaks (if counts never drop to zero) or premature deallocation (if counts are decremented too many times); the latter produces use‑after‑free bugs that manifest as crashes or data corruption.
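The lifecycle can be sketched as follows (toy code; a real kernel keeps the count in its per‑page metadata, and the `PhysPage` and `cow_fault` names here are invented for illustration):

```python
class PhysPage:
    """A physical page with a reference count, as the kernel tracks it."""

    free_count = 0                     # pages returned to the allocator

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refcount = 0

    def get(self):
        self.refcount += 1             # a new read-only mapping was added
        return self

    def put(self):
        self.refcount -= 1             # a mapping was removed (unmap or COW copy)
        assert self.refcount >= 0, "over-decrement: premature free"
        if self.refcount == 0:
            PhysPage.free_count += 1   # last user gone: page can be reclaimed

def cow_fault(page: PhysPage) -> PhysPage:
    """On a write fault: take a private copy, then drop one reference."""
    private = PhysPage(page.data).get()
    page.put()
    return private

shared = PhysPage(b"\x00" * 4096).get().get()  # mapped by parent and child
child_page = cow_fault(shared)                 # child writes: copy + put
assert shared.refcount == 1                    # parent still maps the original
shared.put()                                   # parent exits: count hits zero
print(PhysPage.free_count)                     # 1: the shared page was freed
```

A missing `put()` here would model a leak (the page is never freed); an extra one would model the premature‑deallocation bug described above, which the assertion catches.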

Write‑Fault Handling

The latency of a COW copy is dominated by the page‑fault handling path:

  1. Fault detection – the CPU raises a page‑fault exception.
  2. Kernel entry – the kernel’s fault handler is invoked.
  3. Page allocation – a free page is fetched from the buddy allocator.
  4. Copy – the kernel copies 4 KB (or larger, if using huge pages) from the original page.
  5. TLB shoot‑down – the translation‑lookaside buffer entries for the old mapping must be invalidated on all CPUs that may have cached them.

While each step is fast, the cumulative cost can be noticeable in write‑intensive workloads that repeatedly touch shared pages. To mitigate this, developers often pre‑touch pages they expect to write, forcing the copy ahead of time and smoothing out latency spikes.

Example: Pre‑touching in C

#include <stdlib.h>

// Allocate a large buffer and pre‑touch the first byte of each page
size_t size = 256 * 1024 * 1024;           // 256 MiB
char *buf = malloc(size);
if (buf == NULL) exit(EXIT_FAILURE);       // handle allocation failure
for (size_t i = 0; i < size; i += 4096) {  // 4096 B: typical page size
    buf[i] = 0; // forces a page‑fault and copy if buf is shared COW memory
}

When COW Can Hurt

  • Write‑heavy workloads: If an application frequently modifies data that is initially shared, the cost of repeated page faults can outweigh the memory savings. In such cases, allocating private memory upfront may be faster.
  • Fragmentation: Each copy creates a new physical page, potentially increasing memory fragmentation over time, especially in long‑running systems with many short‑lived writes.
  • Security considerations: Sharing pages between processes can unintentionally expose data. While the read‑only flag prevents direct modification, side‑channel attacks (e.g., Spectre) can still leak information from shared pages. Some hardened kernels provide options to disable COW for security‑critical processes.

Key Takeaways

  • COW defers copying until a write occurs, replacing an eager O(N) duplication with near‑constant‑time creation plus a per‑page cost on first modification.
  • Memory is saved through shared read‑only pages, which are reference‑counted and kept until no process needs them.
  • Process creation (fork) and container startup become dramatically faster because only page tables are duplicated, not the entire address space.
  • Databases use COW‑style MVCC to enable concurrent reads and writes without locking, improving throughput and consistency.
  • Write‑intensive patterns can suffer from copy‑on‑write overhead, so profiling and possibly pre‑allocating private memory are advisable.
  • Correct reference‑count management and fault handling are critical to avoid leaks, crashes, or security issues.

Further Reading