Deep Dive into Postgres Write-Ahead Logging: Ensuring Data Durability and Crash Recovery in Production

TL;DR — PostgreSQL’s Write‑Ahead Log (WAL) writes every change to a sequential log before touching data files, guaranteeing durability and enabling fast crash recovery. By understanding WAL buffers, checkpoint strategy, and replication hooks, you can tune a production cluster for both safety and performance.

PostgreSQL’s reputation for reliability stems largely from its Write‑Ahead Logging subsystem. In a modern micro‑service landscape where a single outage can cascade across multiple teams, knowing exactly how WAL works, how it interacts with replication, and how to tune it for your hardware is no longer optional—it’s a core production skill. This article walks through the WAL pipeline, the architecture that powers point‑in‑time recovery (PITR) and streaming replication, and concrete configuration patterns you can apply today.

How WAL Works Under the Hood

Log Sequence Numbers (LSNs)

Every WAL record is identified by a Log Sequence Number (LSN), a 64-bit, monotonically increasing value that is the record's byte offset in the WAL stream. Each backend keeps track of the positions it has written; two of the variables in xlog.c are:

Variable	Meaning
`ProcLastRecPtr`	Start position of the last record inserted by this backend.
`XactLastRecEnd`	End position of the last record inserted by the current transaction; a commit flushes WAL up to this point.

Because LSNs are globally ordered, they become the glue that ties together recovery, replication, and hot standby. For example, a standby server can request WAL up to a specific LSN to guarantee it has applied all changes that a primary has confirmed.
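Because an LSN is just a byte position, replication lag can be computed by subtraction. The sketch below is a minimal illustration, assuming the textual form that functions such as pg_current_wal_lsn() return (two hexadecimal halves separated by a slash); the helper names and the sample values are hypothetical.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' (hexadecimal) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Inverse of parse_lsn: render a 64-bit position as 'hi/lo' in uppercase hex."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def lsn_diff(a: str, b: str) -> int:
    """Bytes of WAL between two positions (positive when a is ahead of b)."""
    return parse_lsn(a) - parse_lsn(b)


if __name__ == "__main__":
    primary = "16/B374D848"   # hypothetical current position on the primary
    standby = "16/B3740000"   # hypothetical replay position on a standby
    print(lsn_diff(primary, standby), "bytes of replay lag")
```

On a live server the same subtraction is available in SQL through pg_wal_lsn_diff(), so a helper like this is mainly useful in monitoring scripts that already hold LSNs as strings.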

WAL Buffers and Flush Policy

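When a client issues INSERT, UPDATE, or DELETE, PostgreSQL first writes the change as a record into the in-memory WAL buffers. Their size is set by wal_buffers; the default of -1 picks 1/32 of shared_buffers, with a minimum of 64 kB and a maximum of one WAL segment (16 MiB by default). The buffers are written out to the pg_wal directory under three conditions:

Transaction commit – with synchronous_commit enabled, the committing backend flushes WAL up to and including its commit record before it reports success to the client.
WAL buffers full – a backend that needs space for a new record must first write out older buffer contents.
WAL writer – a background process writes and flushes pending WAL every wal_writer_delay (200 ms by default).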

The flush path is deliberately write-ahead: a modified data page may not be written to its data file until the WAL record describing the change has been flushed to disk. After a power loss, recovery can therefore replay the log from the last checkpoint and reconstruct a consistent state.

/* Simplified pseudo‑code from src/backend/access/transam/xlog.c */
if (XLogNeedsFlush(lsn)) {
    XLogFlush(lsn);   /* performs the fsync */
}

The wal_sync_method parameter selects the system call used for the flush (fsync, fdatasync, open_sync, open_datasync, or fsync_writethrough, depending on the platform). On Linux the default is fdatasync. The pg_test_fsync utility shipped with PostgreSQL measures which method is fastest on a given storage device.
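The write-ahead rule itself fits in a few lines. The following toy key-value store is a sketch of the idea only, not of PostgreSQL's implementation: the class name TinyWal, the JSON record format, and the file name wal.log are all invented for illustration, and os.fsync stands in for whichever wal_sync_method the server would use.

```python
import json
import os
import tempfile


class TinyWal:
    """Toy key-value store that logs each change durably before applying it."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "wal.log")
        self.data = {}
        self._replay()                      # crash recovery: rebuild state from the log
        self.log = open(self.log_path, "ab")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "rb") as f:
            for line in f:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        record = json.dumps({"key": key, "value": value}).encode() + b"\n"
        self.log.write(record)              # 1. append the log record
        self.log.flush()                    # 2. hand Python's buffer to the OS
        os.fsync(self.log.fileno())         # 3. force it to stable storage
        self.data[key] = value              # 4. only now apply the change

    def close(self):
        self.log.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        store = TinyWal(d)
        store.put("balance", "100")
        store.close()                       # simulate a restart
        recovered = TinyWal(d)
        print(recovered.data)
        recovered.close()
```

The in-memory dictionary plays the role of the data files: it can be lost at any time, because replaying the log from the start reproduces it. A real system adds checkpoints so that replay does not have to begin at the first record.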

WAL Segment Lifecycle

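WAL files are stored as fixed-size segments (16 MiB by default). When the current segment fills, PostgreSQL switches to the next one; the 24-hex-digit file name (0000000100000000000000A0) encodes the timeline, log ID, and segment number, eight digits each. After a checkpoint, segments that are no longer needed are recycled or removed, guided by min_wal_size and max_wal_size. Note that max_wal_size is a soft limit: wal_keep_size, a failing archive_command, or a lagging replication slot can all hold segments back and let pg_wal grow beyond it.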
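The naming scheme can be decoded mechanically. The sketch below assumes the default 16 MiB segment size (a cluster initialized with a different size would need a different constant); the function names are hypothetical.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024                      # default segment size in bytes
SEGMENTS_PER_LOG = 0x1_0000_0000 // WAL_SEGMENT_SIZE     # 256 segments per 4 GiB "log"


def parse_segment_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log, segment)."""
    if len(name) != 24:
        raise ValueError(f"expected 24 hex digits, got {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segment_name_for(timeline, lsn):
    """Name of the segment file that contains the byte at position lsn."""
    segno = lsn // WAL_SEGMENT_SIZE
    return f"{timeline:08X}{segno // SEGMENTS_PER_LOG:08X}{segno % SEGMENTS_PER_LOG:08X}"


if __name__ == "__main__":
    print(parse_segment_name("0000000100000000000000A0"))
    print(segment_name_for(1, 0xA0000000))
```

With 16 MiB segments the last field only ever uses its low two hex digits, which is why directory listings of pg_wal show names ending in 00 through FF before the middle field increments.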

Architecture of WAL in Production

Primary‑Standby Replication

Streaming replication copies WAL records from the primary to one or more standbys over a TCP connection. For each connection the primary runs a WAL sender (walsender) process, which streams WAL that has already been flushed to disk on the primary. Each standby runs a WAL receiver (walreceiver) that writes the incoming bytes to its own pg_wal directory; the standby's startup process then replays them.

# Primary side (in postgresql.conf)
wal_level = replica          # enough detail for physical replication and archiving (logical decoding needs 'logical')
max_wal_senders = 10         # maximum concurrent WAL sender connections (standbys and base backups)
wal_keep_size = 1GB          # keep at least 1 GB of WAL for lagging replicas

The replication protocol is deliberately asynchronous by default: the primary does not wait for the standby to confirm receipt before committing. For workloads that cannot tolerate any data loss, you can enable synchronous replication.

# Enable synchronous replication on the primary
synchronous_standby_names = 'standby1,standby2'  # priority list: the first connected standby is the synchronous one

With synchronous_standby_names set and synchronous_commit left at on (the default), the primary waits until the synchronous standby has flushed the commit record to its own durable storage before acknowledging the client commit.

Point‑In‑Time Recovery (PITR)

PITR relies on the fact that WAL records are strictly ordered and that every commit record carries a timestamp. To recover to a specific moment, you:

Restore a base backup (a physical copy of the data directory).
Replay archived WAL up to the target LSN, timestamp, transaction ID, or named restore point. During archive recovery PostgreSQL fetches each segment with restore_command.

# postgresql.conf, plus an empty recovery.signal file in the data directory (PostgreSQL 12+; older versions use recovery.conf)
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2026-05-27 14:30:00'

During recovery, PostgreSQL runs the same WAL replay engine that it uses after a crash, but it stops once the target point is reached. This makes it possible to undo a bad migration or a user error without restoring from an older backup.
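The stopping rule for a time-based target can be sketched as a loop over an ordered record stream. This is a simplified model, not PostgreSQL's replay code: the Record class and replay_until function are hypothetical, and the comparison mirrors the default inclusive behaviour, in which a commit stamped exactly at the target time is still replayed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Record:
    lsn: int
    xid: int
    kind: str                                  # "change" or "commit"
    commit_time: Optional[datetime] = None     # present only on commit records


def replay_until(records, target_time):
    """Return the xids whose commits are replayed before recovery stops.

    Change records carry no timestamp, so only commit records are compared
    with the target. Changes of transactions that never reach their commit
    record are replayed too, but remain invisible because they never commit.
    """
    committed = []
    for rec in records:
        if rec.kind == "commit":
            if rec.commit_time > target_time:
                break                          # stop just before this commit
            committed.append(rec.xid)
    return committed


if __name__ == "__main__":
    target = datetime(2026, 5, 27, 14, 30, 0)
    stream = [
        Record(100, 1, "change"),
        Record(200, 1, "commit", datetime(2026, 5, 27, 14, 29, 0)),
        Record(300, 2, "change"),
        Record(400, 2, "commit", datetime(2026, 5, 27, 14, 31, 0)),
    ]
    print(replay_until(stream, target))
```

This is why a time target restores whole transactions: recovery never stops in the middle of one, it stops between two commit records.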

Patterns for Tuning WAL for Durability

Synchronous vs Asynchronous Commit

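Mode	Latency impact	Data loss risk
`synchronous_commit = on` (default)	Commit waits for the local WAL flush, plus a network round trip and the standby's flush when a synchronous standby is configured	None for acknowledged commits
`synchronous_commit = off`	Commit returns without waiting for the WAL flush	After a crash, up to about three times wal_writer_delay of recent commits can be lost; the database itself stays consistent
`synchronous_commit = remote_write`	Waits until the standby has written the WAL to its operating system, but not flushed it	Loss only if the primary fails and the standby's operating system crashes before the flush

Production teams often adopt a hybrid approach: critical financial writes use synchronous_commit = on, while bulk analytics inserts use off. The setting can be changed per session with SET, or per transaction with SET LOCAL: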

BEGIN;
SET LOCAL synchronous_commit TO OFF;
INSERT INTO analytics_events VALUES (...);
COMMIT;

Checkpoint Tuning

Checkpoints flush dirty buffers to disk and write a checkpoint record to WAL. The frequency of checkpoints directly affects write amplification and recovery time.

Key parameters:

Parameter	Typical value	Effect
`checkpoint_timeout`	5 min (default)	Maximum interval between checkpoints
`max_wal_size`	2 GB – 4 GB (depends on workload; default 1 GB)	Soft limit on WAL accumulated between checkpoints; approaching it triggers a checkpoint
`checkpoint_completion_target`	0.9	Spread checkpoint I/O over the interval

A production cluster with high write throughput (e.g., logging service) may benefit from a larger max_wal_size and a longer checkpoint_timeout to reduce checkpoint‑induced I/O spikes. However, larger values increase the amount of WAL that must be replayed after a crash.

# Example tuned settings for a 64 vCPU, 256 GB RAM node
checkpoint_timeout = 15min
max_wal_size = 8GB
checkpoint_completion_target = 0.95
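To judge whether settings like these leave checkpoints driven by the timer or by WAL volume, a rough estimate is enough. The sketch below assumes that a volume-triggered checkpoint starts once about max_wal_size / (1 + checkpoint_completion_target) of WAL has accumulated; treat that divisor as an approximation of the server's internal calculation, and the function name and sample rates as hypothetical.

```python
def checkpoint_trigger(wal_bytes_per_sec, max_wal_size_bytes,
                       checkpoint_timeout_sec, completion_target=0.9):
    """Estimate what starts the next checkpoint and after how many seconds.

    Returns ("wal", seconds) when WAL volume is expected to trigger the
    checkpoint before the timer does, otherwise ("timeout", seconds).
    """
    if wal_bytes_per_sec <= 0:
        return "timeout", float(checkpoint_timeout_sec)
    # Assumed approximation of the WAL budget between checkpoints.
    wal_budget = max_wal_size_bytes / (1.0 + completion_target)
    seconds_to_fill = wal_budget / wal_bytes_per_sec
    if seconds_to_fill < checkpoint_timeout_sec:
        return "wal", seconds_to_fill
    return "timeout", float(checkpoint_timeout_sec)


if __name__ == "__main__":
    MiB, GiB = 1024 ** 2, 1024 ** 3
    # Settings from the example above, with two hypothetical WAL rates.
    for rate in (1 * MiB, 20 * MiB):
        print(rate // MiB, "MiB/s ->", checkpoint_trigger(rate, 8 * GiB, 15 * 60, 0.95))
```

If the estimate says "wal" at your normal write rate, checkpoints are arriving more often than checkpoint_timeout intends, which usually means max_wal_size is the setting to raise. The WAL generation rate can be measured by sampling pg_current_wal_lsn() twice and dividing the difference by the interval.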

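WAL Compression

The wal_compression parameter compresses the full-page images that PostgreSQL writes to WAL the first time a page is modified after a checkpoint; ordinary row-level records are not compressed. The parameter has existed since PostgreSQL 9.5 with the built-in pglz method, and PostgreSQL 15 added lz4 and zstd as alternatives.

wal_compression = on         # pglz; PostgreSQL 15+ also accepts lz4 or zstd

The savings depend on how much of the WAL consists of full-page images, and compression costs some CPU on the server that generates the WAL, so measure both on your own workload. The setting only matters on the server that writes the WAL: a standby decompresses the images during replay whatever its own setting.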
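How much compression helps depends on what the logged pages contain. The sketch below illustrates that with two synthetic 8 kB pages; zlib is used purely as a stand-in available in the Python standard library (PostgreSQL itself uses pglz, lz4, or zstd), and the page contents are invented.

```python
import os
import zlib

PAGE_SIZE = 8192   # default PostgreSQL block size in bytes


def compressed_ratio(page):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(page)) / len(page)


# A mostly empty page: 1 kB of repetitive row data followed by zero padding.
sparse_page = b"row-data" * 128 + bytes(PAGE_SIZE - 1024)
# A page full of random bytes, which no general-purpose compressor can shrink.
random_page = os.urandom(PAGE_SIZE)

if __name__ == "__main__":
    print("sparse page ratio:", round(compressed_ratio(sparse_page), 3))
    print("random page ratio:", round(compressed_ratio(random_page), 3))
```

Pages holding already-compressed or encrypted column values behave like the random page, so a table full of such data gains little from WAL compression while still paying the CPU cost.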

Common Failure Modes and Mitigations

Disk Full / I/O Saturation

When pg_wal runs out of space, the server will shut down to prevent corruption. To avoid surprise outages:

Monitor pg_wal size with Prometheus metrics (pg_wal_size_bytes).
Set wal_keep_size conservatively and configure a WAL archive that offloads old segments to cheap object storage (e.g., AWS S3).

archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'

A typical archiving pipeline copies completed segments within seconds of creation, keeping the local pg_wal directory well below the max_wal_size threshold.
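Whatever the destination, an archive command should report success only when the segment is safely stored, and should refuse to overwrite a segment that is already in the archive. The sketch below shows those two properties for a local directory; the function name and the script wiring are hypothetical, and /wal_archive is simply the path used in the restore_command example earlier.

```python
import os
import shutil
import sys


def archive_segment(src_path, archive_dir, file_name):
    """Copy one WAL segment into archive_dir; return a shell-style exit code."""
    dest = os.path.join(archive_dir, file_name)
    if os.path.exists(dest):
        return 1                          # never overwrite an archived segment
    tmp = dest + ".tmp"
    shutil.copyfile(src_path, tmp)
    with open(tmp, "rb+") as f:
        os.fsync(f.fileno())              # make the copy durable before publishing it
    os.replace(tmp, dest)                 # atomic rename within the same directory
    return 0


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical wiring: archive_command = 'python3 /path/to/archive.py %p %f'
    sys.exit(archive_segment(sys.argv[1], "/wal_archive", sys.argv[2]))
```

A non-zero exit code makes PostgreSQL keep the segment in pg_wal and retry later, which is the safe default but also the reason a persistently failing archive command eventually fills the disk. Alerting on archiving failures (for example via the pg_stat_archiver view) closes that gap.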

Corruption and `pg_wal` Repair

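Hardware errors can corrupt WAL files. Every WAL record carries a CRC, so replay stops at the first record that fails verification. PostgreSQL provides pg_waldump to inspect WAL records and, as a last resort, pg_resetwal, which discards the existing WAL and resets the control file so that a damaged server can start again; transactions may be lost and the data may be left inconsistent. The recommended mitigation is to:

Run pg_checksums (PostgreSQL 12+) to detect corruption in data pages early. It requires data checksums to be enabled and the server to be cleanly shut down, and it does not examine WAL files.
Maintain a recent base backup so you can re-initialize the cluster if necessary.

# Verify data-page checksums (run while the server is stopped)
pg_checksums --check -D /var/lib/postgresql/15/main

Never delete files from pg_wal by hand: PostgreSQL does not recreate a missing segment, and recovery fails if a segment it needs is gone. If corruption is isolated to a single segment that has already been archived, restore the intact copy from the archive; otherwise restore from the base backup and replay the archived WAL.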

Network Partitions in Synchronous Replication

A network partition can cause the primary to block indefinitely if all configured synchronous standbys become unreachable. To avoid a full outage:

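Use quorum-based synchronous replication (synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'), which lets commits proceed as long as any two of the listed standbys confirm them.
Set wal_sender_timeout to a reasonable value (e.g., 60 s, which is also the default) so that connections to unresponsive standbys are closed promptly. PostgreSQL does not fall back to asynchronous mode on its own: to unblock commits, an operator or a failover manager has to change synchronous_standby_names and reload the configuration.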
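The quorum rule is easy to state in code. The sketch below models only the decision, under the assumption that each standby reports the highest LSN it has flushed; the function name, the dictionary of reported positions, and the LSN values are hypothetical.

```python
def commit_can_proceed(required, candidates, flushed_lsn_by_standby, commit_lsn):
    """Quorum rule behind 'ANY n (...)': at least `required` of the listed
    standbys must report a flushed position at or beyond the commit record."""
    confirmed = [
        name for name in candidates
        if flushed_lsn_by_standby.get(name, -1) >= commit_lsn
    ]
    return len(confirmed) >= required


if __name__ == "__main__":
    standbys = ["standby1", "standby2", "standby3"]
    # standby3 is partitioned away and reports nothing.
    reported = {"standby1": 0x5000, "standby2": 0x4F00}
    print(commit_can_proceed(2, standbys, reported, commit_lsn=0x4F00))
    print(commit_can_proceed(2, standbys, reported, commit_lsn=0x5000))
```

With three candidates and a quorum of two, one unreachable standby never blocks commits, but losing two does; sizing the candidate list is therefore a trade-off between tolerated failures and the number of replicas to operate.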

Key Takeaways

WAL guarantees durability by persisting every change before data files are modified; LSNs provide a globally ordered timeline used by recovery and replication.
Tuning max_wal_size, checkpoint_timeout, and checkpoint_completion_target lets you balance I/O spikes against recovery time.
Synchronous replication protects acknowledged commits against the loss of the primary at the cost of latency; setting synchronous_commit per session or per transaction lets you prioritize critical writes.
Archiving WAL to external storage protects against disk‑full failures and enables point‑in‑time recovery across regions.
Regularly monitor WAL growth, checksum integrity, and replication lag to catch problems before they cause production outages.
