TL;DR — PostgreSQL’s Write‑Ahead Log (WAL) is the backbone of durability and crash‑recovery. Understanding its segmented architecture, LSN flow, and tuning knobs lets you guarantee data integrity while meeting latency SLAs.
PostgreSQL’s WAL is more than a simple redo log; it is a carefully engineered pipeline that persists every change before it touches the data files. In modern micro‑service back‑ends, analytics platforms, and financial systems, the WAL is the silent guarantor that a power loss or node crash never corrupts user data. This post walks through the low‑level architecture, explains the durability guarantees PostgreSQL makes, and shows concrete configuration and monitoring patterns you can adopt today.
WAL Architecture Overview
At a high level PostgreSQL writes every modification to a write‑ahead log before any data page is flushed to the main tablespace. The WAL is stored as a series of fixed‑size segment files (default 16 MiB) in $PGDATA/pg_wal. Each segment is identified by a log sequence number (LSN) that monotonically increases.
Log Sequence Numbers (LSNs)
An LSN is a 64‑bit integer split into two 32‑bit parts: the log file identifier and the offset within that file. For example, 0/16B6C0 means “log file 0, byte offset 0x16B6C0”. Every operation that mutates data generates a record with its own LSN; the record’s LSN becomes the commit LSN when the transaction is committed.
Because LSNs are strictly increasing, they provide a natural ordering for:
- Replication – a standby can request all WAL records after a known LSN.
- Point‑in‑time recovery (PITR) – you can restore up to any LSN you have archived.
- Checkpoint coordination – the system knows which pages are safe to evict.
Segmented Log Files
PostgreSQL pre‑allocates WAL segments to avoid costly fsync on every write. When a segment fills, the server creates the next one and, after a checkpoint, recycles old segments that are no longer needed for crash recovery or replication.
The segment lifecycle is:
- Active – currently being written to.
- Ready – full but not yet archived.
- Recycled – after a checkpoint confirms all needed data is persisted elsewhere.
You can observe the state with pg_walfile_name(lsn) or by inspecting the pg_wal directory.
Durability Guarantees and Write Paths
PostgreSQL’s durability promise is expressed in three commit modes, each trading latency for safety. The underlying I/O path is the same: the WAL record is written to the OS page cache, then fsync‑ed to durable storage.
Synchronous vs Asynchronous Commit
| Mode | Guarantees | Typical Latency |
|---|---|---|
| synchronous_commit = on (default) | WAL record is fsync‑ed before transaction reports success. Guarantees durability even after a crash. | 1‑5 ms (SSD) |
| synchronous_commit = remote_apply | In streaming replication, the primary waits until the standby has applied the WAL, not just received it. Provides cross‑region durability. | 5‑30 ms |
| synchronous_commit = off | Transaction returns before WAL is flushed. If the server crashes, committed data may be lost. Useful for bulk loads where occasional loss is acceptable. | <1 ms |
The choice is controlled per‑transaction (SET synchronous_commit TO ...;) or globally in postgresql.conf.
Crash Recovery Flow
When PostgreSQL starts after an abrupt shutdown, it runs REDO:
- Read the latest checkpoint record to know the recovery target LSN.
- Scan WAL from that LSN forward, applying each record to the data files in memory.
- Stop once it reaches the end of the WAL or a user‑specified recovery target (e.g., for PITR).
The recovery process is deterministic because every data change is recorded in the WAL. As described in the official docs, “the WAL guarantees that the database can be restored to a consistent state after any crash” (PostgreSQL WAL docs).
Implementing WAL in Production
Understanding the theory is only half the battle; the real value comes from configuring, monitoring, and tuning WAL for your workload.
Configuring Checkpoint Intervals
Checkpoints flush dirty data pages to disk and recycle old WAL segments. Two key parameters:
# postgresql.conf
checkpoint_timeout = 5min # maximum time between checkpoints
max_wal_size = 2GB # stop creating new WAL files beyond this size
min_wal_size = 1GB # keep at least this much WAL around
checkpoint_completion_target = 0.9 # spread I/O over 90 % of checkpoint interval
A shorter checkpoint_timeout reduces recovery time but increases I/O spikes. Production systems often tune max_wal_size to fit the storage budget while ensuring the write‑ahead log never throttles during peak load.
Monitoring WAL Activity
PostgreSQL exposes several statistics views:
SELECT
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS wal_written,
pg_size_pretty(pg_wal_lsn_diff(pg_last_checkpoint_lsn(), pg_current_wal_lsn())) AS lag_since_checkpoint,
checkpoint_time,
checkpoint_location
FROM pg_stat_bgwriter;
Grafana dashboards built on pg_stat_activity, pg_stat_replication, and pg_stat_wal can alert when:
- WAL write rate exceeds the storage subsystem’s bandwidth.
- Checkpoint duration spikes, indicating I/O contention.
- Replication lag (standby
pg_last_wal_receive_lsn()vs primarypg_current_wal_lsn()) grows beyond SLA.
Tuning for Low Latency
If your application demands sub‑millisecond commit latency, consider:
- Dedicated WAL disk – place
$PGDATA/pg_walon a high‑throughput NVMe volume separate from data files. wal_sync_method = fdatasync– on Linux,fdatasyncis often faster thanfsyncfor WAL because it skips metadata flushing.wal_compression = on– reduces write volume at the cost of CPU; useful when network‑bound replication is the bottleneck.
Example of enabling compression:
wal_compression = on
Patterns in Production
Streaming Replication with WAL
Streaming replication streams WAL records over TCP to one or more standby servers. The primary writes to its local WAL, then ships each segment to the standby(s) in near‑real‑time.
Key settings:
wal_level = replica # minimal level required for streaming
max_wal_senders = 10 # number of concurrent replication connections
wal_keep_size = 1GB # retain recent WAL locally for lagging standbys
On the standby, configure primary_conninfo and enable hot_standby = on to allow read‑only queries while applying WAL. This pattern provides high availability and read scaling without sacrificing durability.
Point‑in‑Time Recovery (PITR)
PITR lets you restore the database to any moment captured in your archived WAL. The workflow:
- Base backup –
pg_basebackup -D /backup --format=tar. - Continuous archiving – set
archive_mode = onand a command such as:archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f' - Restore – place the base backup, then replay archived WAL up to the desired LSN using
recovery_target_timeorrecovery_target_lsn.
Because each WAL record is immutable, you can guarantee that the recovered state is identical to the one that existed at the target moment.
Key Takeaways
- PostgreSQL’s WAL is a segmented, LSN‑ordered log that guarantees durability by persisting every change before data files are updated.
- Commit modes (
synchronous_commit) let you balance latency against safety; choose the level that matches your SLA. - Proper checkpoint configuration (
max_wal_size,checkpoint_timeout) prevents uncontrolled WAL growth while keeping recovery windows short. - Monitoring
pg_stat_bgwriterandpg_stat_replicationgives early warnings of I/O saturation or replication lag. - Production patterns such as streaming replication and PITR rely on the WAL’s immutable nature; configure
wal_level,archive_mode, and replication slots accordingly.
Further Reading
- PostgreSQL WAL Documentation – official reference for WAL internals and recovery.
- Runtime Configuration – WAL Settings – exhaustive list of tunable parameters.
- Understanding PostgreSQL WAL – practical deep‑dive with diagrams and real‑world examples.