TL;DR — PostgreSQL’s Write‑Ahead Log (WAL) writes every change to a sequential log before touching data files, guaranteeing durability and enabling fast crash recovery. By understanding WAL buffers, checkpoint strategy, and replication hooks, you can tune a production cluster for both safety and performance.
PostgreSQL’s reputation for reliability stems largely from its Write‑Ahead Logging subsystem. In a modern micro‑service landscape where a single outage can cascade across multiple teams, knowing exactly how WAL works, how it interacts with replication, and how to tune it for your hardware is no longer optional—it’s a core production skill. This article walks through the WAL pipeline, the architecture that powers point‑in‑time recovery (PITR) and streaming replication, and concrete configuration patterns you can apply today.
How WAL Works Under the Hood
Log Sequence Numbers (LSNs)
Every WAL record that PostgreSQL writes has a Log Sequence Number (LSN), a 64‑bit, monotonically increasing value that identifies a byte offset inside the WAL stream. Internally, each backend tracks several LSNs while it works, including:
| Variable | Meaning |
|---|---|
| ProcLastRecPtr | Start of the last WAL record inserted by this backend. |
| XactLastRecEnd | End of the last WAL record inserted by the current transaction. |
Because LSNs are globally ordered, they become the glue that ties together recovery, replication, and hot standby. For example, a standby server can request WAL up to a specific LSN to guarantee it has applied all changes that a primary has confirmed.
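To make the "byte offset" idea concrete, here is a minimal sketch that converts the textual LSN form PostgreSQL prints (two hexadecimal halves separated by a slash) into a 64‑bit integer and back. The function names are illustrative, not part of any PostgreSQL client library; on the server, pg_wal_lsn_diff() does the subtraction for you.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '16/B374D848' to a 64-bit integer offset."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Convert a 64-bit offset back to the 'hi/lo' hexadecimal form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def lsn_diff(a: str, b: str) -> int:
    """Bytes of WAL between two LSNs (positive when a is ahead of b)."""
    return parse_lsn(a) - parse_lsn(b)
```

For example, lsn_diff("0/3000148", "0/3000060") gives the number of bytes a standby that has replayed up to the second position still has to apply to reach the first.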
WAL Buffers and Flush Policy
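When a client issues INSERT, UPDATE, or DELETE, PostgreSQL writes the change into in‑memory WAL buffers. Their size is set by wal_buffers; the default of -1 picks 1/32 of shared_buffers, capped at the size of one WAL segment (typically 16 MiB). WAL moves from the buffers to the pg_wal directory mainly under these conditions:
- Transaction commit: the backend flushes WAL up to its commit record (an fsync or equivalent) before reporting success, unless synchronous_commit is off.
- WAL buffers full: when no free buffer space remains, a backend has to write older WAL out to make room.
- Background WAL writer: the WAL writer process flushes periodically (every wal_writer_delay), which bounds how long asynchronously committed work exists only in memory.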
The flush path is deliberately write‑ahead: the data files are not touched until the WAL record is safely on disk. This guarantees that after a power loss, the recovery process can replay the log and reconstruct a consistent state.
/* Simplified pseudo‑code from src/backend/access/transam/xlog.c */
if (XLogNeedsFlush(lsn)) {
    XLogFlush(lsn); /* performs the fsync */
}
The wal_sync_method parameter controls the low‑level system call (fsync, fdatasync, open_datasync, etc.) used for the flush. On Linux the default is fdatasync, which is usually a good choice; the pg_test_fsync utility that ships with PostgreSQL lets you compare the methods on your own storage.
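The write‑ahead rule itself is easy to demonstrate outside PostgreSQL. The following is a toy sketch, not PostgreSQL's implementation: a key‑value store (the ToyWal class and its JSON‑lines log format are invented for illustration) that appends each change to a log, calls os.fsync, and only then updates its in‑memory "data files". On startup it rebuilds its state by replaying the log.

```python
import json
import os


class ToyWal:
    """Toy key-value store that logs every change before applying it."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data = {}      # stands in for the data files
        self._replay()      # crash recovery: rebuild state from the log
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break   # torn final record: stop replay here
                self.data[rec["key"]] = rec["value"]

    def put(self, key: str, value: str) -> None:
        record = json.dumps({"key": key, "value": value})
        self._log.write(record + "\n")
        self._log.flush()               # user-space buffer -> OS cache
        os.fsync(self._log.fileno())    # OS cache -> durable storage
        self.data[key] = value          # only now touch the "data files"

    def close(self) -> None:
        self._log.close()
```

Creating a second ToyWal on the same path after a simulated crash recovers every change whose log record was fully written, which is the same guarantee the WAL flush gives a committed transaction.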
WAL Segment Lifecycle
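WAL files are stored as fixed‑size segments (default 16 MiB). A new segment is started when the previous one fills, and the 24‑hex‑digit file name (for example 0000000100000000000000A0) encodes the timeline ID followed by the high and low parts of the segment number. PostgreSQL removes or recycles segments it no longer needs at each checkpoint, guided by min_wal_size, max_wal_size, and wal_keep_size. Note that max_wal_size is a soft limit: a failing archive_command, an inactive replication slot, or a large wal_keep_size can make pg_wal grow beyond it.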
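As a sketch of how the file name is derived, the function below maps a timeline and a 64‑bit LSN to a segment name, assuming the default 16 MiB segment size. The function name is illustrative; on a running server the built‑in pg_walfile_name() answers the same question.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default; set at initdb time


def wal_file_name(timeline: int, lsn: int,
                  seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the name of the WAL segment file that contains `lsn`."""
    segno = lsn // seg_size                    # global segment number
    segs_per_log = 0x1_0000_0000 // seg_size   # segments per 4 GiB of WAL
    high, low = divmod(segno, segs_per_log)
    return f"{timeline:08X}{high:08X}{low:08X}"
```

With these assumptions, timeline 1 and the LSN 0/A0000000 (the integer 0xA0000000) map to 0000000100000000000000A0, the example name used above.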
Architecture of WAL in Production
Primary‑Standby Replication
Streaming replication copies WAL records from the primary to one or more standbys over a TCP connection. For each connected standby, the primary runs a WAL sender process that reads WAL that has already been flushed locally and writes it to the network socket. Each standby runs a WAL receiver that writes the incoming bytes to its own pg_wal directory, after which the startup process replays them.
# Primary side (in postgresql.conf)
wal_level = replica # enough for WAL archiving and physical replication; logical decoding needs 'logical'
max_wal_senders = 10 # max concurrent WAL sender connections (standbys plus base backup clients)
wal_keep_size = 1GB # keep at least 1 GB of WAL for lagging replicas
The replication protocol is deliberately asynchronous by default: the primary does not wait for the standby to confirm receipt before committing. For workloads that cannot tolerate any data loss, you can enable synchronous replication.
# Enable synchronous replication on the primary
synchronous_standby_names = 'standby1,standby2' # priority list; same as FIRST 1 (standby1, standby2)
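With synchronous_standby_names set and synchronous_commit left at on (the default), the primary waits until the required synchronous standby has flushed the commit's WAL to its own durable storage before acknowledging the client commit. If synchronous_standby_names is empty, on only waits for the local flush.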
Point‑In‑Time Recovery (PITR)
PITR relies on WAL being strictly ordered and on commit records carrying timestamps. To recover to a specific moment, you:
- Restore a base backup (a physical copy of the data directory).
- Replay archived WAL up to the target LSN or timestamp using archive recovery (a restore_command plus a recovery_target_* setting).
# postgresql.conf on PostgreSQL 12+ (plus an empty recovery.signal file in the data directory); recovery.conf on 11 and earlier
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2026-05-27 14:30:00'
During recovery, PostgreSQL runs the same WAL replay engine that it uses after a crash, but it stops once the target point is reached. This makes it possible to undo a bad migration or a user error without restoring from an older backup.
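One practical consequence is that the recovery target must lie after the point at which the chosen base backup finished, so the first step of a PITR runbook is picking the right backup. A minimal sketch of that choice, assuming you keep a catalogue of backups with their completion times (the BaseBackup type and pick_base_backup function are hypothetical, not part of any backup tool):

```python
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class BaseBackup(NamedTuple):
    label: str
    finished_at: datetime   # when the backup completed


def pick_base_backup(backups: Iterable[BaseBackup],
                     target: datetime) -> Optional[BaseBackup]:
    """Return the newest backup that finished at or before the target time."""
    usable = [b for b in backups if b.finished_at <= target]
    return max(usable, key=lambda b: b.finished_at, default=None)
```

Choosing the newest usable backup keeps the amount of WAL to replay, and therefore the recovery time, as small as possible. A result of None means no backup is old enough for the requested target.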
Patterns for Tuning WAL for Durability
Synchronous vs Asynchronous Commit
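| Mode | Latency impact | Data loss risk |
|---|---|---|
| synchronous_commit = on (default) | Local WAL flush; with a synchronous standby, also a network round trip plus the standby's flush (depends on network) | None for acknowledged commits; surviving the loss of the primary requires a synchronous standby |
| synchronous_commit = off | Lowest; the commit returns before the WAL flush | Commits from the last few hundred milliseconds can be lost in a crash (up to about three times wal_writer_delay), without corrupting the database |
| synchronous_commit = remote_write | Waits until the standby has written the WAL to its operating system, not until it is flushed | Loss only if the primary fails and the standby's operating system crashes before flushing |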
Production teams often adopt a hybrid approach: critical financial writes use synchronous_commit = on, while bulk analytics inserts use off. Because the parameter can be changed at any time, you can set it for a single transaction with SET LOCAL:
BEGIN;
SET LOCAL synchronous_commit TO OFF;
INSERT INTO analytics_events VALUES (...);
COMMIT;
Checkpoint Tuning
Checkpoints flush dirty buffers to disk and write a checkpoint record to WAL. The frequency of checkpoints directly affects write amplification and recovery time.
Key parameters:
| Parameter | Typical value | Effect |
|---|---|---|
| checkpoint_timeout | 5 min (default) | Maximum interval between checkpoints |
| max_wal_size | 2 GB – 4 GB (depends on workload) | Soft limit on WAL volume; approaching it triggers a checkpoint |
| checkpoint_completion_target | 0.9 | Spread checkpoint I/O over the interval |
A production cluster with high write throughput (e.g., logging service) may benefit from a larger max_wal_size and a longer checkpoint_timeout to reduce checkpoint‑induced I/O spikes. However, larger values increase the amount of WAL that must be replayed after a crash.
# Example tuned settings for a 64 vCPU, 256 GB RAM node
checkpoint_timeout = 15min
max_wal_size = 8GB
checkpoint_completion_target = 0.95
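To reason about which limit actually drives checkpoints, a back‑of‑the‑envelope calculation helps. The sketch below is a deliberate simplification: it assumes a steady WAL generation rate and treats max_wal_size as the point at which a checkpoint is forced, whereas PostgreSQL in practice requests the checkpoint somewhat earlier. The function name and return format are illustrative.

```python
def checkpoint_trigger(wal_rate_mb_s: float,
                       max_wal_size_mb: float,
                       checkpoint_timeout_s: float) -> tuple:
    """Return (trigger, seconds): which limit is hit first, and after how long.

    trigger is "wal" when WAL volume forces the checkpoint and "timeout"
    when checkpoint_timeout expires first.
    """
    if wal_rate_mb_s <= 0:
        return ("timeout", checkpoint_timeout_s)   # idle system
    seconds_to_fill = max_wal_size_mb / wal_rate_mb_s
    if seconds_to_fill < checkpoint_timeout_s:
        return ("wal", seconds_to_fill)
    return ("timeout", checkpoint_timeout_s)
```

With the example settings above (max_wal_size = 8GB, checkpoint_timeout = 15min), a sustained 10 MB/s of WAL would fill the budget before the timeout expires, so checkpoints would be WAL‑driven; at 2 MB/s the timeout would fire first. If checkpoints are mostly WAL‑driven, that is a hint that max_wal_size may be too small for the workload.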
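WAL Compression
The wal_compression setting compresses the full‑page images that PostgreSQL writes to WAL, which reduces the volume of WAL stored on disk and sent over the replication stream. The setting has existed since PostgreSQL 9.5 (using pglz); PostgreSQL 15 added lz4 and zstd as compression methods, provided the server was built with support for them.
wal_compression = on
Both the savings and the CPU cost depend on the workload, mainly on how many full‑page images it generates, so measure on your own system before relying on a particular figure. The setting takes effect where WAL is generated, that is, on the primary. A standby replays compressed records regardless of its own setting, but configuring it there as well means it is already in place after a promotion.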
Common Failure Modes and Mitigations
Disk Full / I/O Saturation
When pg_wal runs out of space, the server will shut down to prevent corruption. To avoid surprise outages:
- Monitor pg_wal size with Prometheus metrics (pg_wal_size_bytes).
- Set wal_keep_size conservatively and configure a WAL archive that offloads old segments to cheap object storage (e.g., AWS S3).
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'
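A typical archiving pipeline copies completed segments within seconds of creation, keeping the local pg_wal directory well below the max_wal_size threshold. If archive_command keeps failing, however, PostgreSQL retains the unarchived segments and pg_wal grows, so alert on archiver failures (see the pg_stat_archiver view) as well as on directory size.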
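If you do not have an exporter in place yet, the directory size is simple to measure directly. This is a minimal sketch for a script that runs on the database host with read access to the data directory; the function names and the budget threshold are assumptions for illustration.

```python
import os


def wal_dir_size_bytes(wal_dir: str) -> int:
    """Total size of the regular files directly inside pg_wal."""
    total = 0
    with os.scandir(wal_dir) as entries:
        for entry in entries:
            # Skip subdirectories such as archive_status.
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total


def over_budget(wal_dir: str, budget_bytes: int) -> bool:
    """True when pg_wal has grown past the alerting budget."""
    return wal_dir_size_bytes(wal_dir) > budget_bytes
```

From inside the database, the pg_ls_waldir() function exposes the same file sizes without needing filesystem access.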
Corruption and pg_wal Repair
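Hardware errors can corrupt WAL files. Each WAL record carries a CRC, so replay stops at the first damaged record rather than applying garbage. PostgreSQL provides pg_waldump to inspect WAL records and, as a last resort, pg_resetwal, which discards WAL and resets control information so the server can start, at the risk of leaving the database inconsistent. The recommended mitigation is to:
- Run pg_checksums (PostgreSQL 12+) during maintenance windows to detect page‑level corruption in the data files; it requires data checksums to be enabled and the cluster to be cleanly shut down.
- Maintain a recent base backup so you can re‑initialize the cluster if necessary.
# Verify data page checksums (the server must be stopped cleanly first)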
pg_checksums --check -D /var/lib/postgresql/15/main
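If corruption is isolated to a single segment, do not delete it from pg_wal by hand: PostgreSQL cannot regenerate the contents of a WAL segment, and removing one that is still needed breaks recovery. Instead, restore an intact copy of that segment from the WAL archive if one exists; otherwise restore from the base backup and replay up to the last valid record.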
Network Partitions in Synchronous Replication
A network partition can cause the primary to block indefinitely if all configured synchronous standbys become unreachable. To avoid a full outage:
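- Use quorum‑based synchronous replication (synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'), allowing the primary to continue as long as any two standbys are reachable.
- Set wal_sender_timeout to a reasonable value (e.g., 60 s, the default) so that stalled replication connections are detected and terminated. PostgreSQL does not fall back to asynchronous mode on its own: commits keep waiting until enough synchronous standbys return, or until an operator or failover tool changes synchronous_standby_names and reloads the configuration.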
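The quorum rule can be stated in a few lines. The sketch below models only the ANY form of synchronous_standby_names and is not how PostgreSQL implements it; the function name and arguments are illustrative.

```python
from typing import Iterable, Sequence


def quorum_commit_ok(num_sync: int,
                     candidates: Sequence[str],
                     confirmed: Iterable[str]) -> bool:
    """True when at least num_sync of the listed standbys confirmed the commit.

    Models 'ANY num_sync (candidates...)': confirmations from servers that
    are not in the candidate list do not count toward the quorum.
    """
    confirmed_set = set(confirmed)
    acks = sum(1 for name in candidates if name in confirmed_set)
    return acks >= num_sync
```

With 'ANY 2 (standby1, standby2, standby3)', losing any single standby still leaves a quorum, while a partition that isolates the primary from two of the three leaves commits waiting.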
Key Takeaways
- WAL guarantees durability by persisting every change before data files are modified; LSNs provide a globally ordered timeline used by recovery and replication.
- Tuning max_wal_size, checkpoint_timeout, and checkpoint_completion_target lets you balance I/O spikes against recovery time.
- Synchronous replication eliminates data loss at the cost of latency; hybrid per‑session settings let you prioritize critical transactions.
- Archiving WAL to external storage protects against disk‑full failures and enables point‑in‑time recovery across regions.
- Regularly monitor WAL growth, checksum integrity, and replication lag to catch problems before they cause production outages.
Further Reading
- PostgreSQL Write‑Ahead Logging (WAL) documentation – official reference on WAL internals and configuration.
- High‑Performance PostgreSQL Configuration Guide (2nd Edition) – practical tuning tips from the PostgreSQL performance community.
- Streaming Replication and Failover guide – detailed walkthrough of setting up and managing physical replication.
