TL;DR — PostgreSQL’s Write-Ahead Log (WAL) writes every change to disk before applying it, guaranteeing durability and enabling rapid crash recovery; proper configuration and monitoring keep the system resilient under load.
PostgreSQL powers everything from fintech platforms to social media back‑ends, and its ability to survive power failures or kernel panics is a non‑negotiable requirement for any production deployment. The secret sauce behind that reliability is the Write‑Ahead Log (WAL). In this post we’ll peel back the abstraction layers, look at the on‑disk format, walk through the checkpoint algorithm, and show how you can tune the WAL for both durability and performance in real‑world environments.
How the WAL Works
Write Path Overview
When a client issues an INSERT, UPDATE, or DELETE, PostgreSQL does not immediately modify the data files on disk. Instead, it follows a strict sequence:
- Generate a Log Record – The change is encoded as a WAL record containing the operation type, target page, and the new tuple data.
- Assign an LSN – Each record receives a monotonically increasing Log Sequence Number (LSN). The LSN acts as a global timestamp for the write‑ahead log.
- Append to WAL Buffer – The record is placed in an in‑memory WAL buffer (
XLogCtl->Insert). This buffer is protected by a lightweight spinlock to avoid contention. - Flush to Disk – Before the transaction can be marked
COMMIT, the buffer must be flushed to the WAL file on disk. PostgreSQL callsXLogFlush()which issues anfsync()on the underlying file descriptor. - Apply to Data Files – Once the WAL record is safely on disk, the engine proceeds to modify the heap page in shared buffers and eventually writes the dirty page back to the data file during checkpoint or background writer activity.
Because the WAL is written before the data page, crash recovery can replay the log and reconstruct a consistent database state. The process is described in the official docs: PostgreSQL WAL Overview.
Log Sequence Numbers (LSNs)
An LSN is a 64‑bit integer split into two 32‑bit parts: the segment offset and the record offset within that segment. The format is X/Y where X is the segment number and Y is the byte offset. Example: 0/16B6C0. LSNs serve three purposes:
- Ordering – Every later operation has a larger LSN, guaranteeing a total order.
- Recovery Point – During crash recovery, PostgreSQL scans the WAL until it reaches the last flushed LSN, then stops.
- Replication Position – Streaming replication slots report the LSN that the standby has confirmed, enabling precise lag metrics.
You can view the current LSN with:
SELECT pg_current_wal_lsn();
WAL Segments and Storage Layout
Segment Files
The WAL is stored as a series of fixed‑size segment files, each typically 16 MiB (wal_segment_size). The naming scheme follows 0000000100000000000000XX, where the final two hex digits represent the segment number. PostgreSQL pre‑allocates these files in the pg_wal directory (or pg_xlog on older versions) to avoid allocation latency during heavy write bursts.
Recycling and Checkpointing
WAL segments are recycled rather than deleted. After a checkpoint, any segment whose LSN is older than the checkpoint’s redo point can be reused. The checkpoint algorithm works as follows:
- Determine Checkpoint LSN – The system picks a target LSN based on
checkpoint_timeout,max_wal_size, and the amount of dirty buffers. - Flush Dirty Buffers – All dirty shared buffers up to the checkpoint LSN are flushed to their data files.
- Write a Checkpoint Record – A special WAL record marks the checkpoint, storing the LSN and the timeline ID.
- Advance the Restart LSN – The restart point for recovery is set to the checkpoint record’s LSN, allowing older segments to be recycled.
The checkpoint aggressiveness is controlled by checkpoint_timeout, max_wal_size, and checkpoint_completion_target. A typical production setting might look like:
# postgresql.conf
wal_level = replica # Minimal WAL needed for streaming replication
max_wal_size = 2GB # Upper bound before a forced checkpoint
min_wal_size = 80MB # Lower bound to avoid excessive recycling
checkpoint_timeout = 15min # Time between automatic checkpoints
checkpoint_completion_target = 0.9 # Spread checkpoint I/O over 90% of the interval
Architecture: WAL in a Distributed Setup
Streaming Replication
In physical streaming replication, the primary streams raw WAL bytes to one or more standbys over a TCP connection. The standby writes the received WAL to its own pg_wal directory and replays it asynchronously. The replication delay can be observed via:
SELECT pg_last_wal_replay_lsn() - pg_last_wal_receive_lsn() AS replication_lag;
Key points:
- Synchronous Replication – Setting
synchronous_commit = onforces the primary to wait until the standby confirms receipt of the WAL record, eliminating data loss at the cost of latency. - Hot Standby – Standby servers can serve read‑only queries while continuously applying WAL, providing both HA and scaling.
Logical Replication
Logical replication decouples the physical WAL format by extracting changes at the row level and sending them as logical messages. It relies on the same underlying WAL but uses a different decoding plugin (pgoutput by default). This enables use cases like:
- Selective Table Replication – Replicate only a subset of tables.
- Cross‑Version Upgrades – Stream changes from an older PostgreSQL version to a newer one without binary compatibility concerns.
Patterns in Production
Tuning checkpoint_timeout and max_wal_size
A common performance pitfall is setting checkpoint_timeout too low, causing frequent checkpoints that flood the I/O subsystem. Conversely, an excessively high max_wal_size may let the WAL grow unchecked, consuming disk space. A pragmatic approach:
- Measure Write Throughput – Use
pg_stat_bgwriterto see how many buffers are written per second. - Set
checkpoint_timeout– Aim for a checkpoint interval that matches the typical burst length of your workload (e.g., 10–15 minutes for batch‑heavy pipelines). - Adjust
max_wal_size– Ensure the WAL can accommodate the total amount of dirty data generated between checkpoints. A rule of thumb:max_wal_size ≈ write_rate × checkpoint_timeout.
Monitoring WAL Lag
In streaming replication, lag is a leading indicator of network congestion or I/O throttling. Prometheus exporters like postgres_exporter expose metrics such as pg_replication_lag_bytes. Alert on thresholds that exceed your SLA (e.g., > 5 seconds of lag).
# Example Prometheus rule
- alert: PostgresWALLagTooHigh
expr: pg_replication_lag_bytes > 5000000
for: 5m
labels:
severity: warning
annotations:
summary: "WAL replication lag exceeds 5 MiB"
description: "Check network health and standby disk I/O."
Handling Disk Saturation
When the WAL volume fills up, PostgreSQL will pause writes and eventually abort transactions with ERROR: could not write to WAL. Prevent this by:
- Separate WAL Disk – Mount
pg_walon a dedicated SSD with low latency. - Enable
wal_compression– Reduces WAL size at the cost of CPU. (wal_compression = on) - Configure
archive_command– Offload completed segments to cheap object storage (e.g., S3) to keep local disk usage low.
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'
wal_compression = on
Key Takeaways
- The WAL guarantees durability by persisting every change before it touches the data files; the LSN provides a global ordering useful for recovery and replication.
- Checkpoint tuning (
checkpoint_timeout,max_wal_size,checkpoint_completion_target) balances I/O pressure against disk consumption. - In physical streaming replication, synchronous commit eliminates data loss but adds latency; logical replication offers flexibility for selective data sync.
- Monitoring WAL lag and disk health is essential for high‑availability clusters; use Prometheus alerts and separate WAL storage to avoid bottlenecks.
- Archiving and optional compression help keep the on‑disk WAL footprint manageable while preserving the ability to perform point‑in‑time recovery.