Deep Dive into the Postgres Write-Ahead Log: Ensuring Data Durability and Crash Recovery

TL;DR — PostgreSQL’s Write-Ahead Log (WAL) writes every change to disk before applying it, guaranteeing durability and enabling rapid crash recovery; proper configuration and monitoring keep the system resilient under load.

PostgreSQL powers everything from fintech platforms to social media back‑ends, and its ability to survive power failures or kernel panics is a non‑negotiable requirement for any production deployment. The secret sauce behind that reliability is the Write‑Ahead Log (WAL). In this post we’ll peel back the abstraction layers, look at the on‑disk format, walk through the checkpoint algorithm, and show how you can tune the WAL for both durability and performance in real‑world environments.

How the WAL Works

Write Path Overview

When a client issues an INSERT, UPDATE, or DELETE, PostgreSQL does not immediately modify the data files on disk. Instead, it follows a strict sequence:

Generate a Log Record – The change is encoded as a WAL record containing the operation type, target page, and the new tuple data.
Assign an LSN – Each record receives a monotonically increasing Log Sequence Number (LSN). The LSN acts as a global timestamp for the write‑ahead log.
Append to WAL Buffer – The record is placed in an in‑memory WAL buffer (XLogCtl->Insert). This buffer is protected by a lightweight spinlock to avoid contention.
Flush to Disk – Before the transaction can be marked COMMIT, the buffer must be flushed to the WAL file on disk. PostgreSQL calls XLogFlush() which issues an fsync() on the underlying file descriptor.
Apply to Data Files – Once the WAL record is safely on disk, the engine proceeds to modify the heap page in shared buffers and eventually writes the dirty page back to the data file during checkpoint or background writer activity.

Because the WAL is written before the data page, crash recovery can replay the log and reconstruct a consistent database state. The process is described in the official docs: PostgreSQL WAL Overview.

Log Sequence Numbers (LSNs)

An LSN is a 64‑bit integer split into two 32‑bit parts: the segment offset and the record offset within that segment. The format is X/Y where X is the segment number and Y is the byte offset. Example: 0/16B6C0. LSNs serve three purposes:

Ordering – Every later operation has a larger LSN, guaranteeing a total order.
Recovery Point – During crash recovery, PostgreSQL scans the WAL until it reaches the last flushed LSN, then stops.
Replication Position – Streaming replication slots report the LSN that the standby has confirmed, enabling precise lag metrics.

You can view the current LSN with:

SELECT pg_current_wal_lsn();

WAL Segments and Storage Layout

Segment Files

The WAL is stored as a series of fixed‑size segment files, each typically 16 MiB (wal_segment_size). The naming scheme follows 0000000100000000000000XX, where the final two hex digits represent the segment number. PostgreSQL pre‑allocates these files in the pg_wal directory (or pg_xlog on older versions) to avoid allocation latency during heavy write bursts.

Recycling and Checkpointing

WAL segments are recycled rather than deleted. After a checkpoint, any segment whose LSN is older than the checkpoint’s redo point can be reused. The checkpoint algorithm works as follows:

Determine Checkpoint LSN – The system picks a target LSN based on checkpoint_timeout, max_wal_size, and the amount of dirty buffers.
Flush Dirty Buffers – All dirty shared buffers up to the checkpoint LSN are flushed to their data files.
Write a Checkpoint Record – A special WAL record marks the checkpoint, storing the LSN and the timeline ID.
Advance the Restart LSN – The restart point for recovery is set to the checkpoint record’s LSN, allowing older segments to be recycled.

The checkpoint aggressiveness is controlled by checkpoint_timeout, max_wal_size, and checkpoint_completion_target. A typical production setting might look like:

# postgresql.conf
wal_level = replica                # Minimal WAL needed for streaming replication
max_wal_size = 2GB                 # Upper bound before a forced checkpoint
min_wal_size = 80MB                # Lower bound to avoid excessive recycling
checkpoint_timeout = 15min        # Time between automatic checkpoints
checkpoint_completion_target = 0.9 # Spread checkpoint I/O over 90% of the interval

Architecture: WAL in a Distributed Setup

Streaming Replication

In physical streaming replication, the primary streams raw WAL bytes to one or more standbys over a TCP connection. The standby writes the received WAL to its own pg_wal directory and replays it asynchronously. The replication delay can be observed via:

SELECT pg_last_wal_replay_lsn() - pg_last_wal_receive_lsn() AS replication_lag;

Key points:

Synchronous Replication – Setting synchronous_commit = on forces the primary to wait until the standby confirms receipt of the WAL record, eliminating data loss at the cost of latency.
Hot Standby – Standby servers can serve read‑only queries while continuously applying WAL, providing both HA and scaling.

Logical Replication

Logical replication decouples the physical WAL format by extracting changes at the row level and sending them as logical messages. It relies on the same underlying WAL but uses a different decoding plugin (pgoutput by default). This enables use cases like:

Selective Table Replication – Replicate only a subset of tables.
Cross‑Version Upgrades – Stream changes from an older PostgreSQL version to a newer one without binary compatibility concerns.

Patterns in Production

Tuning `checkpoint_timeout` and `max_wal_size`

A common performance pitfall is setting checkpoint_timeout too low, causing frequent checkpoints that flood the I/O subsystem. Conversely, an excessively high max_wal_size may let the WAL grow unchecked, consuming disk space. A pragmatic approach:

Measure Write Throughput – Use pg_stat_bgwriter to see how many buffers are written per second.
Set checkpoint_timeout – Aim for a checkpoint interval that matches the typical burst length of your workload (e.g., 10–15 minutes for batch‑heavy pipelines).
Adjust max_wal_size – Ensure the WAL can accommodate the total amount of dirty data generated between checkpoints. A rule of thumb: max_wal_size ≈ write_rate × checkpoint_timeout.

Monitoring WAL Lag

In streaming replication, lag is a leading indicator of network congestion or I/O throttling. Prometheus exporters like postgres_exporter expose metrics such as pg_replication_lag_bytes. Alert on thresholds that exceed your SLA (e.g., > 5 seconds of lag).

# Example Prometheus rule
- alert: PostgresWALLagTooHigh
  expr: pg_replication_lag_bytes > 5000000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "WAL replication lag exceeds 5 MiB"
    description: "Check network health and standby disk I/O."

Handling Disk Saturation

When the WAL volume fills up, PostgreSQL will pause writes and eventually abort transactions with ERROR: could not write to WAL. Prevent this by:

Separate WAL Disk – Mount pg_wal on a dedicated SSD with low latency.
Enable wal_compression – Reduces WAL size at the cost of CPU. (wal_compression = on)
Configure archive_command – Offload completed segments to cheap object storage (e.g., S3) to keep local disk usage low.

archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'
wal_compression = on

Key Takeaways

The WAL guarantees durability by persisting every change before it touches the data files; the LSN provides a global ordering useful for recovery and replication.
Checkpoint tuning (checkpoint_timeout, max_wal_size, checkpoint_completion_target) balances I/O pressure against disk consumption.
In physical streaming replication, synchronous commit eliminates data loss but adds latency; logical replication offers flexibility for selective data sync.
Monitoring WAL lag and disk health is essential for high‑availability clusters; use Prometheus alerts and separate WAL storage to avoid bottlenecks.
Archiving and optional compression help keep the on‑disk WAL footprint manageable while preserving the ability to perform point‑in‑time recovery.

How the WAL Works#

Write Path Overview#

Log Sequence Numbers (LSNs)#

WAL Segments and Storage Layout#

Segment Files#

Recycling and Checkpointing#

Architecture: WAL in a Distributed Setup#

Streaming Replication#

Logical Replication#

Patterns in Production#

Tuning checkpoint_timeout and max_wal_size#

Monitoring WAL Lag#

Handling Disk Saturation#

Key Takeaways#

Further Reading#