TL;DR — PostgreSQL’s Write-Ahead Log (WAL) is the backbone of durability; understanding its sequence numbers, checkpoint cycle, and flush policies lets you tune for low latency and zero data loss even under heavy load.
In modern micro‑service back‑ends, PostgreSQL often sits at the heart of transaction processing, analytics pipelines, and event sourcing layers. When a single node goes down or a power surge hits a data center, the system’s ability to recover without missing committed rows hinges on the Write‑Ahead Log. This post unpacks the WAL’s architecture, walks through the critical configuration knobs, and shows you how to monitor and troubleshoot it in production‑grade deployments.
What the WAL Is and Why It Matters
PostgreSQL follows the classic write‑ahead principle: before any data page is modified on disk, a description of that change is appended to a sequential log. The WAL guarantees two essential properties:
- Durability – Once a transaction is reported as committed, its redo record is safely persisted, so a crash cannot roll back that work.
- Atomicity & Consistency – Crash recovery replays only complete transactions, preserving database invariants.
Because the WAL is sequential, it sidesteps the random‑I/O penalties of updating many scattered data pages. This design is why PostgreSQL can sustain tens of thousands of TPS on commodity SSDs while still offering full ACID guarantees.
Log Sequence Numbers (LSNs)
Every WAL record is identified by a Log Sequence Number (LSN), a 64‑bit integer that monotonically increases. LSNs serve three purposes:
| Role | Explanation |
|---|---|
| Position marker | Points to the exact byte offset in the WAL file series (000000010000000000000001, etc.). |
| Replication anchor | Streaming replication slots expose the last flushed LSN to replicas, ensuring they never fall behind. |
| Checkpoint boundary | Checkpoints record the redo LSN up to which the database is guaranteed to be consistent on disk. |
You can query the current LSN with SELECT pg_current_wal_lsn(); or, in older versions, pg_current_xlog_location().
How WAL Works Under the Hood
WAL Buffers and Flushing
PostgreSQL maintains an in‑memory WAL buffer (default 16 MiB). When a transaction modifies a page, the corresponding redo entry is written to this buffer. The buffer is flushed to the pg_wal directory under two conditions:
- Transaction commit –
wal_sync_methoddetermines whether the flush isfsync,fdatasync, or a direct I/O variant. - WAL buffer full – A background writer forces a flush to avoid buffer overflow.
The flush strategy is configurable:
# postgresql.conf excerpt
wal_sync_method = fdatasync # can be open_sync, fsync, fdatasync, or dsm
wal_buffers = 16MB # size of the WAL buffer
Choosing fdatasync on Linux typically yields the best latency‑to‑durability ratio because it avoids flushing metadata that isn’t needed for recovery.
Checkpointing
A checkpoint is the moment the system guarantees that all dirty data pages up to a certain LSN have been written to the data files (base/). The checkpoint process consists of:
- Write a checkpoint record to the WAL (contains the redo LSN).
- Background writer flushes dirty buffers whose LSN ≤ checkpoint LSN.
- Update
pg_controlwith the checkpoint LSN.
The frequency of checkpoints is controlled by two parameters:
# postgresql.conf excerpt
checkpoint_timeout = 5min # max time between checkpoints
max_wal_size = 2GB # stop checkpointing if WAL < max_wal_size
min_wal_size = 80MB # keep at least this much WAL around
A too‑aggressive checkpoint schedule (e.g., checkpoint_timeout = 30s) can cause checkpoint spikes, where the background writer suddenly writes gigabytes of dirty pages, leading to latency spikes in user transactions. Conversely, a lax schedule may let the WAL grow unchecked, consuming disk space and increasing recovery time.
WAL Segments and Recycling
WAL files are fixed‑size segments (default 16 MiB). PostgreSQL pre‑allocates a pool of segments based on min_wal_size and max_wal_size. When a segment is no longer needed for recovery or replication, it is recycled rather than deleted, reducing filesystem churn.
In a high‑throughput environment, you may see dozens of segments being recycled per minute. Monitoring pg_stat_bgwriter gives you a quick view:
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint,
buffers_clean, maxwritten_clean FROM pg_stat_bgwriter;
Architecture Patterns for High‑Throughput Ingestion
When building an event‑sourced service or a real‑time analytics pipeline, you often need to write more WAL than the default configuration anticipates. Two proven patterns help keep latency low and durability high:
1. Batch Commit with synchronous_commit = off (for non‑critical paths)
If your workload can tolerate a brief window where a committed transaction might be lost on crash (e.g., logging or telemetry), disabling synchronous commit reduces the number of forced fsyncs:
# postgresql.conf
synchronous_commit = off
Production teams typically toggle this setting only for specific tables or sessions using SET LOCAL synchronous_commit = off;.
2. Logical Replication + WAL Archiving for Auditing
Streaming logical replication streams WAL changes as a series of INSERT, UPDATE, and DELETE operations. Coupled with a WAL archiver (archive_command), you obtain both a real‑time feed and a durable backup:
# /etc/postgresql/16/main/postgresql.conf
wal_level = logical
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/archive/%f && cp %p /var/lib/postgresql/archive/%f'
A downstream consumer (e.g., an Apache Kafka Connect source) can then ingest the logical stream, while the archive guarantees point‑in‑time recovery (PITR).
Configuring WAL for Production Durability
Below is a checklist that production engineers use when hardening a PostgreSQL cluster for maximum durability:
| Setting | Recommended Value (SSD) | Reason |
|---|---|---|
wal_level | replica (or logical if using logical replication) | Minimal information needed for crash recovery and streaming. |
wal_sync_method | fdatasync | Avoids unnecessary metadata flushes. |
wal_buffers | 64MB (or 1% of RAM, capped at 1 GB) | Larger buffers reduce flush frequency under bursty writes. |
commit_delay | 0 (or a few microseconds) | Introduces a tiny delay to batch fsyncs; useful on high‑concurrency systems. |
synchronous_standby_names | <standby> (if you have a synchronous replica) | Guarantees that a commit is acknowledged by at least one replica. |
full_page_writes | on (default) | Ensures that the first page modified after a checkpoint is fully written, protecting against torn pages. |
max_wal_size | 4GB (or higher depending on write volume) | Prevents frequent checkpoints during traffic spikes. |
checkpoint_completion_target | 0.9 | Spreads checkpoint I/O over a longer interval, smoothing latency. |
Applying the Settings with ALTER SYSTEM
Instead of editing postgresql.conf manually, you can use ALTER SYSTEM to make changes that survive restarts:
ALTER SYSTEM SET wal_buffers = '64MB';
ALTER SYSTEM SET checkpoint_timeout = '10min';
ALTER SYSTEM SET max_wal_size = '4GB';
SELECT pg_reload_conf(); -- reloads without a full restart
Tip: After any change, always verify the effective value with SHOW <parameter>; and check pg_settings for the source (conf_file, override, etc.).
Monitoring and Troubleshooting WAL
A production environment should have visibility into WAL generation, checkpoint latency, and replication lag. Here are the core views and metrics you need.
pg_stat_wal
SELECT wal_records, wal_fpi, wal_bytes,
wal_buffers_full, wal_write, wal_sync,
wal_write_time, wal_sync_time
FROM pg_stat_wal;
wal_write_time/wal_sync_timeexpose how much time is spent flushing the WAL. Spikes often indicate storage saturation.wal_buffers_fullcounts how many times the in‑memory buffer overflowed and forced an immediate flush.
pg_stat_replication (for streaming replicas)
SELECT pid, client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
pg_last_wal_replay_lsn() AS replay_lsn_server
FROM pg_stat_replication;
Compare sent_lsn (what the primary has sent) with replay_lsn (what the replica has applied) to gauge replication lag. A lag larger than max_replication_slots can cause the primary to retain WAL indefinitely, filling the disk.
Alerting on WAL Growth
Prometheus exporters expose pg_wal_size_bytes. A simple alert rule:
- alert: PostgresWALDiskUsageHigh
expr: pg_wal_size_bytes > 0.9 * pg_wal_disk_capacity_bytes
for: 5m
labels:
severity: critical
annotations:
summary: "WAL directory > 90% capacity"
description: "PostgreSQL WAL is consuming {{ $value | humanize1024 }} of {{ $labels.pg_wal_disk_capacity_bytes | humanize1024 }}. Check checkpoint settings."
Real‑World Failure Mode: Lost WAL Segment
If the operating system unexpectedly deletes a WAL file (e.g., due to a mis‑configured cleanup script), PostgreSQL will refuse to start, emitting:
ERROR: could not open file "pg_wal/0000000100000000000000AB": No such file or directory
Recovery steps:
- Restore the missing segment from a backup or replica archive.
- Re‑initialize the missing part with
pg_resetwal -l <lsn>only as a last resort (this breaks consistency).
Always keep at least one off‑site copy of the most recent WAL archive to avoid this scenario.
Key Takeaways
- The WAL is the single source of truth for durability; every committed transaction must have its redo record safely flushed.
- LSNs, checkpoints, and WAL segment recycling together define the recovery window and disk footprint.
- Production‑grade tuning revolves around
wal_buffers,checkpoint_timeout,max_wal_size, andwal_sync_method. - Use synchronous replication (
synchronous_standby_names) when zero‑data‑loss is a hard requirement; otherwise, batch commits or logical replication can improve throughput. - Continuous monitoring of
pg_stat_wal, replication lag, and WAL disk usage prevents silent growth and unexpected outages.
Further Reading
- PostgreSQL Documentation – Write‑Ahead Logging (WAL)
- Streaming Replication and Logical Decoding
- High‑Performance PostgreSQL Configuration Guide (PGTune)