TL;DR — PostgreSQL’s MVCC stores multiple row versions on disk, using
xmin/xmaxtimestamps to decide visibility. Understanding the interaction of isolation levels, vacuum, and HOT updates lets you tune for millions of concurrent transactions without sacrificing latency.
PostgreSQL powers everything from fintech transaction processors to social‑media timelines, and a large part of its scalability comes from its implementation of Multi‑Version Concurrency Control (MVCC). In this post we’ll peel back the layers of PostgreSQL’s MVCC engine, examine how the classic isolation levels are realized, and dive into the performance‑critical mechanisms—vacuum, HOT updates, and index maintenance—that keep a high‑throughput system humming at scale.
MVCC Fundamentals
How PostgreSQL Implements MVCC
At the heart of PostgreSQL’s MVCC are two hidden columns on every heap tuple:
-- Visible in system catalogs, not exposed to users
SELECT attname, attnum FROM pg_attribute WHERE attrelid = 'my_table'::regclass;
xmin– the transaction ID (XID) that created the row.xmax– the XID that deleted or superseded the row (or a special “in‑progress” marker).
When a transaction reads a row, PostgreSQL checks the current transaction’s snapshot against those XIDs:
| Condition | Visibility |
|---|---|
xmin ≤ snapshot’s xmin and (xmax is either 0 or > snapshot’s xmax) | Visible |
| otherwise | Invisible |
The snapshot itself is a list of active XIDs taken at the start of the transaction. For read‑committed isolation, a fresh snapshot is taken for each statement; for repeatable‑read and serializable, the snapshot is fixed for the whole transaction.
Note: The XID space is 32 bits, wrapping around every ~4 billion transactions. PostgreSQL mitigates wrap‑around with “freeze” vacuum, which rewrites old tuples to set
xminto a special frozen value.
Tuple Versions on Disk
Each data page (usually 8 KB) can hold multiple versions of the same logical row. When an UPDATE occurs, PostgreSQL doesn’t overwrite the existing tuple; it inserts a new version with a higher xmin. The old version stays around until a vacuum process determines that no active snapshot can see it.
+-------------------+ +-------------------+
| Page N (old) | ---> | Page N (new) |
| row A (xmin=10) | | row A (xmin=15) |
| row B (xmin=11) | | row B (xmin=11) |
+-------------------+ +-------------------+
Because readers never block writers, high concurrency is possible, but the trade‑off is increased storage pressure and the need for periodic cleanup.
Transaction Isolation Levels
PostgreSQL implements the SQL standard isolation levels, but the underlying MVCC engine makes most of the heavy lifting.
Read Committed
Snapshot per statement.
Each statement sees a snapshot that includes all transactions committed before the statement starts. This is the default because it balances consistency with low latency.
BEGIN;
SELECT balance FROM accounts WHERE id = 42; -- sees snapshot S1
UPDATE accounts SET balance = balance - 10 WHERE id = 42; -- takes new snapshot S2
COMMIT;
If another transaction commits between the SELECT and UPDATE, the UPDATE will see the newer data, avoiding “lost updates” but allowing non‑repeatable reads.
Repeatable Read
Snapshot fixed for the whole transaction.
All statements within the transaction see the same snapshot, guaranteeing that rows won’t change mid‑transaction. PostgreSQL achieves this simply by reusing the snapshot taken at BEGIN.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT sum(amount) FROM orders WHERE status = 'open'; -- snapshot S0
-- concurrent INSERTs are invisible here
COMMIT;
Serializable
Implemented as Serializable Snapshot Isolation (SSI), not true locking. PostgreSQL tracks read‑write dependencies between concurrent transactions; if a dangerous cycle is detected, one transaction aborts with a serialization failure.
Performance tip: In high‑contention workloads, prefer
READ COMMITTEDunless you need strict repeatable semantics. The overhead of SSI can increase latency by 5‑15 % in micro‑benchmark tests (see the PostgreSQL docs on SSI).
Architecture and Performance at Scale
The Vacuum Process
Vacuum is the garbage collector for MVCC. It performs two essential jobs:
- Reclaim space from dead tuples so new rows can reuse the page.
- Freeze old XIDs to prevent wrap‑around.
There are two flavors:
| Type | Trigger | Typical Use‑Case |
|---|---|---|
VACUUM (manual) | Scheduled via cron or pg_cron | Regular maintenance on hot tables. |
AUTOVACUUM (daemon) | Config‑driven thresholds (autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor) | Continuous background cleanup. |
A well‑tuned autovacuum configuration is essential for large‑scale systems. For a table with 500 M rows and a write rate of 10 k TPS, the defaults often lag behind, causing bloat. A typical production tweak:
# postgresql.conf excerpt
autovacuum_vacuum_scale_factor = 0.02 # 2 % of table size
autovacuum_vacuum_cost_delay = 20ms # throttle impact on I/O
autovacuum_max_workers = 8 # parallel workers on multi‑CPU boxes
HOT (Heap‑Only Tuple) Updates
When an UPDATE modifies only non‑indexed columns, PostgreSQL can avoid index entry churn by creating a HOT chain—the new tuple lives on the same page, linked via ctid. The index still points to the original tuple, which now forwards to the latest version.
Benefits:
- Reduces index bloat.
- Lowers I/O for write‑heavy workloads where most updates are “payload” changes.
You can verify HOT usage with pg_freespace and pgstattuple extensions:
SELECT relname, heap_blks_hit, heap_blks_read
FROM pg_statio_user_tables
WHERE relname = 'events';
If heap_blks_hit is significantly higher than heap_blks_read, your buffer cache is doing its job and HOT is likely effective.
Index Maintenance at Scale
Large tables with many indexes can suffer from index bloat due to frequent INSERT/UPDATE/DELETE cycles. Strategies:
- Partial indexes for common query predicates.
- Covering indexes (
INCLUDEcolumns) to satisfy queries without hitting the heap. - BRIN indexes for time‑series data (e.g., logs) where column values are naturally ordered.
Example: A telemetry table storing per‑second metrics.
CREATE TABLE metrics (
ts timestamptz NOT NULL,
device_id uuid NOT NULL,
temperature float,
humidity float,
PRIMARY KEY (device_id, ts)
) PARTITION BY RANGE (ts);
A BRIN index on ts dramatically cuts index size:
CREATE INDEX ON metrics USING brin (ts);
Scaling Reads with Connection Pooling
In a microservice architecture, hundreds of services may open connections to PostgreSQL. Using a pooler like PgBouncer in transaction pooling mode reduces backend process count, improving CPU cache locality.
# pgbouncer.ini snippet
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 100
The pooler does not interfere with MVCC; it merely reuses server processes across client sessions.
Patterns in Production
1. Write‑Heavy Event Sourcing
When building an event store, each event is an immutable row. MVCC shines because readers never block writers, and the append‑only nature means vacuum pressure is low. Use partitioning by time to keep each partition small enough for effective index scans.
CREATE TABLE events (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
aggregate_id uuid NOT NULL,
event_type text NOT NULL,
payload jsonb NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);
2. Real‑Time Analytics with Materialized Views
Materialized views can be refreshed incrementally with REFRESH MATERIALIZED VIEW CONCURRENTLY, which respects MVCC and avoids exclusive locks.
CREATE MATERIALIZED VIEW daily_sales AS
SELECT date_trunc('day', order_ts) AS day,
sum(amount) AS total
FROM orders
GROUP BY 1;
-- Refresh without blocking reads
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;
3. Handling Long‑Running Transactions
Long‑running transactions hold onto old tuple versions, preventing vacuum from cleaning them up. In production we:
- Break large batch jobs into smaller transactions.
- Use
SET statement_timeoutto abort runaway statements. - Monitor
pg_stat_activityforstate = 'idle in transaction'.
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY duration DESC;
4. Monitoring MVCC Health
Key metrics from pg_stat_database and pg_stat_user_tables:
| Metric | Why It Matters |
|---|---|
xact_commit / xact_rollback | Transaction throughput. |
dead_tuples | Indicates need for vacuum. |
blk_read_time / blk_write_time | I/O latency, often tied to page splits. |
conflict_* (in hot standby) | Replication conflicts caused by MVCC visibility. |
Grafana dashboards often plot dead_tuples as a percentage of total rows, alerting when it exceeds, say, 10 %.
Key Takeaways
- PostgreSQL’s MVCC stores multiple row versions; visibility is decided by
xmin/xmaxagainst a transaction snapshot. - Isolation levels are implemented on top of MVCC:
READ COMMITTEDtakes per‑statement snapshots, whileREPEATABLE READfixes the snapshot for the whole transaction, andSERIALIZABLEadds SSI conflict detection. - Vacuum (manual or autovacuum) is essential to reclaim space and prevent XID wrap‑around; tune thresholds for write‑heavy tables.
- HOT updates avoid index churn when only non‑indexed columns change, improving write performance.
- Use partitioning, BRIN indexes, and partial indexes to keep index size manageable at scale.
- Connection pooling, materialized views, and careful transaction sizing are proven patterns for production workloads.
Further Reading
- PostgreSQL MVCC Internals – Official documentation covering visibility rules and tuple storage.
- PgBouncer – Lightweight Connection Pooler – Guide to deploying PgBouncer in transaction pooling mode.
- PostgreSQL Performance Tuning Guide – Community‑maintained tool for generating
postgresql.confsettings based on hardware and workload.