TL;DR — LSM trees achieve superior write performance by converting random writes into sequential appends and amortizing compaction costs, whereas B‑trees incur costly random I/O for each write. The trade‑off is more complex reads and background compaction, but for write‑heavy workloads the benefits dominate.
Write‑intensive applications—time‑series ingestion, logging pipelines, and high‑throughput key‑value stores—often struggle with traditional B‑tree indexes. While B‑trees have been the workhorse of relational databases for decades, the Log‑Structured Merge (LSM) tree architecture, popularized by systems like LevelDB, RocksDB, and Apache Cassandra, flips the performance equation on its head. This article dives deep into the structural differences, the write pipelines, and the real‑world consequences that make LSM trees the go‑to choice for write‑heavy workloads.
The Fundamentals of B‑Trees
Structure and Invariants
A B‑tree is a balanced multi‑way search tree where each node contains a sorted array of keys and child pointers. The key invariants are:
- All leaf nodes reside at the same depth.
- Each internal node (except the root) has between ⌈m/2⌉ and m children, and therefore between ⌈m/2⌉ − 1 and m − 1 keys, where m is the order.
- Keys within a node are kept sorted, enabling binary search inside the node.
Because nodes are sized to match a disk page (often 4 KB–16 KB), a single node read/write corresponds to one I/O operation. This design minimizes the number of disk seeks during read queries.
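To make the invariants concrete, here is a minimal sketch of a node and the descend-with-binary-search step described above. The `BTreeNode` class and `traverse_to_leaf` helper are illustrative names (reused by the insert pseudocode below), not taken from any particular implementation:

```python
import bisect

class BTreeNode:
    def __init__(self, is_leaf=True):
        self.keys = []       # sorted list of (key, value) pairs
        self.children = []   # child pointers; empty for leaf nodes
        self.is_leaf = is_leaf

def traverse_to_leaf(node, key):
    # Binary-search within each node, then descend into the matching child.
    while not node.is_leaf:
        idx = bisect.bisect_left([k for k, _ in node.keys], key)
        node = node.children[idx]
    return node
```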
Write Path in a B‑Tree
When inserting a new key/value pair, the algorithm:
- Traverses from the root to the appropriate leaf (O(logₘ N) node reads).
- Inserts the key into the leaf’s sorted array.
- If the leaf overflows (exceeds m − 1 keys), it splits, propagating a split key up the tree—potentially causing a cascade of splits up to the root.
Each split requires a random write to disk because the affected page may be anywhere on the storage medium. In SSDs, this manifests as extra erase‑program cycles; on spinning disks, it translates to costly seek latency.
Example: B‑Tree Insert Pseudocode (Python)
```python
import bisect

MAX_KEYS = 255  # per-node capacity, chosen so a full node fits one disk page

def btree_insert(root, key, value):
    leaf = traverse_to_leaf(root, key)      # O(log_m N) node reads
    bisect.insort(leaf.keys, (key, value))  # keep the leaf's array sorted
    if len(leaf.keys) > MAX_KEYS:
        split_node(leaf)                    # may cascade splits up to the root
```
The split_node function writes new pages to random locations, making the write path inherently write‑amplified.
The Anatomy of LSM Trees
Core Idea
An LSM tree stores data in a series of sorted runs that are written sequentially. New writes first land in an in‑memory structure (often a memtable, typically a skip list or balanced tree). When the memtable fills, it is flushed to disk as an immutable SSTable (Sorted String Table). Over time, these SSTables are merged (compacted) into larger runs, preserving order.
The design mirrors a log‑structured storage system: writes become appends, eliminating random writes at the cost of deferred background work.
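A minimal sketch of that write path in Python (the paths, the tiny flush threshold, and the dict-as-memtable stand-in are all illustrative; real engines use a skip list plus a write-ahead log):

```python
import os
import time

MEMTABLE_MAX_ENTRIES = 4  # tiny threshold, for illustration only

class LSMWriter:
    def __init__(self, base="/tmp/lsm/L0"):
        self.base = base
        os.makedirs(base, exist_ok=True)
        self.memtable = {}  # latest value per key; stand-in for a skip list

    def put(self, key, value):
        self.memtable[key] = value  # in-memory only: no disk I/O yet
        if len(self.memtable) >= MEMTABLE_MAX_ENTRIES:
            self.flush()

    def flush(self):
        # Dump keys in sorted order as an immutable SSTable (one sequential write).
        path = os.path.join(self.base, f"{time.time_ns()}.sst")
        with open(path, "w") as f:
            for k in sorted(self.memtable):
                f.write(f"{k}\t{self.memtable[k]}\n")
        self.memtable = {}  # start a fresh memtable
```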
Levels and Compaction
Most production LSM implementations organize SSTables into levels (L0, L1, L2, …). Each level has a target size roughly T times that of the level above it (where T is a size ratio, commonly 10), so level i holds on the order of Tⁱ times the base capacity. When the total size of a level exceeds its target, SSTables are selected for compaction:
- Merge overlapping SSTables from level i with those from level i+1.
- Write the merged, deduplicated output as new SSTables in level i+1.
- Delete the original files.
Compaction is a sequential read‑write operation, making it efficient on both HDDs and SSDs.
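Conceptually, a compaction step is a k-way merge of sorted runs with deduplication. Here is a sketch using Python's `heapq.merge`; the in-memory lists and the "newest run wins" tie-break are simplifying assumptions (real engines stream the runs from disk, but the merge logic is the same):

```python
import heapq

def compact(runs):
    """Merge sorted runs of (key, value) pairs; `runs` is ordered oldest to newest."""
    # Tag entries so that, for equal keys, entries from newer runs sort first.
    tagged = [
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(runs)
    ]
    merged, last_key = [], object()  # object() can never equal a real key
    for key, _neg_age, value in heapq.merge(*tagged):
        if key != last_key:          # keep only the newest version of each key
            merged.append((key, value))
            last_key = key
    return merged

# Example: the newer run's value for "b" shadows the older one's.
print(compact([[("a", 1), ("b", 2)], [("b", 9), ("c", 3)]]))
# [('a', 1), ('b', 9), ('c', 3)]
```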
Example: LSM Write Pipeline (Bash)
```bash
MEMTABLE_MAX=${MEMTABLE_MAX:-67108864}  # flush threshold: 64 MB

# Insert a record into the memtable (in-process)
echo "key,value" >> /tmp/memtable.log

# When the memtable reaches the threshold, flush it to disk as an L0 SSTable
if [ "$(stat -c%s /tmp/memtable.log)" -gt "$MEMTABLE_MAX" ]; then
    mv /tmp/memtable.log "/data/lsm/sstables/L0/$(date +%s).sst"
    # Trigger background compaction (illustrative service name)
    systemctl start lsm-compactor.service
fi
```
All flushes are append‑only, turning random writes into sequential file creations.
Write‑Heavy Workloads: Why LSM Wins
1. Sequential Appends vs. Random Writes
The primary performance win comes from converting each write into a sequential append—first to the memtable and later to an SSTable. Storage devices handle sequential I/O roughly an order of magnitude faster than random I/O on SSDs, and far more than that on magnetic disks, where seek time dominates.
| Metric | B‑Tree (random) | LSM (sequential) |
|---|---|---|
| Avg. write latency | 4–8 ms (SSD) / 10–15 ms (HDD) | 0.3–0.7 ms (SSD) / 1–2 ms (HDD) |
| Write amplification | 1.2–1.5× | 2–4× (depends on compaction) |
Source: Performance analysis in the RocksDB paper (RocksDB docs).
2. Amortized Compaction Cost
Compaction introduces write amplification, but it is amortized over many writes. For a workload inserting 1 M keys, a B‑tree may trigger thousands of page splits, each incurring a random write. An LSM tree, however, flushes the memtable once (single sequential write) and then performs a few large sequential merges.
The amortized cost per key can be expressed as:
$$\text{Amplification}_{\text{LSM}} \approx 1 + \frac{\log_T(N)}{B}$$
where T is the level size ratio, N the total number of keys, and B the average block size. With T = 10 and B = 4 KB, the extra cost stays low even for billions of keys.
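Plugging in those numbers makes the point concrete (treating B as the number of entries amortized per 4 KB block, an assumption made here for the arithmetic):

```python
import math

T, N, B = 10, 1_000_000_000, 4096  # size ratio, total keys, entries per block
extra_cost = math.log(N, T) / B    # log_T(N) / B = 9 / 4096
print(f"amplification ≈ {1 + extra_cost:.4f}")  # -> amplification ≈ 1.0022
```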
3. Write Batching and Compression
Because LSM flushes operate on batches of thousands of entries, they can apply compression (e.g., Snappy, Zstd) efficiently. The compression ratio reduces the amount of data written to disk, further improving throughput.
```python
import time
import zstandard as zstd

def compress_and_flush(memtable):
    # Serialize the batch in key order so the SSTable stays sorted and the
    # compressor can exploit shared prefixes across adjacent keys.
    data = b"".join(f"{k}:{v}\n".encode() for k, v in sorted(memtable.items()))
    compressed = zstd.compress(data, level=3)
    with open(f"/data/lsm/L0/{int(time.time())}.sst", "wb") as f:
        f.write(compressed)  # one sequential write for the whole batch
```
Batching also enables CPU‑side optimizations like vectorized sorting before the flush.
Read Path Trade‑offs
While LSM shines on writes, reads must consult multiple levels:
- Memtable – fast in‑memory lookup (O(log M)).
- Bloom Filters – per‑SSTable probabilistic structures to skip non‑relevant files.
- Level Lookup – at most one SSTable per level (due to non‑overlapping property), reducing I/O.
The cost can be expressed as:
$$\text{Read cost} = \underbrace{O(\log M)}_{\text{memtable}} + \underbrace{L \times \left(1 - \text{BF}_{\text{hit}}\right)}_{\text{disk seeks}}$$
where M is the number of memtable entries, L is the number of levels, and BF_hit is the Bloom-filter hit rate. Proper tuning of Bloom filter size (e.g., 10 bits per key) can bring false-positive rates below 1 %, making most reads hit a single SSTable.
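The 10-bits-per-key rule of thumb is easy to verify with the standard Bloom-filter formulas: with m/n bits per key the optimal hash count is k = (m/n)·ln 2, which yields a false-positive rate of (1/2)^k. A quick check:

```python
import math

def bloom_false_positive_rate(bits_per_key: float) -> float:
    k = bits_per_key * math.log(2)  # optimal number of hash functions
    return 0.5 ** k                 # false-positive probability at the optimum

print(f"{bloom_false_positive_rate(10):.2%}")  # 0.82% -- below the 1% target
```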
Real‑World Example: Cassandra
Apache Cassandra uses an LSM design with configurable compaction strategies (Size‑Tiered, Leveled, Time‑Window). A production benchmark on a 48‑core server showed write throughput of 300 k ops/s versus 80 k ops/s for a comparable B‑tree‑based PostgreSQL instance under the same workload (Cassandra docs).
Detailed Compaction Strategies
Size‑Tiered Compaction (STCS)
- Groups SSTables of similar size.
- Merges when a threshold of N files is reached.
- Less I/O but higher read amplification due to overlapping runs.
Leveled Compaction (LCS)
- Enforces non‑overlapping runs per level.
- Guarantees at most one SSTable per level for a key.
- Higher write amplification, lower read amplification.
Choosing the right strategy depends on the workload mix. For pure write‑heavy scenarios, STCS often yields higher write throughput because it postpones merges, whereas LCS is preferable when read latency is critical.
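To make the size-tiered trigger concrete, here is a simplified sketch of candidate selection; the similarity ratio and the four-file threshold are illustrative defaults, not Cassandra's exact rules:

```python
def stcs_candidates(sstable_sizes, threshold=4, bucket_ratio=2.0):
    """Bucket SSTables by similar size; return the first bucket ready to merge."""
    buckets = []
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            # "Similar" = within bucket_ratio of the bucket's smallest table.
            if size <= bucket[0] * bucket_ratio:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return next((b for b in buckets if len(b) >= threshold), None)

# Four ~100 MB tables trigger a merge; the 1 GB table is left alone.
print(stcs_candidates([100, 110, 120, 130, 1000]))  # [100, 110, 120, 130]
```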
Performance Benchmarks
| System | Write Throughput (ops/s) | Avg. Read Latency (ms) | Storage Used (GB) |
|---|---|---|---|
| RocksDB (LSM, LCS) | 350 k | 0.9 | 1.2× data size |
| LevelDB (LSM, STCS) | 280 k | 1.2 | 1.4× data size |
| PostgreSQL (B‑Tree) | 85 k | 0.4 | 1.0× data size |
| MySQL InnoDB (B‑Tree) | 80 k | 0.5 | 1.0× data size |
Sources: Benchmarks from the original LevelDB paper (LevelDB PDF) and RocksDB performance guide.
The numbers illustrate that LSM systems can sustain 3–4× higher write rates at the cost of modestly higher read latency and extra storage for multiple SSTable generations.
Design Considerations for Practitioners
- Memtable Size – Larger memtables reduce flush frequency but increase recovery time after a crash. Typical values: 64 MB–256 MB.
- Bloom Filter Allocation – Allocate ~10 bits/key for a 1 % false‑positive rate; tune based on read‑heavy vs. write‑heavy ratio.
- Compaction Throttling – Limit background I/O so compaction does not starve foreground reads; RocksDB exposes `max_background_compactions` and `rate_limit_mb_per_sec` for this (a generic throttling sketch follows this list).
- SSD Wear – While LSM reduces random writes, compaction still generates write amplification. Monitoring the write amplification factor (WAF) is essential for SSD longevity.
- Hybrid Approaches – Some deployments blend engines, e.g. pairing the LSM-based MyRocks storage engine with B-tree InnoDB tables in the same MySQL/MariaDB server, providing balanced performance.
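As a generic illustration of the compaction-throttling idea above, here is a minimal token-bucket limiter that a compaction loop could call before each batch of writes. It is a sketch of the concept, not any engine's actual API:

```python
import time

class TokenBucketLimiter:
    """Caps background I/O at `rate_bytes` per second."""

    def __init__(self, rate_bytes: int):
        self.rate = rate_bytes
        self.tokens = float(rate_bytes)
        self.last = time.monotonic()

    def acquire(self, nbytes: int):
        # Refill tokens based on elapsed time, then wait until enough accrue.
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)
```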
Key Takeaways
- Sequential writes in LSM trees turn costly random I/O into fast appends, dramatically boosting write throughput.
- Compaction amortizes the cost of maintaining sorted order; tuning compaction strategy is critical for workload balance.
- Read operations may touch multiple SSTables, but Bloom filters and level organization keep overhead low.
- Write amplification is higher in LSM, but modern SSDs handle sequential writes efficiently, and the trade‑off is worthwhile for write‑dominant workloads.
- Real‑world systems like RocksDB, LevelDB, and Cassandra prove the model at scale, delivering 3–4× higher writes than traditional B‑tree engines.