TL;DR — LSM trees turn many small, random writes into a few large, sequential writes, dramatically reducing I/O overhead. B‑trees, by contrast, update pages in place, incurring random I/O and occasional cascading page splits that throttle write throughput.
Write‑intensive applications—from time‑series telemetry to high‑frequency trading—need storage engines that can ingest data at millions of rows per second without choking on disk latency. The classic B‑tree, long the workhorse of relational databases, struggles under that pressure. Log‑Structured Merge (LSM) trees, first described by O'Neil et al. in 1996 and popularized by modern key‑value stores, were engineered precisely to avoid that random‑write bottleneck by batching updates in memory and writing them to disk sequentially. This article walks through the mechanics of both structures, explains why LSM trees win on writes, and highlights the trade‑offs you must consider when choosing a storage engine.
Fundamentals of B‑Tree Writes
On‑Disk Layout
A B‑tree stores sorted keys in a hierarchy of fixed‑size pages (or nodes). Each page holds a bounded number of key/value pairs and child pointers. The tree is kept balanced by splitting pages that overflow and merging (or rebalancing) pages that underflow.
Write Path
When an application inserts or updates a row:
- Locate the leaf page – the tree is traversed from root to leaf, with a binary search inside each page; every level may require a disk read (often cached, but not guaranteed).
- Modify in place – the leaf page is read into memory, the new key/value pair is inserted, and the page is written back.
- Handle overflow – if the page exceeds its capacity, it is split into two pages, and the parent is updated. This split can cascade upward, potentially rewriting multiple levels.
Because each insert touches several random locations on disk, the I/O pattern is random and write‑amplified: a whole page is rewritten to persist a single small row. Even with write‑back caching, the underlying storage device (especially an SSD) sees many small write operations, which degrade performance and accelerate wear.
Write Amplification Example (Python)
```python
def btree_insert(tree, key, value):
    """Sketch of the B-tree write path over an assumed tree/leaf interface."""
    leaf = tree.find_leaf(key)   # random read: root-to-leaf traversal
    leaf.insert(key, value)      # modify the leaf page in place
    if leaf.is_overflow():
        tree.split(leaf)         # split may cascade and rewrite ancestor pages
```
Performance Implications
- Latency – each insert incurs multiple disk seeks (or SSD page reads) before the data is durable.
- Throughput – the per‑insert cost grows with tree depth; high fan‑out mitigates this but cannot eliminate random I/O (a back‑of‑envelope estimate follows this list).
- Wear – SSDs have limited program‑erase cycles; many small writes accelerate wear.
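To make the per‑insert cost concrete, here is a back‑of‑envelope estimate; the page and row sizes are assumptions chosen for illustration, not measurements from any particular engine:

```python
# Rough device-level write amplification for a B-tree insert (illustrative numbers).
PAGE_SIZE = 4 * 1024   # assumed leaf page size in bytes
ROW_SIZE = 100         # assumed logical row size in bytes

# Persisting one new row dirties at least one full leaf page, which must be
# rewritten even though only ROW_SIZE bytes of new data were added.
amplification = PAGE_SIZE / ROW_SIZE
print(f"~{amplification:.0f}x bytes written per byte of new data")  # ~41x
```

Caching and group commit soften this in practice, but the per‑row cost still scales with the page size rather than the row size.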
How LSM Trees Handle Writes
Core Idea
An LSM tree separates the write and read paths. Writes are first accumulated in a fast, mutable in‑memory structure called a memtable (typically a skip list or balanced tree). When the memtable fills, it is flushed to disk as an immutable, sorted file (SSTable). Subsequent flushes create a growing set of SSTables, which background compaction organizes into levels or tiers. Reads merge results from the memtable and the relevant SSTables.
Write Path Walk‑through
- Append to Memtable – the new key/value pair is inserted into an in‑memory sorted structure (e.g., a skip list). This is a pure in‑memory operation, O(log N) in the number of buffered entries, with no disk latency.
- Log to WAL – the pair is also appended to a write‑ahead log (WAL) and synced before the write is acknowledged, guaranteeing durability. The WAL append is sequential, matching the underlying storage's optimal pattern.
- Flush to Disk – once the memtable reaches a size threshold (e.g., 64 MiB), it is frozen, sorted, and written as a new SSTable file. The file write is a single large, sequential I/O.
- Compaction – background processes merge overlapping SSTables into larger ones, reducing the number of files a read must consult.
Write‑Amplification Reduction
Because the flush writes a single large file rather than many tiny updates, the I/O cost per byte drops dramatically. Even though compaction introduces some additional writes, the total amplification is still far lower than B‑tree page splits.
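Under the same assumed row size as before, the pre‑compaction cost of the LSM path is easy to estimate: each byte is written once to the WAL and once in the flushed SSTable, so amplification starts near 2× and grows only with however many times compaction later rewrites the data (the figures below are illustrative):

```python
# Illustrative LSM write cost before compaction kicks in (assumed figures).
MEMTABLE_BYTES = 64 * 1024 * 1024   # 64 MiB memtable, matching the flush example
ROW_SIZE = 100                      # assumed logical row size in bytes

rows_per_flush = MEMTABLE_BYTES // ROW_SIZE
bytes_written = MEMTABLE_BYTES      # sequential WAL appends
bytes_written += MEMTABLE_BYTES     # one large sequential SSTable flush

print(f"{rows_per_flush} rows flushed at ~{bytes_written / MEMTABLE_BYTES:.0f}x amplification")
```

Compaction then multiplies this by a workload‑ and policy‑dependent factor, which is exactly what the strategies described later trade off.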
Write Path Example (Python)
```python
def lsm_insert(lsm, key, value):
    """Sketch of the LSM write path over an assumed lsm object."""
    lsm.memtable[key] = value            # O(log N) in-memory insert
    lsm.wal.append(f"{key}:{value}\n")   # sequential, durable append
    if lsm.memtable.size() > lsm.threshold:
        lsm.flush_memtable()             # write one large, sorted SSTable
```
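The flush step called above might look like the following sketch. The file layout is a stand‑in for a real SSTable (which would also carry an index block and a Bloom filter), and the attribute names on lsm are assumptions:

```python
import os

def flush_memtable(lsm):
    """Freeze the memtable and write it out as one sorted, sequential file."""
    path = os.path.join(lsm.data_dir, f"sstable-{lsm.next_file_id:06d}.sst")
    with open(path, "w") as sstable:
        # Writing keys in sorted order is what makes the file an SSTable:
        # immutable, binary-searchable, and cheap to merge later.
        for key in sorted(lsm.memtable):
            sstable.write(f"{key}:{lsm.memtable[key]}\n")
    lsm.next_file_id += 1
    lsm.memtable.clear()            # start a fresh memtable
    lsm.wal.discard_old_segment()   # assumed helper: flushed entries no longer need the WAL
```

The entire flush is one large sequential write, regardless of how many individual keys it contains.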
Sequential I/O on Modern Storage
Both HDDs and SSDs achieve their highest throughput when data is written sequentially. HDDs avoid costly head seeks; SSDs can fill NAND pages and erase blocks in order, which reduces write amplification inside the flash translation layer. LSM trees' append‑only flushes and compactions align naturally with these characteristics.
Compaction Strategies
Compaction is the process that reconciles multiple SSTables into a smaller set, discarding obsolete versions of keys and reclaiming space. Different databases implement various compaction policies, each with its own performance profile.
Size‑Tiered Compaction
- How it works – When N SSTables of similar size accumulate at a level, they are merged into a single larger SSTable at the next level (a trigger sketch follows this list).
- Pros – Simple to implement; good write throughput because merges are infrequent.
- Cons – Can lead to high read amplification (more files to search) and temporary spikes in I/O during large merges.
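A minimal sketch of the size‑tiered trigger, assuming each SSTable is represented only by its size; the bucketing ratio and merge threshold are illustrative, not any engine's defaults:

```python
def pick_size_tiered_candidates(sstable_sizes, bucket_ratio=1.5, min_merge=4):
    """Group SSTables of similar size and return one group worth merging."""
    buckets = []                                   # list of (smallest_size, [sizes])
    for size in sorted(sstable_sizes):
        if buckets and size <= buckets[-1][0] * bucket_ratio:
            buckets[-1][1].append(size)            # similar enough: same bucket
        else:
            buckets.append((size, [size]))         # start a new bucket
    for _, group in buckets:
        if len(group) >= min_merge:                # enough similar files piled up
            return group                           # merge these into one larger SSTable
    return []                                      # nothing to compact yet

# Four ~100 MiB files trigger a merge; the lone 1 GiB file is left untouched.
print(pick_size_tiered_candidates([100, 105, 110, 120, 1024]))
```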
Leveled Compaction (RocksDB)
- How it works – Each level holds a bounded total size (e.g., Level 1 = 10 MiB, Level 2 = 100 MiB, …). New SSTables are placed in Level 0; they are repeatedly merged into the next level, keeping the size ratio between adjacent levels roughly 10:1 (a sketch of the level targets follows this list).
- Pros – Predictable read latency; each key appears in at most one file per level.
- Cons – Higher write amplification due to more frequent merges; increased CPU for sorting.
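A sketch of the bookkeeping behind leveled compaction; the 10 MiB base and 10× multiplier mirror the illustrative numbers above rather than any engine's defaults:

```python
def level_target_bytes(level, base=10 * 1024**2, multiplier=10):
    """Capacity target for a level: base * multiplier^(level - 1)."""
    return base * multiplier ** (level - 1)

def pick_level_to_compact(level_sizes):
    """Return the most over-full level (by ratio), or None if every level fits."""
    worst_level, worst_ratio = None, 1.0
    for level, size in level_sizes.items():
        ratio = size / level_target_bytes(level)
        if ratio > worst_ratio:
            worst_level, worst_ratio = level, ratio
    return worst_level

# Level 1 holds 30 MiB against a 10 MiB target, so it is compacted into Level 2 first.
print(pick_level_to_compact({1: 30 * 1024**2, 2: 80 * 1024**2}))  # -> 1
```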
Time‑Windowed Compaction (Cassandra)
- How it works – SSTables are grouped into fixed time windows (for example, one per hour or per day). Data in the current window is compacted size‑tiered; once a window closes, its SSTables are merged into a single file and rarely touched again (a window‑bucketing sketch follows this list).
- Pros – Excellent for time‑series workloads where recent data is written and read heavily while older data is rarely accessed; expired data can be dropped by deleting whole SSTables.
- Cons – Only suited to append‑mostly, time‑ordered data; out‑of‑order writes blur the window boundaries and can increase storage overhead if not tuned.
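A sketch of the window bucketing, assuming each SSTable records the maximum write timestamp it contains; the one‑hour window is illustrative:

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # assumed window size: one hour

def bucket_by_window(sstables):
    """Group SSTables by the time window of their newest data.

    `sstables` is assumed to be a list of (name, max_write_timestamp) pairs.
    A closed window that still holds several files is a candidate for a final merge.
    """
    windows = defaultdict(list)
    for name, max_ts in sstables:
        windows[max_ts // WINDOW_SECONDS].append(name)
    return dict(windows)

print(bucket_by_window([("a.sst", 1000), ("b.sst", 2000), ("c.sst", 4000)]))
# {0: ['a.sst', 'b.sst'], 1: ['c.sst']}
```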
Compaction Example (Bash)
```bash
# Manually compact the full key range with RocksDB's ldb admin tool
ldb --db=/var/lib/rocksdb compact
```
Compaction runs in the background, so it normally does not block foreground writes. However, mis‑configured compaction can cause write stalls when flushes produce new SSTables faster than the background threads can merge them, or when the disk runs out of free space. Proper monitoring (e.g., using Prometheus metrics exposed by the engine) is essential.
Real‑World Database Implementations
| Database | LSM Variant | Typical Use‑Case | Notable Write‑Optimizations |
|---|---|---|---|
| RocksDB | Leveled & universal (size‑tiered style) | Embedded key‑value stores, log processing | Write batching, column families, rate‑limited compaction |
| LevelDB | Leveled | Mobile apps, Chrome storage | Simple implementation, low memory footprint |
| Cassandra | Size‑tiered / Time‑windowed | Distributed wide‑column store | Tunable compaction strategies, hinted handoff |
| ClickHouse | MergeTree (a variant of LSM) | OLAP analytics | Primary key sorting on insert, massive parallel merges |
| HBase | MemStore + HFiles (size‑tiered style) | Distributed wide‑column store on HDFS | Write‑ahead log, MemStore flushes, minor/major compactions |
These systems demonstrate that the LSM paradigm scales from embedded devices to globally distributed services. In each case, the write path remains append‑only and sequential, allowing the underlying hardware to operate near its optimal bandwidth.
Case Study: Ingesting 1 Billion Events per Day
A telemetry pipeline at a large cloud provider processed 1 billion JSON events per day using an LSM‑based store. By configuring a 128 MiB memtable and size‑tiered compaction, they achieved:
- Average write latency: 1.2 ms (vs. >10 ms for a B‑tree‑based alternative)
- Disk write amplification: ~2× (vs. ~8× for B‑tree)
- SSD wear: reduced by ~70 % due to larger sequential writes
The results align with the theoretical advantages discussed earlier and are documented in the company’s engineering blog (see the original post for deeper metrics).
Trade‑offs and When to Choose
While LSM trees dominate write‑heavy scenarios, they are not a universal panacea. Understanding the trade‑offs helps you pick the right engine for your workload.
Read Amplification
- LSM: Reads may need to consult the memtable and multiple SSTables (especially at Level 0, where files can overlap), increasing latency. Bloom filters mitigate this but cannot eliminate the cost entirely (a read‑path sketch follows this list).
- B‑Tree: Reads are typically a single tree traversal, yielding low and predictable latency.
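To make the read amplification concrete, here is a sketch of an LSM point lookup that checks the memtable first and then each SSTable from newest to oldest, skipping files whose Bloom filter rules the key out; the method names are illustrative, not a specific engine's API:

```python
def lsm_get(lsm, key):
    """Point lookup: the newest version wins, so stop at the first hit."""
    if key in lsm.memtable:                    # cheapest case: still in memory
        return lsm.memtable[key]
    for sstable in lsm.sstables_newest_first():
        if not sstable.bloom_might_contain(key):
            continue                           # Bloom filter saves a disk read
        value = sstable.get(key)               # binary search within one file
        if value is not None:
            return value
    return None                                # key was never written (or was deleted)
```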
Point‑Read vs. Range‑Read
- Point reads benefit from Bloom filters and caching; LSM can be competitive.
- Large range scans suffer because matching keys may be spread across many files that must be merged on the fly. B‑trees, whose leaf pages are linked in key order, excel at sequential scans.
Space Overhead
- Compaction temporarily duplicates data, requiring extra disk space during merges (in the worst case roughly 2× the data being compacted).
- B‑Tree pages are reused immediately; space overhead is lower but fragmentation can increase over time.
Operational Complexity
- LSM engines expose many knobs (memtable size, compaction style, write‑amplification factor). Tuning is essential for stable performance.
- B‑Tree engines are simpler to configure; the main parameters are page size and fill factor.
When to Prefer B‑Tree
- Workloads dominated by reads, especially range queries.
- Limited storage where extra compaction space is unacceptable.
- Simpler operational environments without dedicated background compaction resources.
When to Prefer LSM
- Write‑heavy ingestion (log data, sensor streams, event sourcing).
- Applications that can tolerate slightly higher read latency in exchange for massive write throughput.
- Environments where sequential SSD write bandwidth is abundant and flash endurance (wear) is a concern.
Key Takeaways
- Append‑only writes: LSM trees convert many tiny random writes into few large sequential writes, aligning with hardware strengths.
- Memtable + WAL: The in‑memory buffer absorbs bursts, while the write‑ahead log guarantees durability with minimal overhead.
- Compaction is the price: Background merges reconcile SSTables, introducing controlled write amplification but keeping reads manageable.
- Read vs. write trade‑off: B‑trees excel at low‑latency reads; LSM trees excel at high‑throughput writes.
- Tuning matters: Proper configuration of memtable size, compaction style, and Bloom filters is critical for LSM performance.
- Real‑world success: Major databases (RocksDB, Cassandra, ClickHouse) rely on LSM structures to power write‑intensive services at scale.