TL;DR — LSM trees batch writes into immutable files and use carefully tuned compaction pipelines, turning many small random writes into sequential I/O. This design cuts write amplification dramatically, making LSM‑based storage engines the backbone of high‑throughput distributed databases.
Distributed databases need to ingest massive streams of data while keeping latency low and storage costs predictable. Traditional B‑tree engines suffer from random‑write overhead: each update may trigger page splits, index rebalancing, and costly disk seeks, inflating the amount of data actually written to storage. Log‑Structured Merge (LSM) trees sidestep that problem by converting writes into sequential appends and deferring expensive reorganizations to background compaction jobs. The result is a dramatic reduction in write amplification, the ratio of bytes written to storage versus bytes supplied by the client. In this article we unpack the inner workings of LSM trees, explore the compaction strategies that keep amplification low, and discuss how distributed systems like Cassandra, RocksDB, and ScyllaDB adapt these ideas for scale.
Understanding Write Amplification
Write amplification is a performance metric that matters to anyone who cares about SSD endurance, network bandwidth, or operating costs in a cloud environment. It can be expressed simply:
```
write_amplification = bytes_written_to_disk / bytes_received_from_clients
```
When the amplification factor climbs above 2× or 3×, the storage layer starts to dominate latency budgets and hardware wear. In a B‑tree, a single INSERT may cause:
- A write to the leaf page.
- A write to the parent page if the leaf split.
- Additional writes to the log (WAL) for durability.
All of these happen synchronously, often causing random I/O. By contrast, an LSM tree writes the new record to an in‑memory memtable and appends a record to a write‑ahead log (WAL). The heavy lifting of merging, sorting, and reorganizing data occurs later, in large sequential passes that are far cheaper on modern storage media. The net effect is a substantially lower amplification factor than a B‑tree incurs under the same random‑update workload.
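For intuition, here is a toy calculation of the metric defined above; the byte counts are invented purely for illustration:

```python
def write_amplification(bytes_written_to_disk, bytes_received_from_clients):
    return bytes_written_to_disk / bytes_received_from_clients

client_bytes = 1 << 30               # clients send 1 GiB of updates
btree_bytes = 4 * client_bytes       # page splits + rebalancing + WAL (illustrative)
lsm_bytes = int(1.5 * client_bytes)  # WAL + flush + compaction (illustrative)

print(write_amplification(btree_bytes, client_bytes))  # 4.0
print(write_amplification(lsm_bytes, client_bytes))    # 1.5
```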
“Write amplification is the silent killer of SSD lifespan; reducing it is a primary design goal for any high‑write workload.” — The RocksDB documentation
LSM Tree Fundamentals
At a high level, an LSM tree consists of three logical components:
- Memtable – an ordered in‑memory data structure (often a skip list) that absorbs writes.
- Write‑Ahead Log (WAL) – an append‑only file guaranteeing durability before the memtable is flushed.
- Sorted String Tables (SSTables) – immutable, on‑disk files generated when the memtable reaches a size threshold.
When the memtable fills, the engine performs a flush: the memtable is frozen, its contents are sorted (if not already), and written as a new SSTable at Level 0. Level 0 files may overlap in key ranges, which is acceptable because they are short‑lived. Over time, background compaction jobs merge overlapping SSTables into higher levels (Level 1, Level 2, …), each level having stricter non‑overlap guarantees and larger target file sizes.
Write Path Example (Python)
```python
# Simplified LSM write path
def write(key, value, memtable, wal, max_memtable_size):
    # 1. Append to the WAL for durability
    wal.append(f"{key}:{value}\n")
    # 2. Insert into the in-memory skip list
    memtable.insert(key, value)
    # 3. Flush if the size threshold is exceeded
    if memtable.size() >= max_memtable_size:
        flush_memtable(memtable, wal)
```
The flush_memtable routine creates a new SSTable and resets the memtable, allowing the system to continue accepting writes with minimal pause.
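A matching sketch of that routine, assuming a memtable that exposes keys()/get()/clear() and a WAL that can be truncated once its entries are safely on disk (both are assumptions of this sketch, not a real engine's API):

```python
import itertools
import json

_sstable_ids = itertools.count()  # monotonically increasing file IDs (assumed scheme)

def flush_memtable(memtable, wal):
    # Persist the frozen memtable as a sorted, immutable Level-0 SSTable.
    path = f"sstable-{next(_sstable_ids):06d}.sst"
    with open(path, "w") as sst:
        for key in sorted(memtable.keys()):
            sst.write(json.dumps([key, memtable.get(key)]) + "\n")
    memtable.clear()  # a fresh memtable starts absorbing new writes
    wal.truncate()    # flushed entries no longer need WAL protection
```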
Why Sequential I/O Wins
Modern SSDs and NVMe drives excel at sequential writes, achieving up to 3× higher throughput than random writes. By converting many tiny client writes into a single large SSTable write, LSM trees exploit this hardware characteristic. Moreover, sequential writes reduce write amplification at the device level because the SSD’s internal garbage collection can operate more efficiently on contiguous data.
Compaction Strategies that Limit Amplification
Compaction is the heart of amplification control. Different strategies trade off write amplification, read amplification, and space amplification. The two most common designs are Leveled Compaction and Tiered Compaction.
Leveled Compaction (LC)
Leveled compaction enforces a strict size ratio between consecutive levels (typically 10×). Each level contains non‑overlapping SSTables, and when a level exceeds its quota, files are merged into the next level, discarding deleted or overwritten keys.
Pros
- Low read amplification: a key is found in at most one SSTable per level.
- Predictable space usage.
Cons
- Higher write amplification because data may be rewritten multiple times across levels.
RocksDB’s implementation of leveled compaction is described in detail in the RocksDB design doc. In practice, leveled setups trade write amplification for read efficiency: a key may be rewritten roughly once for each level it migrates through, so tuning level sizes and fanout is what keeps the total bounded under write‑heavy workloads.
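To make the size‑ratio rule concrete, the sketch below computes per‑level capacities and how many levels a dataset occupies; the 10× fanout and 256 MiB Level‑1 target mirror common RocksDB‑style defaults and are assumptions, not universal constants:

```python
import math

FANOUT = 10            # size ratio between consecutive levels
L1_TARGET = 256 << 20  # 256 MiB Level-1 target (assumed default)

def level_capacity(level):
    # L1 holds L1_TARGET bytes; each deeper level is FANOUT times larger.
    return L1_TARGET * FANOUT ** (level - 1)

def levels_needed(total_bytes):
    # Smallest depth whose deepest level can hold the bulk of the data.
    return max(1, 1 + math.ceil(math.log(total_bytes / L1_TARGET, FANOUT)))

print(levels_needed(1 << 40))  # 1 TiB of data needs ~5 levels at 10x fanout
```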
Tiered Compaction (TC)
Tiered compaction groups SSTables into tiers of roughly equal size. When a tier accumulates enough files (e.g., 4–10), they are merged into a single larger file that moves to the next tier. Overwrites are eliminated only when the tier is compacted.
Pros
- Lower write amplification: each key is typically rewritten only once per tier.
- Better write throughput for bulk ingestion.
Cons
- Higher read amplification: a key may appear in many SSTables within the same tier.
- More storage overhead due to duplicate keys across tiers.
Apache Cassandra’s default compaction strategy, Size‑Tiered Compaction Strategy (STCS), follows this model. The official Cassandra documentation notes that STCS “optimizes for write throughput at the cost of read latency” (Cassandra architecture).
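A minimal sketch of that size‑tiered trigger, with SSTables represented by their sizes in bytes; the bucket ratio and the Cassandra‑like min_threshold of 4 are simplifying assumptions (real STCS buckets by average size with configurable low/high bounds):

```python
def pick_compaction_bucket(sstable_sizes, min_threshold=4, bucket_ratio=1.5):
    # Group SSTables of roughly equal size, then compact the first bucket
    # that has accumulated at least min_threshold files.
    buckets = []
    for size in sorted(sstable_sizes):
        if buckets and size <= buckets[-1][0] * bucket_ratio:
            buckets[-1].append(size)
        else:
            buckets.append([size])
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket  # merge these into one larger SSTable in the next tier
    return None  # nothing to do yet

print(pick_compaction_bucket([10, 11, 12, 13, 100, 400]))  # -> [10, 11, 12, 13]
```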
Hybrid Approaches
Many modern engines blend the two. Universal Compaction (RocksDB's tiered‑style mode) merges entire sorted runs based on size ratios rather than maintaining strict levels, striking a middle ground. Time‑Window Compaction (used by ScyllaDB and Cassandra) adds a temporal dimension, grouping data by ingestion time to improve TTL handling.
Tuning Compaction for Distributed Deployments
In a distributed database, each node runs its own LSM instance, but cluster‑wide considerations arise:
- Compaction throttling: Prevent background compaction from starving foreground writes. Use rate limits (e.g., RocksDB's `rate_limiter`, built with `NewGenericRateLimiter` and a bytes‑per‑second budget) to keep CPU and I/O usage bounded; a minimal throttling sketch follows this list.
- Staggered schedules: Stagger compaction windows across nodes to avoid synchronized I/O spikes.
- Resource isolation: Run compaction on dedicated threads or cgroups to protect query latency.
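As a concrete example of the throttling mentioned above, here is a minimal token‑bucket limiter a compaction thread could call before each write; the rate and sleep granularity are arbitrary choices for the sketch:

```python
import time

class CompactionThrottle:
    """Token bucket capping compaction I/O at rate_bytes_per_sec (sketch)."""

    def __init__(self, rate_bytes_per_sec):
        self.rate = rate_bytes_per_sec
        self.tokens = rate_bytes_per_sec  # allow an initial one-second burst
        self.last = time.monotonic()

    def acquire(self, nbytes):
        # Block until the bucket holds enough tokens to cover this write.
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# throttle = CompactionThrottle(200 << 20)   # ~200 MiB/s budget
# throttle.acquire(len(chunk)); sstable.write(chunk)
```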
A practical Bash snippet to monitor compaction progress in RocksDB:
```bash
#!/usr/bin/env bash
# Show ongoing compaction jobs and their throughput
# (assumes a Prometheus-style exporter is listening on port 8080)
curl -s http://localhost:8080/metrics | grep rocksdb_compaction
```
Tiered and Leveled Designs in Distributed Settings
When scaling out, the choice between leveled and tiered compaction influences not only a single node’s performance but also network traffic and consistency guarantees.
Replication and Write Amplification
In a quorum‑based system (e.g., Cassandra’s QUORUM writes), each write is sent to multiple replicas. If each replica uses tiered compaction, the total cluster‑wide write amplification roughly equals the per‑node amplification multiplied by the replication factor. Conversely, leveled compaction’s higher per‑node amplification may be offset by its lower read amplification, reducing cross‑node read traffic.
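The arithmetic is worth spelling out; the numbers below are illustrative:

```python
def cluster_write_amplification(per_node_wa, replication_factor):
    # Every client byte lands on RF replicas, each amplifying it locally.
    return per_node_wa * replication_factor

# e.g., per-node amplification of 4x under RF=3: every client byte
# ultimately costs ~12 bytes of device writes across the cluster.
print(cluster_write_amplification(4.0, 3))  # 12.0
```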
Hotspot Mitigation
Distributed workloads often exhibit hot keys (e.g., counters). LSM trees mitigate hotspots by sharding the key space across nodes and by using write‑back caches that absorb bursts before flushing. Some systems employ partition‑level compaction queues to prioritize less‑active partitions, preventing hot partitions from monopolizing I/O.
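One shape such a queue might take, sketched with a min‑heap keyed on recent write activity (the partition IDs and rates are invented):

```python
import heapq

def compaction_order(partitions):
    # partitions: {partition_id: recent_write_rate}; compact cold partitions
    # first so hot ones keep their I/O budget for foreground writes.
    heap = [(rate, pid) for pid, rate in partitions.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

print(compaction_order({"users:42": 9000.0, "logs:7": 12.0, "idx:3": 150.0}))
# -> ['logs:7', 'idx:3', 'users:42']
```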
Consistency and Compaction Lag
Because compaction is asynchronous, a newly written value may still reside in a Level‑0 SSTable on one replica while already compacted on another. This can cause read‑repair operations: a read that discovers a stale version triggers a background repair to reconcile the replicas. Systems like ScyllaDB expose metrics such as read_repair_rate to help operators gauge the impact of compaction lag on consistency.
Operational Considerations and Trade‑offs
Running LSM‑based storage at scale requires careful monitoring and tuning.
Metrics to Watch
| Metric | Why It Matters |
|---|---|
| `memtable_flushes_total` | A high flush rate may indicate memtables that are too small, causing unnecessary SSTable proliferation. |
| `compaction_bytes_written` | Direct proxy for write amplification; spikes suggest aggressive compaction. |
| `level0_sst_files_total` | Excess Level‑0 files increase read amplification and can trigger compaction storms. |
| `wal_bytes_written` | Helps assess durability overhead; large WAL sizes may require log rotation. |
These metrics are exposed by most engines via Prometheus endpoints, though exact names vary by exporter; a gauge such as `rocksdb_live_files`, for example, reports the current number of SSTables.
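As a rough operational check, those counters can be combined into a live write‑amplification estimate. The sketch below scrapes a Prometheus text endpoint and assumes the metric names from the table (and that WAL bytes approximate client bytes); real exporters will differ:

```python
import re
import urllib.request

def scrape(url="http://localhost:8080/metrics"):
    text = urllib.request.urlopen(url).read().decode()
    return {m.group(1): float(m.group(2))
            for m in re.finditer(r"^(\w+)\s+([0-9.eE+-]+)$", text, re.M)}

metrics = scrape()
# Device writes ~= WAL + compaction (flush bytes omitted for brevity);
# WAL bytes stand in for logical client bytes.
wal = metrics["wal_bytes_written"]
wa = (wal + metrics["compaction_bytes_written"]) / wal
print(f"approximate write amplification: {wa:.2f}x")
```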
Choosing Memtable Size
A larger memtable reduces flush frequency, lowering Level‑0 churn. However, it also increases recovery time after a crash because the WAL must be replayed for a larger in‑memory state. A typical rule of thumb is to allocate 10–20 % of total RAM to memtables, leaving the remainder for OS cache and query buffers.
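Applying that rule of thumb is simple arithmetic; the 15 % midpoint below is just one point in the stated band:

```python
def memtable_budget_bytes(total_ram_bytes, fraction=0.15, num_memtables=1):
    # Reserve `fraction` of RAM for memtables, split across active memtables.
    return int(total_ram_bytes * fraction) // num_memtables

# On a 64 GiB node: ~9.6 GiB of total memtable space.
print(memtable_budget_bytes(64 << 30) / (1 << 30), "GiB")
```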
SSD Endurance Planning
Even with reduced write amplification, high‑throughput workloads can still stress SSD endurance. Operators should:
- Enable TRIM – ensures the SSD can reclaim freed blocks after compaction deletes obsolete keys.
- Monitor `wear_leveling_count` – many SSDs expose this attribute via SMART; see the helper after this list.
- Plan for over‑provisioning – allocating extra spare capacity prolongs device life.
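A small helper for that SMART check, assuming smartctl is installed and the device path is correct; attribute names vary by vendor, so the filter is deliberately loose:

```python
import subprocess

def ssd_wear_attributes(device="/dev/nvme0"):
    # `smartctl -A` prints the device's SMART attribute table.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    return [line for line in out.splitlines()
            if "wear" in line.lower() or "percentage used" in line.lower()]

print("\n".join(ssd_wear_attributes()))
```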
Example: Tuning RocksDB for a 100 TB Cluster
```bash
# Illustrative only: "rocksdb-cli" stands in for whatever management tooling
# wraps your deployment; the option names below are real RocksDB options,
# normally set through the OPTIONS file or the DB::SetOptions API.
# 64 MiB memtable, more parallel background jobs, ~200 MiB/s write-stall cap.
rocksdb-cli --set-option=write_buffer_size=67108864 \
            --set-option=max_background_jobs=8 \
            --set-option=delayed_write_rate=209715200
```
A configuration along these lines balances write throughput against bounded background I/O, keeping write amplification low and predictable for a mixed OLTP/OLAP workload.
Key Takeaways
- LSM trees convert random writes into sequential appends, drastically cutting write amplification compared to B‑tree engines.
- Compaction strategies (leveled, tiered, universal) are the primary knobs for controlling the trade‑off between write, read, and space amplification.
- In distributed databases, replication factor, hotspot patterns, and compaction scheduling jointly affect cluster‑wide amplification and consistency.
- Monitoring memtable flushes, Level‑0 file counts, and compaction throughput is essential for maintaining predictable performance.
- Proper SSD wear‑leveling practices and tuned memtable sizes ensure that reduced amplification translates into longer hardware lifespan and lower operational cost.