TL;DR — LSM trees enable massive write throughput by buffering writes in memory and persisting them as immutable sorted segments that are merged in the background; careful compaction strategy selection is essential to keep read latency low in distributed databases.
Write‑intensive workloads have become the norm for modern services such as telemetry ingestion, real‑time analytics, and event sourcing. Traditional B‑tree storage engines choke under the pressure of millions of small writes per second, while Log‑Structured Merge (LSM) trees thrive by turning random writes into sequential appends. This article walks through the internals of LSM trees, shows how they are wired into a distributed database stack, and provides concrete guidance for engineers who need to design, tune, and operate a high‑throughput write path.
What Is an LSM Tree?
Core Concepts
An LSM tree is a multi‑level, write‑optimized data structure that stores data in a series of sorted string tables (SSTables). New records are first written to an in‑memory structure called a memtable. When the memtable reaches a configurable size, it is frozen, written to disk as an immutable SSTable, and a new memtable takes its place. Over time, many SSTables accumulate; background compaction jobs merge overlapping files, rewrite them into larger, cleaner levels, and discard obsolete key‑value pairs.
Key ingredients:
| Component | Role |
|---|---|
| Write‑Ahead Log (WAL) | Guarantees durability before the memtable is persisted. |
| Memtable | Fast, mutable, sorted structure (often a skip list or a balanced tree). |
| SSTable | Immutable, sorted file with an index, optional bloom filter, and data blocks. |
| Levels | Logical groups that control file size and overlap. |
| Compaction | Periodic merging that reduces read amplification and reclaims space. |
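To see how these pieces fit together before diving into each one, here is a compact sketch in Python; the class and method names are purely illustrative and do not correspond to any particular engine's API.

```python
# Minimal, illustrative skeleton of how the LSM components interact (all names are hypothetical).
class LSMEngine:
    def __init__(self, memtable_limit_bytes: int) -> None:
        self.wal = []                         # stand-in for an append-only log file on disk
        self.memtable = {}                    # real engines use a skip list or balanced tree
        self.levels = [[] for _ in range(7)]  # each level holds immutable, key-sorted runs
        self.memtable_limit = memtable_limit_bytes
        self.memtable_bytes = 0

    def put(self, key: bytes, value: bytes) -> None:
        self.wal.append((key, value))         # 1. durability
        self.memtable[key] = value            # 2. fast in-memory write
        self.memtable_bytes += len(key) + len(value)
        if self.memtable_bytes >= self.memtable_limit:
            self.flush()                      # 3. spill to disk when the memtable is full

    def flush(self) -> None:
        # Freeze the memtable as an immutable, key-sorted run in Level-0.
        self.levels[0].append(sorted(self.memtable.items()))
        self.memtable, self.memtable_bytes = {}, 0
        self.wal.clear()                      # flushed entries no longer need the WAL
```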
Historical Background
The LSM idea dates back to 1996, when Patrick O’Neil and colleagues introduced it as a solution for high‑throughput storage systems. Open‑source implementations such as LevelDB (Google, 2011) and RocksDB (Facebook, 2013) popularized the design, and many distributed databases—Cassandra, ScyllaDB, HBase—adopted a variant of the LSM engine as their primary storage layer.
Write Path in Detail
MemTable and WAL
When a client issues a PUT(key, value) request, the database first appends a log record to the WAL. This guarantees that, in the event of a crash, the operation can be replayed. Once the log record is durable, the key/value pair is inserted into the memtable, which maintains entries sorted by key.
def write(key: bytes, value: bytes):
    # 1️⃣ Append to WAL for durability
    wal.append_record(key, value)
    # 2️⃣ Insert into in‑memory sorted structure
    memtable.insert(key, value)
    # 3️⃣ Trigger flush if thresholds are exceeded
    if memtable.size_bytes() >= MEMTABLE_LIMIT:
        schedule_flush()
The WAL is typically a sequential file on a fast SSD or NVMe device, ensuring that the write latency is dominated by a single fsync‑level operation.
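As a sketch of that durability step, the following shows one plausible way to append a checksummed, length‑prefixed record and fsync it; the record layout is an assumption, not any specific engine's WAL format.

```python
import os
import struct
import zlib

class WriteAheadLog:
    """Append-only log: each record is CRC32 | key_len | value_len | key | value."""

    def __init__(self, path: str) -> None:
        self.file = open(path, "ab", buffering=0)  # unbuffered binary appends

    def append_record(self, key: bytes, value: bytes) -> None:
        payload = struct.pack("<II", len(key), len(value)) + key + value
        record = struct.pack("<I", zlib.crc32(payload)) + payload
        self.file.write(record)
        os.fsync(self.file.fileno())  # write-path latency is dominated by this call
```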
Immutable SSTables
When the memtable fills, the system performs a flush:
- The memtable is frozen and handed to a background thread.
- The frozen structure is iterated in key order.
- Data blocks are written to a new SSTable file (.sst), together with a sparse index and a bloom filter.
- The new file is placed in Level‑0, which allows overlapping files to simplify the flush logic.
Because SSTables are immutable, they can be safely read by any node in the cluster without acquiring locks, a property that dramatically simplifies replication.
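To make the flush concrete, here is a minimal sketch that writes a frozen memtable (exposed as key‑sorted pairs) into a data file plus a sparse index; the block size, file naming, and index encoding are illustrative choices, not a real on‑disk format.

```python
import json
from typing import Iterable, Tuple

BLOCK_SIZE = 4096  # illustrative; real engines use larger, configurable blocks

def flush_memtable(frozen_items: Iterable[Tuple[bytes, bytes]], path: str) -> None:
    """Write key-sorted (key, value) pairs to an SSTable-like file plus a sparse index."""
    sparse_index = []                       # (first key of each block as hex, byte offset)
    block = bytearray()
    with open(path + ".sst", "wb") as data:
        for key, value in frozen_items:     # iteration follows the memtable's key order
            if not block:                   # remember where each block starts
                sparse_index.append((key.hex(), data.tell()))
            block += len(key).to_bytes(4, "little") + key
            block += len(value).to_bytes(4, "little") + value
            if len(block) >= BLOCK_SIZE:
                data.write(block)
                block.clear()
        if block:
            data.write(block)
    # A real engine would also persist a bloom filter; here only the sparse index is written.
    with open(path + ".idx", "w") as idx:
        json.dump(sparse_index, idx)
```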
Flush and Turnover
Flushing is asynchronous to keep the write path latency low. A typical configuration limits the number of concurrent flushes to avoid saturating the I/O subsystem.
# Example: cap the CPU available to a dedicated flush service via systemd (service name is illustrative)
systemctl set-property rocksdb-flush.service CPUQuota=50%
The flush pipeline also updates the manifest, a small metadata file that records the current set of SSTables and their level assignments. In shared‑storage deployments, the manifest can also be replicated so that every node sees a consistent view of the data layout.
Read Path and Query Stitching
Bloom Filters
Each SSTable carries a bloom filter that quickly tells whether a key might be present in the file. During a read, the engine probes the bloom filters of Level‑0 files first; if the filter returns negative, the file is skipped, dramatically reducing disk I/O.
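A bloom filter itself is small enough to sketch directly; the double‑hashing construction below is a common textbook variant, not the exact filter any given engine ships.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 8192, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        for i in range(self.num_hashes):          # double hashing: h1 + i * h2
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```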
Level Searching
The read algorithm proceeds level by level:
- Level‑0: May contain overlapping files; the engine checks each candidate file from newest to oldest.
- Higher Levels (1, 2, …): Files are non‑overlapping, so a binary search on the index locates the exact file that could contain the key.
- The engine reads the data block, validates the key, and returns the value. If it finds a tombstone (deletion marker) instead, the key is treated as deleted and the search stops; only when the key is absent from a level does the search continue to the next one (see the sketch below).
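The lookup logic above, including the tombstone rule, can be sketched as follows; TOMBSTONE and lookup_in_sstable are placeholders for whatever representation the engine actually uses.

```python
from typing import Optional

TOMBSTONE = b"\x00__deleted__"   # placeholder sentinel for deletion markers

def get(key: bytes, memtable: dict, levels: list) -> Optional[bytes]:
    """Point lookup: memtable first, then Level-0 newest-to-oldest, then higher levels."""
    if key in memtable:
        value = memtable[key]
        return None if value == TOMBSTONE else value

    for level_no, files in enumerate(levels):
        # Level-0 files may overlap, so scan them newest first; real engines
        # binary-search the non-overlapping higher levels for the single candidate file.
        candidates = reversed(files) if level_no == 0 else files
        for sstable in candidates:
            value = lookup_in_sstable(sstable, key)   # bloom filter + sparse index probe (not shown)
            if value is None:
                continue                              # key not in this file, keep searching
            return None if value == TOMBSTONE else value
    return None                                       # key was never written (or compacted away)
```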
Merged Reads
For range queries (SCAN(start, end)), the engine merges iterators from each level, akin to a k‑way merge. This process is efficient because each iterator yields keys in sorted order, and the merge can stop early once the desired range is exhausted.
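Python's heapq.merge expresses this k‑way merge compactly, assuming each level exposes a key‑ordered iterator over the requested range and newer sources are listed first so the freshest version of a duplicated key wins.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Entry = Tuple[bytes, bytes]  # (key, value)

def scan(sources: Iterable[Iterator[Entry]], start: bytes, end: bytes) -> Iterator[Entry]:
    """k-way merge of per-level iterators; pass newer sources first so they win on duplicate keys."""
    merged = heapq.merge(*sources, key=lambda kv: kv[0])
    previous_key = None
    for key, value in merged:
        if key < start:
            continue
        if key >= end:
            break                  # every source is sorted, so nothing later can be in range
        if key == previous_key:
            continue               # an older version of a key that has already been emitted
        previous_key = key
        # A real engine would also drop tombstoned keys here instead of yielding them.
        yield key, value
```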
Compaction Strategies
Compaction is the heart of LSM performance. Different strategies trade off write amplification, read amplification, and space amplification.
Leveled Compaction (LC)
- Structure: Each level L_i holds roughly T * k^i bytes (where T is a base size, e.g., 10 MiB, and k is a growth factor, commonly 10). Files within a level never overlap (see the sizing sketch after this list).
- Process: When Level‑0 accumulates enough files, they are merged into Level‑1, and so on.
- Pros: Low read amplification (at most one file per level). Predictable latency.
- Cons: Higher write amplification due to repeated rewrites across many levels.
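A rough sketch of how those size targets drive compaction decisions; the 10× growth factor and the Level‑0 file‑count trigger are typical defaults rather than any engine's exact policy.

```python
BASE_LEVEL_BYTES = 10 * 1024 * 1024   # T: target size of Level-1 (illustrative)
GROWTH_FACTOR = 10                    # k: each level holds roughly 10x the previous one

def level_target_bytes(level: int) -> int:
    """Target total size of level L_i (for levels >= 1)."""
    return BASE_LEVEL_BYTES * GROWTH_FACTOR ** (level - 1)

def pick_compaction(level_sizes: list, l0_file_count: int, l0_trigger: int = 4):
    """Return the level that most urgently needs compacting into the next level, if any."""
    if l0_file_count >= l0_trigger:
        return 0                                   # too many overlapping Level-0 files
    scores = {
        level: size / level_target_bytes(level)    # > 1.0 means the level is over target
        for level, size in enumerate(level_sizes[1:], start=1)
    }
    worst = max(scores, key=scores.get, default=None)
    return worst if worst is not None and scores[worst] > 1.0 else None
```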
Tiered Compaction (TC)
- Structure: Levels are groups of files that may overlap; compaction merges a tier of similarly sized files into a larger one.
- Pros: Lower write amplification, good for write‑heavy workloads.
- Cons: Higher read amplification because multiple overlapping files may need to be consulted.
Universal Compaction (UC)
- Structure: Files are compacted based on age and size thresholds, regardless of level.
- Pros: Excellent write throughput for append‑only workloads; minimal configuration.
- Cons: Can lead to high space amplification if deletions are frequent.
Choosing the Right Strategy
| Workload | Desired Metric | Recommended Compaction |
|---|---|---|
| High write, moderate reads | Low write amplification | Tiered |
| Write‑heavy with latency‑sensitive reads | Predictable read latency | Leveled |
| Mostly inserts, few deletes | Max throughput | Universal |
| Mixed OLTP/OLAP | Balanced | Hybrid (Leveled + Tiered) |
Most production systems expose this choice via a configuration flag, e.g., rocksdb.compaction_style=leveled.
Distributed Database Integration
Sharding and Partitioning
In a distributed setting, the LSM engine lives per shard (or per node). The cluster’s partitioning scheme (hash‑based, range‑based, or token‑aware) determines which keys map to which LSM instance. For range‑sharded systems, each node can serve a contiguous key space, simplifying compaction coordination.
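Routing a key to its LSM instance can be sketched with either scheme; the shard count and split points below are purely illustrative.

```python
import bisect
import hashlib

def hash_shard(key: bytes, num_shards: int) -> int:
    """Hash partitioning: spreads keys uniformly, but range scans touch every shard."""
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % num_shards

def range_shard(key: bytes, split_points: list) -> int:
    """Range partitioning: each shard owns a contiguous key span, so scans stay local."""
    return bisect.bisect_right(split_points, key)

# Illustrative routing: 16 hash shards, or range shards split at "g" and "p".
shard_id = hash_shard(b"user:42", 16)
range_id = range_shard(b"orders:2024", [b"g", b"p"])
```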
Replication and Consistency
Replication is usually log‑based: the primary node writes to its WAL, then streams the same log entries to replicas. Because SSTables are immutable, replicas can safely apply the same writes in the same order without conflict. Consistency models (strong, eventual) are enforced at the request‑routing layer, not within the LSM engine itself.
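A minimal sketch of that log‑shipping idea, assuming the primary stamps each WAL record with a monotonically increasing sequence number and replicas apply records strictly in order; the record shape and transport are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    sequence: int     # assigned by the primary, strictly increasing
    key: bytes
    value: bytes

class Replica:
    def __init__(self, engine) -> None:
        self.engine = engine          # the local LSM engine for this shard
        self.applied_up_to = 0        # highest sequence number applied so far

    def apply(self, record: LogRecord) -> None:
        if record.sequence <= self.applied_up_to:
            return                    # duplicate delivery; already applied
        if record.sequence != self.applied_up_to + 1:
            raise RuntimeError("gap in the replication stream; request a resend")
        self.engine.put(record.key, record.value)   # same write, same order as the primary
        self.applied_up_to = record.sequence
```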
Coordinated Compaction
Compaction can be local (each node compacts its own files) or coordinated (a cluster controller schedules compactions to avoid I/O spikes across the fleet). Coordinated compaction is especially important for cloud deployments where network‑attached storage may become a bottleneck.
# Example compaction coordinator policy (pseudo‑YAML)
compaction:
  policy: "balanced"
  max_concurrent_per_node: 2
  global_rate_limit_mb: 500
  node_weight:
    high_cpu: 0.7
    high_iops: 0.3
Practical Implementation Checklist
- Durability
  - Enable WAL with fsync on every commit or batch of commits.
  - Replicate WAL entries to at least one follower.
- Memory Management
  - Size the memtable to fit in available RAM (typically 10–30 % of total RAM).
  - Allocate separate pools for indexes and data blocks to avoid fragmentation.
- File Layout
  - Place Level‑0 SSTables on fast NVMe media.
  - Store higher‑level SSTables on cheaper, higher‑capacity SSDs, or even HDDs if the latency is tolerable.
- Compaction Tuning
  - Choose a compaction style based on the workload (see the table above).
  - Set max_background_compactions to avoid CPU starvation.
- Monitoring
  - Track write amplification (bytes_written / bytes_ingested).
  - Observe read amplification (files_per_read).
  - Alert on stalled flushes or long compaction queues (a small metrics sketch follows this checklist).
- Testing
  - Run YCSB or a custom benchmark that simulates peak write rates.
  - Verify correctness after abrupt crashes (replay the WAL, compare checksums).
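The amplification metrics in the monitoring items reduce to simple ratios over counters most engines already expose; the counter names here are hypothetical.

```python
def write_amplification(bytes_written_to_disk: int, bytes_ingested_by_clients: int) -> float:
    """Bytes that hit disk (WAL + flushes + compaction rewrites) per byte the client wrote."""
    return bytes_written_to_disk / max(bytes_ingested_by_clients, 1)

def read_amplification(files_probed: int, reads_served: int) -> float:
    """Average number of SSTables consulted per point lookup."""
    return files_probed / max(reads_served, 1)

# Example: 3 TiB written to disk for 500 GiB ingested -> write amplification of about 6.1
print(write_amplification(3 * 2**40, 500 * 2**30))
```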
Performance Tuning
Write Buffer Sizing
Increasing the memtable size reduces the frequency of flushes, which lowers write amplification. However, larger buffers increase recovery time after a crash because more data resides only in the WAL.
# RocksDB example: set 256 MiB memtable size
rocksdb --write_buffer_size=268435456
Parallel Compaction
Modern CPUs can handle multiple compactions in parallel. Set max_background_compactions to a value that leaves enough cores free for request handling; on a dedicated database host, roughly half of the physical cores is a common starting point.
# Enable 8 parallel compactions on a 16‑core machine
rocksdb --max_background_compactions=8
Hardware Considerations
| Component | Recommendation |
|---|---|
| CPU | High single‑thread performance for WAL appends; many cores for parallel compaction. |
| RAM | At least 2× the total memtable size to accommodate OS page cache. |
| Storage | NVMe for Level‑0, high‑throughput SSD for Levels 1‑3; avoid HDD for hot data. |
| Network | Low‑latency, high‑throughput links for log replication (e.g., 10 GbE or better). |
Example Configuration Snippet (RocksDB)
# rocksdb.conf
db_path: "/var/lib/mydb"
wal_dir: "/var/lib/mydb/wal"
max_open_files: 1000
write_buffer_size: 256MiB
max_write_buffer_number: 4
target_file_size_base: 64MiB
level0_file_num_compaction_trigger: 8
max_background_flushes: 2
max_background_compactions: 8
compaction_style: "leveled"
Key Takeaways
- LSM trees transform random writes into sequential appends, delivering orders‑of‑magnitude higher write throughput than B‑tree engines.
- The write path consists of WAL → memtable → immutable SSTable; each step is designed for durability and lock‑free reads.
- Compaction strategy (leveled, tiered, universal) is the primary lever for balancing write amplification, read amplification, and space usage.
- In a distributed database, LSM engines are scoped per shard; replication streams the same WAL entries, while coordinated compaction prevents cluster‑wide I/O spikes.
- Proper sizing of memtables, background compaction threads, and storage tiering is essential for predictable latency under heavy load.
- Continuous monitoring of write/read amplification and compaction queues is crucial for early detection of performance degradation.
Further Reading
- LevelDB Design Document – original LSM implementation and design rationale.
- RocksDB Documentation – detailed guide on configuration, compaction styles, and performance tuning.
- Apache Cassandra Architecture Overview – how Cassandra integrates LSM trees at scale.
- The Log‑Structured Merge‑Tree (LSM‑Tree) paper (O’Neil et al., 1996) – foundational academic description.
- ScyllaDB – How Compaction Works – practical insights into coordinated compaction in a distributed setting.