TL;DR — LSM trees enable massive write throughput by buffering writes in memory and persisting them as immutable sorted segments that are merged in the background; careful compaction strategy selection is essential to keep read latency low in distributed databases.
Write‑intensive workloads have become the norm for modern services such as telemetry ingestion, real‑time analytics, and event sourcing. Traditional B‑tree storage engines choke under the pressure of millions of small writes per second, while Log‑Structured Merge (LSM) trees thrive by turning random writes into sequential appends. This article walks through the internals of LSM trees, shows how they are wired into a distributed database stack, and provides concrete guidance for engineers who need to design, tune, and operate a high‑throughput write path.
What Is an LSM Tree?
Core Concepts
An LSM tree is a multi‑level, write‑optimized data structure that stores data in a series of sorted string tables (SSTables). New records are first written to an in‑memory structure called a memtable. When the memtable reaches a configurable size, it is frozen, written to disk as an immutable SSTable, and a new memtable takes its place. Over time, many SSTables accumulate; background compaction jobs merge overlapping files, rewrite them into larger, cleaner levels, and discard obsolete key‑value pairs.
Key ingredients:
| Component | Role |
|---|---|
| Write‑Ahead Log (WAL) | Guarantees durability before the memtable is persisted. |
| Memtable | Fast, mutable, sorted structure (often a skip list or a balanced tree). |
| SSTable | Immutable, sorted file with an index, optional bloom filter, and data blocks. |
| Levels | Logical groups that control file size and overlap. |
| Compaction | Periodic merging that reduces read amplification and reclaims space. |
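To see how these pieces fit together before diving into each one, here is a compact sketch in Python; the class and method names are purely illustrative and do not correspond to any particular engine's API.

```python
# Minimal, illustrative skeleton of how the LSM components interact (all names are hypothetical).
class LSMEngine:
    def __init__(self, memtable_limit_bytes: int) -> None:
        self.wal = []                         # stand-in for an append-only log file on disk
        self.memtable = {}                    # real engines use a skip list or balanced tree
        self.levels = [[] for _ in range(7)]  # each level holds immutable, key-sorted runs
        self.memtable_limit = memtable_limit_bytes
        self.memtable_bytes = 0

    def put(self, key: bytes, value: bytes) -> None:
        self.wal.append((key, value))         # 1. durability
        self.memtable[key] = value            # 2. fast in-memory write
        self.memtable_bytes += len(key) + len(value)
        if self.memtable_bytes >= self.memtable_limit:
            self.flush()                      # 3. spill to disk when the memtable is full

    def flush(self) -> None:
        # Freeze the memtable as an immutable, key-sorted run in Level-0.
        self.levels[0].append(sorted(self.memtable.items()))
        self.memtable, self.memtable_bytes = {}, 0
        self.wal.clear()                      # flushed entries no longer need the WAL
```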
Historical Background
The LSM idea dates back to 1996, when Patrick O’Neil and colleagues introduced it as a solution for high‑throughput storage systems. Open‑source implementations such as LevelDB (Google, 2011) and RocksDB (Facebook, 2013) popularized the design, and many distributed databases—Cassandra, ScyllaDB, HBase—adopted a variant of the LSM engine as their primary storage layer.
Write Path in Detail
MemTable and WAL
When a client issues a PUT(key, value) request, the database first appends a log record to the WAL. This guarantees that, in the event of a crash, the operation can be replayed. Once the log record is durable, the key/value pair is inserted into the memtable, which maintains entries sorted by key.
def write(key: bytes, value: bytes):
    # 1️⃣ Append to WAL for durability
    wal.append_record(key, value)
    # 2️⃣ Insert into in‑memory sorted structure
    memtable.insert(key, value)
    # 3️⃣ Trigger flush if thresholds are exceeded
    if memtable.size_bytes() >= MEMTABLE_LIMIT:
        schedule_flush()
The WAL is typically a sequential file on a fast SSD or NVMe device, ensuring that the write latency is dominated by a single fsync‑level operation.
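As a sketch of that durability step, the following shows one plausible way to append a checksummed, length‑prefixed record and fsync it; the record layout is an assumption, not any specific engine's WAL format.

```python
import os
import struct
import zlib

class WriteAheadLog:
    """Append-only log: each record is CRC32 | key_len | value_len | key | value."""

    def __init__(self, path: str) -> None:
        self.file = open(path, "ab", buffering=0)  # unbuffered binary appends

    def append_record(self, key: bytes, value: bytes) -> None:
        payload = struct.pack("<II", len(key), len(value)) + key + value
        record = struct.pack("<I", zlib.crc32(payload)) + payload
        self.file.write(record)
        os.fsync(self.file.fileno())  # write-path latency is dominated by this call
```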
Immutable SSTables
When the memtable fills, the system performs a flush:
- The memtable is frozen and handed to a background thread.
- The frozen structure is iterated in key order.
- Data blocks are written to a new SSTable file (.sst), together with a sparse index and a bloom filter.
- The new file is placed in Level‑0, which allows overlapping files to simplify the flush logic.
Because SSTables are immutable, they can be safely read by any node in the cluster without acquiring locks, a property that dramatically simplifies replication.
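To make the flush concrete, here is a minimal sketch that writes a frozen memtable (exposed as key‑sorted pairs) into a data file plus a sparse index; the block size, file naming, and index encoding are illustrative choices, not a real on‑disk format.

```python
import json
from typing import Iterable, Tuple

BLOCK_SIZE = 4096  # illustrative; real engines use larger, configurable blocks

def flush_memtable(frozen_items: Iterable[Tuple[bytes, bytes]], path: str) -> None:
    """Write key-sorted (key, value) pairs to an SSTable-like file plus a sparse index."""
    sparse_index = []                       # (first key of each block as hex, byte offset)
    block = bytearray()
    with open(path + ".sst", "wb") as data:
        for key, value in frozen_items:     # iteration follows the memtable's key order
            if not block:                   # remember where each block starts
                sparse_index.append((key.hex(), data.tell()))
            block += len(key).to_bytes(4, "little") + key
            block += len(value).to_bytes(4, "little") + value
            if len(block) >= BLOCK_SIZE:
                data.write(block)
                block.clear()
        if block:
            data.write(block)
    # A real engine would also persist a bloom filter; here only the sparse index is written.
    with open(path + ".idx", "w") as idx:
        json.dump(sparse_index, idx)
```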
Flush and Turnover
Flushing is asynchronous to keep the write path latency low. A typical configuration limits the number of concurrent flushes to avoid saturating the I/O subsystem.
# Example: cap the CPU available to a dedicated flush service via systemd (service name is illustrative)
systemctl set-property rocksdb-flush.service CPUQuota=50%
The flush pipeline also updates the manifest, a small metadata file that records the current set of SSTables and their level assignments. In shared‑storage deployments, the manifest can also be replicated so that every node sees a consistent view of the data layout.
Read Path and Query Stitching
Bloom Filters
Each SSTable carries a bloom filter that quickly tells whether a key might be present in the file. During a read, the engine probes the bloom filters of Level‑0 files first; if the filter returns negative, the file is skipped, dramatically reducing disk I/O.
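A bloom filter itself is small enough to sketch directly; the double‑hashing construction below is a common textbook variant, not the exact filter any given engine ships.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 8192, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        for i in range(self.num_hashes):          # double hashing: h1 + i * h2
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```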
Level Searching
The read algorithm proceeds level by level:
- Level‑0: May contain overlapping files; the engine checks each candidate file from newest to oldest.
- Higher Levels (1, 2, …): Files are non‑overlapping, so a binary search on the index locates the exact file that could contain the key.
- The engine reads the data block, validates the key, and returns the value. If it finds a tombstone (deletion marker) instead, the key is treated as deleted and the search stops; only when the key is absent from a level does the search continue to the next one (see the sketch below).
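The lookup logic above, including the tombstone rule, can be sketched as follows; TOMBSTONE and lookup_in_sstable are placeholders for whatever representation the engine actually uses.

```python
from typing import Optional

TOMBSTONE = b"\x00__deleted__"   # placeholder sentinel for deletion markers

def get(key: bytes, memtable: dict, levels: list) -> Optional[bytes]:
    """Point lookup: memtable first, then Level-0 newest-to-oldest, then higher levels."""
    if key in memtable:
        value = memtable[key]
        return None if value == TOMBSTONE else value

    for level_no, files in enumerate(levels):
        # Level-0 files may overlap, so scan them newest first; real engines
        # binary-search the non-overlapping higher levels for the single candidate file.
        candidates = reversed(files) if level_no == 0 else files
        for sstable in candidates:
            value = lookup_in_sstable(sstable, key)   # bloom filter + sparse index probe (not shown)
            if value is None:
                continue                              # key not in this file, keep searching
            return None if value == TOMBSTONE else value
    return None                                       # key was never written (or compacted away)
```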
Merged Reads
For range queries (SCAN(start, end)), the engine merges iterators from each level, akin to a k‑way merge. This process is efficient because each iterator yields keys in sorted order, and the merge can stop early once the desired range is exhausted.
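Python's heapq.merge expresses this k‑way merge compactly, assuming each level exposes a key‑ordered iterator over the requested range and newer sources are listed first so the freshest version of a duplicated key wins.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Entry = Tuple[bytes, bytes]  # (key, value)

def scan(sources: Iterable[Iterator[Entry]], start: bytes, end: bytes) -> Iterator[Entry]:
    """k-way merge of per-level iterators; pass newer sources first so they win on duplicate keys."""
    merged = heapq.merge(*sources, key=lambda kv: kv[0])
    previous_key = None
    for key, value in merged:
        if key < start:
            continue
        if key >= end:
            break                  # every source is sorted, so nothing later can be in range
        if key == previous_key:
            continue               # an older version of a key that has already been emitted
        previous_key = key
        # A real engine would also drop tombstoned keys here instead of yielding them.
        yield key, value
```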
Compaction Strategies
Compaction is the heart of LSM performance. Different strategies trade off write amplification, read amplification, and space amplification.
Leveled Compaction (LC)
- Structure: Each level L_i holds roughly T * k^i bytes (where T is a base size, e.g., 10 MiB, and k is a growth factor, commonly 10). Files within a level never overlap (see the sizing sketch after this list).
- Process: When Level‑0 accumulates enough files, they are merged into Level‑1, and so on.
- Pros: Low read amplification (at most one file per level). Predictable latency.
- Cons: Higher write amplification due to repeated rewrites across many levels.
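A rough sketch of how those size targets drive compaction decisions; the 10× growth factor and the Level‑0 file‑count trigger are typical defaults rather than any engine's exact policy.

```python
BASE_LEVEL_BYTES = 10 * 1024 * 1024   # T: target size of Level-1 (illustrative)
GROWTH_FACTOR = 10                    # k: each level holds roughly 10x the previous one

def level_target_bytes(level: int) -> int:
    """Target total size of level L_i (for levels >= 1)."""
    return BASE_LEVEL_BYTES * GROWTH_FACTOR ** (level - 1)

def pick_compaction(level_sizes: list, l0_file_count: int, l0_trigger: int = 4):
    """Return the level that most urgently needs compacting into the next level, if any."""
    if l0_file_count >= l0_trigger:
        return 0                                   # too many overlapping Level-0 files
    scores = {
        level: size / level_target_bytes(level)    # > 1.0 means the level is over target
        for level, size in enumerate(level_sizes[1:], start=1)
    }
    worst = max(scores, key=scores.get, default=None)
    return worst if worst is not None and scores[worst] > 1.0 else None
```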
Tiered Compaction (TC)
- Structure: Levels are groups of files that may overlap; compaction merges a tier of similarly sized files into a larger one.
- Pros: Lower write amplification, good for write‑heavy workloads.
- Cons: Higher read amplification because multiple overlapping files may need to be consulted.
Universal Compaction (UC)
- Structure: Files are compacted based on age and size thresholds, regardless of level.
- Pros: Excellent write throughput for append‑only workloads; minimal configuration.
- Cons: Can lead to high space amplification if deletions are frequent.
Choosing the Right Strategy
| Workload | Desired Metric | Recommended Compaction |
|---|---|---|
| High write, moderate reads | Low write amplification | Tiered |
| Write‑heavy with latency‑sensitive reads | Predictable read latency | Leveled |
| Mostly inserts, few deletes | Max throughput | Universal |
| Mixed OLTP/OLAP | Balanced | Hybrid (Leveled + Tiered) |
Most production systems expose this choice via a configuration flag, e.g., rocksdb.compaction_style=leveled.
Distributed Database Integration
Sharding and Partitioning
In a distributed setting, the LSM engine lives per shard (or per node). The cluster’s partitioning scheme (hash‑based, range‑based, or token‑aware) determines which keys map to which LSM instance. For range‑sharded systems, each node can serve a contiguous key space, simplifying compaction coordination.
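Routing a key to its LSM instance can be sketched with either scheme; the shard count and split points below are purely illustrative.

```python
import bisect
import hashlib

def hash_shard(key: bytes, num_shards: int) -> int:
    """Hash partitioning: spreads keys uniformly, but range scans touch every shard."""
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % num_shards

def range_shard(key: bytes, split_points: list) -> int:
    """Range partitioning: each shard owns a contiguous key span, so scans stay local."""
    return bisect.bisect_right(split_points, key)

# Illustrative routing: 16 hash shards, or range shards split at "g" and "p".
shard_id = hash_shard(b"user:42", 16)
range_id = range_shard(b"orders:2024", [b"g", b"p"])
```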
Replication and Consistency
Replication is usually log‑based: the primary node writes to its WAL, then streams the same log entries to replicas. Because SSTables are immutable, replicas can safely apply the same writes in the same order without conflict. Consistency models (strong, eventual) are enforced at the request‑routing layer, not within the LSM engine itself.
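A minimal sketch of that log‑shipping idea, assuming the primary stamps each WAL record with a monotonically increasing sequence number and replicas apply records strictly in order; the record shape and transport are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    sequence: int     # assigned by the primary, strictly increasing
    key: bytes
    value: bytes

class Replica:
    def __init__(self, engine) -> None:
        self.engine = engine          # the local LSM engine for this shard
        self.applied_up_to = 0        # highest sequence number applied so far

    def apply(self, record: LogRecord) -> None:
        if record.sequence <= self.applied_up_to:
            return                    # duplicate delivery; already applied
        if record.sequence != self.applied_up_to + 1:
            raise RuntimeError("gap in the replication stream; request a resend")
        self.engine.put(record.key, record.value)   # same write, same order as the primary
        self.applied_up_to = record.sequence
```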
Coordinated Compaction
Compaction can be local (each node compacts its own files) or coordinated (a cluster controller schedules compactions to avoid I/O spikes across the fleet). Coordinated compaction is especially important for cloud deployments where network‑attached storage may become a bottleneck.
# Example compaction coordinator policy (pseudo‑YAML)
compaction:
  policy: "balanced"
  max_concurrent_per_node: 2
  global_rate_limit_mb: 500
  node_weight:
    high_cpu: 0.7
    high_iops: 0.3
Practical Implementation Checklist
- Durability
  - Enable WAL with fsync on every commit or batch of commits.
  - Replicate WAL entries to at least one follower.
- Memory Management
  - Size the memtable to fit in available RAM (typically 10–30 % of total RAM).
  - Allocate separate pools for indexes and data blocks to avoid fragmentation.
- File Layout
  - Place Level‑0 SSTables on fast NVMe media.
  - Store higher‑level SSTables on cheaper, higher‑capacity SSDs, or even HDDs if the latency is tolerable.
- Compaction Tuning
  - Choose a compaction style based on the workload (see the table above).
  - Set max_background_compactions to avoid CPU starvation.
- Monitoring
  - Track write amplification (bytes_written / bytes_ingested).
  - Observe read amplification (files_per_read).
  - Alert on stalled flushes or long compaction queues (a small metrics sketch follows this checklist).
- Testing
  - Run YCSB or a custom benchmark that simulates peak write rates.
  - Verify correctness after abrupt crashes (replay the WAL, compare checksums).
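The amplification metrics in the monitoring items reduce to simple ratios over counters most engines already expose; the counter names here are hypothetical.

```python
def write_amplification(bytes_written_to_disk: int, bytes_ingested_by_clients: int) -> float:
    """Bytes that hit disk (WAL + flushes + compaction rewrites) per byte the client wrote."""
    return bytes_written_to_disk / max(bytes_ingested_by_clients, 1)

def read_amplification(files_probed: int, reads_served: int) -> float:
    """Average number of SSTables consulted per point lookup."""
    return files_probed / max(reads_served, 1)

# Example: 3 TiB written to disk for 500 GiB ingested -> write amplification of about 6.1
print(write_amplification(3 * 2**40, 500 * 2**30))
```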
Performance Tuning
Write Buffer Sizing
Increasing the memtable size reduces the frequency of flushes, which lowers write amplification. However, larger buffers increase recovery time after a crash because more data resides only in the WAL.
# RocksDB example: set 256 MiB memtable size
rocksdb --write_buffer_size=268435456
Parallel Compaction
Modern CPUs can handle multiple compactions in parallel. Set max_background_compactions to a value that leaves enough cores free for request handling; on a dedicated database host, roughly half of the physical cores is a common starting point.
# Enable 8 parallel compactions on a 16‑core machine
rocksdb --max_background_compactions=8
Hardware Considerations
| Component | Recommendation |
|---|---|
| CPU | High single‑thread performance for WAL appends; many cores for parallel compaction. |
| RAM | At least 2× the total memtable size to accommodate OS page cache. |
| Storage | NVMe for Level‑0, high‑throughput SSD for Levels 1‑3; avoid HDD for hot data. |
| Network | Low‑latency, high‑throughput links for log replication (e.g., 10 GbE or better). |
Example Configuration Snippet (RocksDB)
# rocksdb.conf
db_path: "/var/lib/mydb"
wal_dir: "/var/lib/mydb/wal"
max_open_files: 1000
write_buffer_size: 256MiB
max_write_buffer_number: 4
target_file_size_base: 64MiB
level0_file_num_compaction_trigger: 8
max_background_flushes: 2
max_background_compactions: 8
compaction_style: "leveled"
Key Takeaways
- LSM trees transform random writes into sequential appends, delivering orders‑of‑magnitude higher write throughput than B‑tree engines.
- The write path consists of WAL → memtable → immutable SSTable; each step is designed for durability and lock‑free reads.
- Compaction strategy (leveled, tiered, universal) is the primary lever for balancing write amplification, read amplification, and space usage.
- In a distributed database, LSM engines are scoped per shard; replication streams the same WAL entries, while coordinated compaction prevents cluster‑wide I/O spikes.
- Proper sizing of memtables, background compaction threads, and storage tiering is essential for predictable latency under heavy load.
- Continuous monitoring of write/read amplification and compaction queues is crucial for early detection of performance degradation.
Further Reading
- LevelDB Design Document – original LSM implementation and design rationale.
- RocksDB Documentation – detailed guide on configuration, compaction styles, and performance tuning.
- Apache Cassandra Architecture Overview – how Cassandra integrates LSM trees at scale.
- The Log‑Structured Merge‑Tree (LSM‑Tree) paper (O’Neil et al., 1996) – foundational academic description.
- ScyllaDB – How Compaction Works – practical insights into coordinated compaction in a distributed setting.