TL;DR — For write‑intensive distributed databases, the secret to low latency and high throughput lies in fine‑tuning LSM‑tree compaction policies, write‑ahead logging, and real‑time health metrics. A small set of production patterns—tiered compaction, adaptive level sizing, and automated back‑pressure—delivers measurable gains without sacrificing durability.
Write‑heavy workloads are the raison d’être of Log‑Structured Merge (LSM) trees, yet many teams treat them as a black box: “turn on RocksDB, it works.” In practice, the default settings in Cassandra, ScyllaDB, or HBase often lead to compaction storms, write amplification, and unpredictable latency. This article unpacks the anatomy of an LSM tree, maps it onto modern distributed architectures, and distills production‑grade patterns that keep the write path humming at millions of ops per second.
LSM Tree Fundamentals
An LSM tree replaces random‑write‑heavy on‑disk structures (B‑trees) with a series of sorted runs that are first written to a write‑ahead log (WAL) and then flushed to immutable files (SSTables) on disk. Over time, these runs are merged (compacted) into larger levels, reducing read amplification at the cost of write amplification.
| Component | Role | Typical Size |
|---|---|---|
| WAL | Guarantees durability before data hits the memtable | 64 KB – 1 MiB |
| Memtable | In‑memory sorted structure (often a skip list) | 64 MiB – 256 MiB |
| SSTable | Immutable sorted file on disk | 64 MiB – 1 GiB |
| Levels | Hierarchical groups of SSTables with size ratios (e.g., 10×) | Exponential growth |
The two main compaction strategies—Tiered and Leveled—are illustrated in the diagram below.
Tiered Compaction Leveled Compaction
┌───────┐ ┌───────┐
│ Mem │ → Flush → L0 │ Mem │ → Flush → L0
└───────┘ └───────┘
│ │
▼ ▼
┌───────┐ Merge ┌───────┐ Merge ┌───────┐
│ L0 A │ ───────► │ L1 A │ ───────► │ L2 A │
└───────┘ └───────┘ └───────┘
Tiered groups files of similar size and merges them in batches, while Leveled maintains a single‑file per level invariant, guaranteeing bounded read amplification but often increasing write amplification.
Architecture in Distributed Databases
Partitioning and Replication
Distributed systems shard data by partition key, assigning each partition to a replica set (e.g., three nodes). Each node runs an independent LSM instance, but coordination happens at two levels:
- Gossip‑based membership – ensures every node knows the current topology (e.g., Cassandra’s Gossip protocol).
- Write coordination – the client driver sends the write to a coordinator node, which forwards to the required replicas using a quorum policy.
Because each replica writes to its own WAL and memtable, write amplification multiplies by the replication factor. Therefore, any compaction optimization on a single node yields system‑wide latency benefits.
“Running a compaction on a single node while the other replicas idle can create a temporary hotspot that hurts quorum latency.” – Cassandra Architecture Guide
Tiered vs Leveled Compaction in Production
Both strategies have trade‑offs:
| Metric | Tiered | Leveled |
|---|---|---|
| Write Amplification | Low (≈ 2×) | High (≈ 10×) |
| Read Amplification | Higher (≈ 10×) | Low (≈ 1–2×) |
| Space Overhead | Moderate | Up to 2× data size |
| Compaction Frequency | Sporadic, large merges | Frequent, small merges |
In a write‑intensive scenario (e.g., telemetry ingestion at > 1 M ops/s), tiered compaction is usually the default because it limits write amplification. However, tiered compaction can produce large merges that spike CPU and I/O. The production pattern is to mix: use tiered for hot data and switch to leveled for warm data after a configurable age.
Example: RocksDB options for a mixed strategy
# rocksdb_options.yaml
disable_auto_compactions: false
# Tiered for the first two levels
level0_file_num_compaction_trigger: 4
level_compaction_dynamic_level_bytes: true
target_file_size_base: 67108864 # 64 MiB
max_bytes_for_level_base: 268435456 # 256 MiB
max_bytes_for_level_multiplier: 10
# Switch to leveled after level 2
level0_stop_writes_trigger: 12
The level_compaction_dynamic_level_bytes flag tells RocksDB to grow each level by the max_bytes_for_level_multiplier, effectively creating a hybrid tiered‑to‑leveled progression.
Performance Engineering
Write Path Optimizations
- Batching at the client – Group multiple mutations into a single RPC. Most drivers (e.g., DataStax Java driver) support
BatchStatement. Batching reduces per‑request overhead and lets the coordinator issue a singleINSERTper replica. - WAL buffering – Increase the WAL segment size to reduce fsync frequency. On Linux,
fsynccost dominates small‑batch writes. For Cassandra, setcommitlog_sync_batch_window_in_msto 10 ms. - Memtable sizing – Larger memtables delay flushes, cutting write amplification. Empirically, a 256 MiB memtable yields ~15 % lower total write I/O for a 10 TB dataset. Adjust per-node RAM:
memtable_cleanup_thresholdandmemtable_total_space_in_mb. - Compression choice – LZ4 offers low CPU overhead; ZSTD gives higher compression at modest latency cost. In a latency‑sensitive pipeline, configure
compression = LZ4for hot tables andcompression = ZSTDfor archival tables.
# Example Cassandra yaml snippet
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 10
memtable_total_space_in_mb: 4096
Read Path Trade‑offs
Even in a write‑heavy system, reads matter for consistency checks and admin queries. Two levers control read latency:
| Lever | Effect | Recommended Setting |
|---|---|---|
| Bloom filter size | Reduces false positives when scanning SSTables | bloom_filter_bits_per_key: 10 |
| Row cache | Serves hot rows from memory, bypasses disk | Enable on tables with > 90 % hit rate |
| Parallel read | Issues concurrent reads across multiple SSTables | concurrent_reads: 64 (tuned per‑CPU) |
A practical rule of thumb from the ScyllaDB performance guide: allocate 30 % of RAM to the row cache only if the cache hit ratio exceeds 80 % over a 5‑minute window. Otherwise, the memory is better spent on larger memtables.
Patterns in Production
Monitoring Compaction Health
Compaction can become a silent killer. The following metrics are essential:
CompactionPendingTasks– number of merges waiting in the queue. A sudden spike indicates back‑pressure.WriteAmplificationFactor– ratio of bytes written to disk vs bytes ingested. Target ≤ 3 for tiered, ≤ 10 for leveled.CPUUtilizationof compaction threads – keep under 70 % to leave headroom for writes.
Grafana dashboards that overlay these metrics with request latency surface correlations quickly. Example PromQL query for pending tasks:
sum by (instance) (cassandra_compaction_pending_tasks)
Automated Tier Management
A common production pattern is dynamic tier promotion:
- Hot tier (tiered compaction, 0‑2 days old) – low write amplification, high ingest throughput.
- Warm tier (leveled compaction, 2‑30 days) – balanced read/write cost.
- Cold tier (archival, compaction disabled, heavy compression) – minimal storage cost.
Automation can be driven by a simple cron job that inspects SSTable timestamps and flips the compaction_strategy_class per table. Below is a Python snippet using the Cassandra driver:
from cassandra.cluster import Cluster
cluster = Cluster(['10.0.0.1'])
session = cluster.connect()
def promote_table(keyspace, table):
# Switch to LeveledCompactionStrategy after 2 days
stmt = f"""
ALTER TABLE {keyspace}.{table}
WITH compaction = {{
'class': 'LeveledCompactionStrategy',
'enabled': 'true',
'sstable_age_in_days': '2'
}};
"""
session.execute(stmt)
# Example usage
promote_table('telemetry', 'events')
The job logs changes and alerts if the sstable_age_in_days threshold is not met within a configurable window.
Back‑Pressure and Flow Control
When compaction consumes > 80 % of I/O, writes start queuing in the memtable. Modern drivers expose write throttling callbacks. In Java, the DataStax driver lets you register a RequestThrottler:
Cluster.builder()
.addContactPoint("10.0.0.1")
.withRequestThrottler(new ConstantThrottler(5000)) // 5k req/s max
.build();
Coupled with a circuit breaker that watches CompactionPendingTasks, the system can automatically reduce client write rates, preventing tail‑latency spikes.
Key Takeaways
- Tiered compaction minimizes write amplification for hot data; switch to leveled for warm data to keep reads fast.
- Size memtables and WAL segments to match available RAM and SSD write latency; a 256 MiB memtable often yields 10‑15 % lower I/O.
- Monitor
CompactionPendingTasks,WriteAmplificationFactor, and CPU usage; set alerts at 70 % CPU and 100 pending tasks. - Automate tier promotion with scripts that adjust the compaction strategy based on SSTable age, keeping the system self‑balancing.
- Apply client‑side batching and driver‑level throttling to smooth write bursts and avoid compaction‑induced back‑pressure.
