TL;DR: Leveled compaction gives predictable read latency for hot workloads, while tiered compaction maximizes write throughput for append‑only streams. By profiling your write pattern, sizing write buffers, and tuning target_file_size_base and max_bytes_for_level_base, you can achieve about 2× higher write QPS on the same hardware (2.1× in the benchmark below).
RocksDB has become the go‑to embedded store for high‑performance services ranging from Kafka Streams state stores to Facebook’s social graph. Yet many teams hit a wall when their write‑heavy pipelines start stalling or their read latency spikes. The root cause is almost always compaction: how the engine reorganizes immutable SST files on disk. This post walks through the two primary compaction modes (Leveled and Tiered), explains the underlying architecture, and delivers a checklist of concrete knobs you can turn in a production environment.
Understanding RocksDB’s Compaction Basics
RocksDB stores data in a Log‑Structured Merge (LSM) tree. Writes land in an in‑memory memtable, are flushed to immutable SST files, and later merged (compacted) to keep read paths short. Compaction is why RocksDB can sustain millions of writes per second, but it also consumes CPU, I/O, and space.
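The write path described above can be illustrated with a toy model. The sketch below is not RocksDB code: ToyLSM, memtable_limit, and the list‑based "SST files" are hypothetical stand‑ins, chosen only to show how flushes create immutable sorted files, how reads check the newest data first, and how compaction merges files.

```python
from bisect import bisect_left


class ToyLSM:
    """Toy LSM tree: one mutable memtable plus immutable sorted files."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (entries)
        self.memtable = {}
        self.sst_files = []  # newest first; each file is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into an immutable sorted file."""
        if self.memtable:
            self.sst_files.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sst in self.sst_files:  # newest file wins
            i = bisect_left(sst, (key,))
            if i < len(sst) and sst[i][0] == key:
                return sst[i][1]
        return None

    def compact(self):
        """Merge all files into one, keeping only the newest value per key."""
        merged = {}
        for sst in reversed(self.sst_files):  # oldest first, so newer values overwrite
            merged.update(sst)
        self.sst_files = [sorted(merged.items())] if merged else []


db = ToyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)
print(len(db.sst_files), db.get("a"))
db.compact()
print(len(db.sst_files), db.get("a"))
```

Before compaction a read for "a" may have to look in two files; afterwards a single file holds the newest value for every key, which is the read‑path shortening that compaction buys.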
Key concepts:
| Term | Meaning |
|---|---|
| Level | Logical layer of SST files. Level‑0 allows overlapping files; higher levels enforce non‑overlap. |
| Target File Size | Desired size for each SST after compaction (target_file_size_base). |
| Compaction Trigger | Conditions that start a compaction (e.g., too many Level‑0 files, or a level exceeding its size target). |
| Write Amplification | Ratio of total bytes written to storage vs. bytes supplied by the application. |
| Read Amplification | Number of SST files a read must probe on average. |
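The two amplification metrics in the table are plain ratios. A minimal sketch of how they might be computed from counters you already collect; the function names and the sample numbers are illustrative, not RocksDB APIs or measurements.

```python
def write_amplification(bytes_written_to_storage, bytes_from_application):
    """Total bytes written by flushes and compactions per byte the application wrote."""
    if bytes_from_application <= 0:
        raise ValueError("bytes_from_application must be positive")
    return bytes_written_to_storage / bytes_from_application


def read_amplification(files_probed_per_read):
    """Average number of SST files probed across a sample of reads."""
    if not files_probed_per_read:
        raise ValueError("need at least one sampled read")
    return sum(files_probed_per_read) / len(files_probed_per_read)


# Hypothetical sample: the application wrote 10 GiB, storage absorbed 29 GiB.
print(write_amplification(29, 10))
# Hypothetical sample: four reads that probed 1, 2, 2, and 3 files.
print(read_amplification([1, 2, 2, 3]))
```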
Two compaction strategies implement these concepts differently.
Leveled Compaction (LC)
Leveled compaction (the default) maintains a series of levels where each level’s total size is roughly 10× the size of the previous one (configurable via max_bytes_for_level_multiplier). Files in Level‑1 and above never overlap within a level, so a point query checks at most one file per level, plus any Level‑0 files whose key ranges cover the key.
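To make the "one file per level" property concrete, here is a small sketch that lists which files a point lookup would have to consider. The file names and key ranges are made up, and real RocksDB also consults Bloom filters before touching a file, which this sketch ignores.

```python
def files_to_probe(key, l0_files, sorted_levels):
    """Return the names of files whose key range covers `key`.

    l0_files: list of (name, smallest_key, largest_key); ranges may overlap.
    sorted_levels: list of levels (L1, L2, ...), each a list of
        non-overlapping (name, smallest_key, largest_key) tuples.
    """
    # Level-0 files can overlap, so every matching file must be checked.
    hits = [name for name, lo, hi in l0_files if lo <= key <= hi]
    for level in sorted_levels:
        for name, lo, hi in level:
            if lo <= key <= hi:
                hits.append(name)
                break  # non-overlapping ranges: at most one file per level
    return hits


l0 = [("l0_a", "a", "m"), ("l0_b", "f", "z")]
levels = [
    [("l1_a", "a", "h"), ("l1_b", "i", "z")],
    [("l2_a", "a", "c"), ("l2_b", "p", "z")],
]
print(files_to_probe("g", l0, levels))
```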
How LC Works
- Flush – When the memtable fills, it becomes an L0 SST file.
- L0 → L1 – When the number of L0 files reaches level0_file_num_compaction_trigger, RocksDB picks a set of overlapping L0 files and merges them with overlapping L1 files.
- Cascade – The resulting file(s) are placed in the next level. If a level exceeds its size budget (max_bytes_for_level_base × multiplier^(level − 1), so L1’s budget is max_bytes_for_level_base itself), a compaction to the following level is triggered.
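Because each level caps its total size, the number of levels grows only logarithmically with the data volume: roughly log₁₀(N / max_bytes_for_level_base) + 1 with the default multiplier of 10, where N is the total data volume. Write amplification grows with the number of levels, since data is rewritten as it moves down, while the bounded number of files a read must check keeps read latency predictable.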
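The level budgets can be worked out with a few lines of arithmetic. This sketch assumes the base budget applies to L1 and that each later level is a fixed multiple of the previous one; it ignores dynamic level sizing, and the helper names are illustrative.

```python
GIB = 1024 ** 3


def level_targets(base_bytes, multiplier, num_levels):
    """Size budgets for L1..L<num_levels>, with L1's budget equal to base_bytes."""
    return [base_bytes * multiplier ** i for i in range(num_levels)]


def levels_needed(total_bytes, base_bytes, multiplier):
    """Smallest number of levels whose last level's budget can hold total_bytes."""
    levels = 1
    capacity = base_bytes
    while capacity < total_bytes:
        capacity *= multiplier
        levels += 1
    return levels


# A 500 GiB dataset with a 1 GiB base and a multiplier of 10.
n = levels_needed(500 * GIB, 1 * GIB, 10)
print(n, [t // GIB for t in level_targets(1 * GIB, 10, n)])
```

Growing the dataset tenfold adds only one more level in this model, which is the logarithmic behavior described above.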
When to Choose Leveled
- Hot key‑value lookups where latency matters more than raw write throughput.
- Mixed read/write workloads (e.g., serving user profiles while ingesting updates).
- Limited SSD capacity; LC keeps space overhead low (≈ 2× raw data).
Production Pitfalls
| Symptom | Likely Cause | Fix |
|---|---|---|
| Sudden latency spikes after a burst | Too many L0 files causing large compaction jobs | Lower level0_file_num_compaction_trigger or increase max_background_compactions |
| High CPU usage in background threads | Compaction threads competing with foreground writes | Allocate dedicated CPU cores, tune max_background_flushes |
| Disk space growth > 2× data | target_file_size_base too small → many tiny files | Raise target_file_size_base (e.g., 64 MiB → 128 MiB) |
Sample Configuration (YAML for options.yaml)
# Leveled compaction tuned for 500 GB dataset on NVMe SSDs
disable_auto_compactions: false
compaction_style: kCompactionStyleLevel
target_file_size_base: 134217728 # 128 MiB
max_bytes_for_level_base: 1073741824 # 1 GiB (Level‑1 size budget)
max_bytes_for_level_multiplier: 10
level0_file_num_compaction_trigger: 4
max_background_compactions: 4
max_background_flushes: 2
Tiered Compaction (TC)
Tiered compaction (called Universal compaction in RocksDB) accumulates sorted runs of similar size instead of enforcing per‑level size budgets. Sorted runs may overlap each other in key range, and a merge is triggered by the number and relative sizes of the runs rather than by a level exceeding its budget.
How TC Works
- Flush – Same as LC, writes become L0 files.
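- Tier Formation – Sorted runs of similar size form a tier. The size_ratio setting (a percentage, part of the universal compaction options) controls how close in size adjacent runs must be to be merged together.
- Compaction Trigger – Once the number of sorted runs reaches level0_file_num_compaction_trigger, RocksDB merges the runs selected by the size‑ratio rule, or merges everything when estimated space amplification exceeds max_size_amplification_percent.
- Garbage Collection – Obsolete keys are dropped only when the runs that contain them are merged. periodic_compaction_seconds forces old files to be rewritten. allow_trivial_move is unrelated to TTL: it lets non‑overlapping files be moved to their destination without being rewritten.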
Because TC postpones merges, write amplification can be as low as 1.5×, but read amplification grows (multiple overlapping files per tier). This is ideal for append‑only logs, time‑series data, or batch ingestion pipelines where reads are infrequent.
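The way a tiered strategy postpones merges can be sketched with a size‑ratio picker. This is a simplified illustration of the idea, not RocksDB's implementation: it only considers candidates starting from the newest run, treats size_ratio_percent as a percentage tolerance, and ignores space‑amplification and run‑count triggers. The function and parameter names are illustrative.

```python
def pick_by_size_ratio(run_sizes, size_ratio_percent=1, min_merge_width=2):
    """Return how many of the newest sorted runs should be merged, or 0 for none.

    run_sizes: sizes of the sorted runs, ordered newest first.
    A run joins the candidate set while it is no larger than the accumulated
    candidate size plus size_ratio_percent.
    """
    if not run_sizes:
        return 0
    picked = 1
    accumulated = run_sizes[0]
    for size in run_sizes[1:]:
        # Integer form of: size > accumulated * (100 + ratio) / 100
        if size * 100 > accumulated * (100 + size_ratio_percent):
            break
        accumulated += size
        picked += 1
    return picked if picked >= min_merge_width else 0


# Four small fresh runs and one large old run: only the small ones merge.
print(pick_by_size_ratio([1, 1, 1, 1, 10]))
# Runs that grow geometrically are left alone.
print(pick_by_size_ratio([1, 5, 25]))
```

Large old runs are left untouched until enough newer data has accumulated next to them, which is why each byte is rewritten fewer times than under leveled compaction.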
When to Choose Tiered
- Write‑heavy ingestion (e.g., Kafka log segments, metric collectors).
- Large immutable datasets where reads are batch‑oriented.
- Environments with abundant SSD space (TC can temporarily double storage usage).
Production Pitfalls
| Symptom | Likely Cause | Fix |
|---|---|---|
| Reads scanning dozens of files | Tiered compaction left many overlapping SSTs | Configure Bloom filters so point reads skip non‑matching files, or lower level0_file_num_compaction_trigger so runs are merged sooner |
| Disk usage spikes during bulk load | Large merges temporarily hold both their input and output files | Provision headroom, tune max_size_amplification_percent, or temporarily switch to kCompactionStyleLevel for the load |
| Long tail latency on point reads | TTL compaction not catching stale keys | Set periodic_compaction_seconds or ttl per column family |
Sample Configuration (JSON for programmatic API)
{
  "disable_auto_compactions": false,
  "compaction_style": "kCompactionStyleUniversal",
  "target_file_size_base": 268435456,
  "max_size_amplification_percent": 200,
  "size_ratio": 2,
  "allow_trivial_move": true,
  "max_background_compactions": 6,
  "max_background_flushes": 3
}
Architecture: Mixing Compaction Strategies in Production
Many large‑scale services don’t commit to a single strategy. A RocksDB‑based log store, for example, can run Leveled for hot index column families and Tiered for the raw log segment column family. This hybrid approach exploits the strengths of each mode.
Column Families as Strategy Boundaries
RocksDB lets you create column families—independent key spaces with their own options. Example architecture:
+----------------+ +--------------------+
| UserIndex CF | <---> | Leveled Compaction |
+----------------+ +--------------------+
| EventLog CF | <---> | Tiered Compaction |
+----------------+ +--------------------+
- UserIndex CF: Frequent point lookups (Get), low write volume → Leveled.
- EventLog CF: High‑throughput appends, occasional range scans → Tiered.
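The mapping from column family to strategy can be kept as plain data. A minimal sketch, assuming a service that renders its own key: value options file; the CF names, the CF_OPTIONS table, and render_options are hypothetical, and the values mirror the sample configurations earlier in this post.

```python
MIB = 1024 * 1024

LEVELED = {
    "compaction_style": "kCompactionStyleLevel",
    "target_file_size_base": 128 * MIB,
}
TIERED = {
    "compaction_style": "kCompactionStyleUniversal",
    "target_file_size_base": 256 * MIB,
}

# Hypothetical column family names for the architecture above.
CF_OPTIONS = {"user_index": LEVELED, "event_log": TIERED}


def render_options(cf_name):
    """Render the options for one column family as key: value lines."""
    opts = CF_OPTIONS.get(cf_name, LEVELED)  # unknown CFs fall back to leveled
    return "\n".join(f"{key}: {value}" for key, value in opts.items())


print(render_options("event_log"))
```

Keeping the table in one place makes it easy to review which access pattern justified which strategy.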
Deploying with Kubernetes
A typical deployment uses an entrypoint wrapper script to generate options.yaml per pod, based on environment variables that reflect the workload. Below is a Bash snippet that selects the compaction style at container start and then hands off to the main process:
#!/usr/bin/env bash
# Write per-pod options based on CF_NAME, then hand off to the main process.
set -euo pipefail
if [[ "${CF_NAME:-}" == "eventlog" ]]; then
cat <<EOF > /data/options.yaml
disable_auto_compactions: false
compaction_style: kCompactionStyleUniversal
target_file_size_base: 268435456
max_size_amplification_percent: 200
size_ratio: 2
allow_trivial_move: true
max_background_compactions: 6
max_background_flushes: 3
EOF
else
cat <<EOF > /data/options.yaml
disable_auto_compactions: false
compaction_style: kCompactionStyleLevel
target_file_size_base: 134217728
max_bytes_for_level_base: 1073741824
max_bytes_for_level_multiplier: 10
level0_file_num_compaction_trigger: 4
max_background_compactions: 4
max_background_flushes: 2
EOF
fi
exec "$@"
The pod spec passes CF_NAME via an env var, allowing a single Docker image to serve both families.
Tuning Tips & Benchmarks
Below are the most impactful knobs, ordered by typical ROI. All values are examples; always benchmark on your hardware.
| Parameter | Impact | Recommended Starting Point |
|---|---|---|
| write_buffer_size | Controls memtable size; larger buffers reduce flush frequency. | 64 MiB – 256 MiB (per column family) |
| max_write_buffer_number | Number of memtables that can exist simultaneously. | 3 – 5 |
| target_file_size_base | Larger files → fewer compactions, higher write throughput. | 128 MiB (LC) / 256 MiB (TC) |
| max_background_compactions | Parallelism of compaction threads (newer releases prefer max_background_jobs). | #CPU cores – 1 |
| rate_limiter_bytes_per_sec | Caps I/O; prevents compaction from starving foreground reads. | 500 MiB/s for NVMe, 100 MiB/s for SATA |
| compression | kLZ4Compression balances CPU and space; kNoCompression boosts write speed. | LZ4 for most, NoCompression for pure logs |
| optimize_filters_for_hits | Skips Bloom filters on the last level, saving memory when most lookups find their key. | true (LC) |
| ttl | Sends files older than the given age (in seconds) through compaction, so obsolete entries in time‑series data are purged sooner. | Set per column family if applicable |
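When picking memtable settings from the table, it helps to check the worst‑case memory they imply. A rough sketch, assuming every column family uses the same settings and all of its memtables are full at once; the helper name is illustrative.

```python
MIB = 1024 * 1024


def memtable_budget_bytes(write_buffer_size, max_write_buffer_number, num_column_families):
    """Worst-case memory held by memtables across all column families."""
    return write_buffer_size * max_write_buffer_number * num_column_families


# 128 MiB buffers, up to 4 memtables each, two column families.
budget = memtable_budget_bytes(128 * MIB, 4, 2)
print(budget // MIB, "MiB")
```

Compare the result with the RAM left over after the block cache and the OS page cache have their share.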
Benchmark Snapshot (single‑node, 8‑core Xeon, 1 TB NVMe)
| Workload | Compaction | QPS (writes) | Avg Read Latency (µs) | Write Amplification |
|---|---|---|---|---|
| 100 % inserts, 1 KB payload | Leveled | 150 k | 180 | 2.9× |
| 100 % inserts, 1 KB payload | Tiered | 320 k | 620 | 1.7× |
| 90 % reads, 10 % writes | Leveled | 75 k | 95 | 2.8× |
| 90 % reads, 10 % writes | Tiered | 68 k | 210 | 1.9× |
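Result: On the insert‑only workload, Tiered compaction delivers roughly 2.1× the write QPS of Leveled, but read latency is 2.2× to 3.4× higher across the two workloads, and on the read‑heavy mix Tiered’s QPS is slightly lower (68 k vs. 75 k). Choose based on your SLAs.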
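The ratios behind that summary can be recomputed directly from the table. The figures below are copied from the benchmark rows above; the dictionary layout and helper name are just one convenient way to arrange them.

```python
# Figures copied from the benchmark table above.
benchmark = {
    ("inserts", "leveled"): {"write_qps": 150_000, "read_latency_us": 180},
    ("inserts", "tiered"): {"write_qps": 320_000, "read_latency_us": 620},
    ("mixed", "leveled"): {"write_qps": 75_000, "read_latency_us": 95},
    ("mixed", "tiered"): {"write_qps": 68_000, "read_latency_us": 210},
}


def tiered_vs_leveled(workload, metric):
    """Ratio of the tiered figure to the leveled figure for one workload."""
    tiered = benchmark[(workload, "tiered")][metric]
    leveled = benchmark[(workload, "leveled")][metric]
    return tiered / leveled


for workload in ("inserts", "mixed"):
    for metric in ("write_qps", "read_latency_us"):
        ratio = tiered_vs_leveled(workload, metric)
        print(f"{workload:8s} {metric:16s} {ratio:.2f}x")
```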
Real‑World Checklist
- Profile the workload – Use the rocksdb.estimate-num-keys and rocksdb.stats properties to gauge data volume and read/write ratios.
- Pick a compaction style per column family – Align with access patterns.
- Size the memtables – Keep write_buffer_size × max_write_buffer_number × (number of column families) to a modest share of RAM, leaving room for the block cache and the OS page cache.
- Enable a rate limiter if you share the SSD with other services.
- Monitor CompactionTime and CompactionCPUTime via Prometheus exporter (see RocksDB Prometheus Exporter).
- Apply mutable options at runtime with DB::SetOptions; run a rolling restart only for options that cannot be changed on a live database.
- Periodically run DB::CompactRange on cold column families to reclaim space.
Key Takeaways
- Leveled compaction offers bounded read latency at the cost of higher write amplification; ideal for hot key‑value lookups.
- Tiered compaction minimizes write amplification and maximizes ingest speed, but reads may scan many overlapping SST files.
- Use column families to apply different compaction strategies within the same RocksDB instance.
- Tune memtable size, target file size, and background thread counts before adjusting more exotic knobs.
- Always measure: write throughput, read latency, and write amplification are the three axes that dictate which strategy wins for your SLA.