TL;DR: Leveled compaction gives predictable read latency for hot workloads, while tiered compaction maximizes write throughput for append‑only streams. By profiling your write pattern, sizing write buffers, and tuning target_file_size_base and max_bytes_for_level_base, you can get substantially more write QPS from the same hardware (about 2× in the benchmark below).

RocksDB has become the go‑to embedded store for high‑performance services ranging from Kafka Streams state stores to Facebook’s social graph. Yet many teams hit a wall when their write‑heavy pipelines start stalling or their read latency spikes. The root cause is almost always compaction: how the engine reorganizes immutable SST files on disk. This post walks through the two primary compaction modes (Leveled and Tiered), explains the underlying architecture, and delivers a checklist of concrete knobs you can turn in a production environment.

Understanding RocksDB’s Compaction Basics

RocksDB stores data in a Log‑Structured Merge (LSM) tree. Writes land in an in‑memory memtable, are flushed to immutable SST files, and later merged (compacted) to keep read paths short. Compaction is why RocksDB can sustain millions of writes per second, but it also consumes CPU, I/O, and space.

Key concepts:

| Term | Meaning |
| --- | --- |
| Level | Logical layer of SST files. Level‑0 allows overlapping files; higher levels enforce non‑overlap. |
| Target File Size | Desired size for each SST after compaction (target_file_size_base). |
| Compaction Trigger | Conditions that start a compaction (e.g., too many Level‑0 files, or a level exceeding its size target). The related soft_pending_compaction_bytes_limit does not start compactions; it slows writes down when compaction falls behind. |
| Write Amplification | Ratio of total bytes written to storage vs. bytes supplied by the application. |
| Read Amplification | Number of SST files a read must probe on average. |
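As a quick illustration of the write amplification definition above, here is a minimal Python sketch. The function name and the byte counts are made up for the example; in practice you would take both numbers from your own monitoring.

```python
GIB = 1024 ** 3

def write_amplification(bytes_written_to_storage: int, bytes_from_application: int) -> float:
    """Ratio of bytes physically written (flush + compaction) to bytes the application supplied."""
    if bytes_from_application <= 0:
        raise ValueError("bytes_from_application must be positive")
    return bytes_written_to_storage / bytes_from_application

# Hypothetical numbers: the application wrote 10 GiB, storage absorbed 29 GiB.
wa = write_amplification(29 * GIB, 10 * GIB)
print(f"write amplification: {wa:.1f}x")
```

A value of 1.0 would mean every byte is written exactly once; anything above that is work added by flushes and compactions.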

Two compaction strategies implement these concepts differently.

Leveled Compaction (LC)

Leveled compaction (the default) maintains a series of levels where each level’s total size is roughly 10× the size of the previous one (configurable via max_bytes_for_level_multiplier). Files within L1 and deeper levels never overlap, so a point query checks at most one file per level there, plus any L0 files whose key range covers the key.
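To make the read-path bound concrete, the following Python sketch counts the worst-case number of SST files a point lookup may have to check under leveled compaction. The helper name is hypothetical, and the bound ignores Bloom filters, which usually let RocksDB skip most of these files.

```python
def max_files_probed_leveled(l0_files: int, nonempty_deeper_levels: int) -> int:
    """Worst case for one point lookup: every L0 file (they may overlap)
    plus one file in each non-empty level from L1 downward."""
    if l0_files < 0 or nonempty_deeper_levels < 0:
        raise ValueError("counts must be non-negative")
    return l0_files + nonempty_deeper_levels

# Example: 4 files sitting in L0 and data present in L1..L5.
print(max_files_probed_leveled(4, 5))
```

The L0 term is the only part that grows with write bursts, which is why the number of L0 files matters so much for read latency.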

How LC Works

  1. Flush – When the memtable fills, it becomes an L0 SST file.
  2. L0 → L1 – When the number of L0 files reaches level0_file_num_compaction_trigger, RocksDB picks a set of overlapping L0 files and merges them with overlapping L1 files.
  3. Cascade – The merged output is written to L1. When a level’s total size exceeds its target (with static level sizing, max_bytes_for_level_base × multiplier^(n−1) for level n ≥ 1), RocksDB picks files from it and compacts them into the following level.

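Because each level caps its total size, the number of levels grows only logarithmically with data volume: roughly log₁₀(N / max_bytes_for_level_base) with the default multiplier, where N is the total data volume. Write amplification is higher than that count, because a byte can be rewritten at every level and merging into a level about 10× larger may rewrite several bytes for each byte moved; in the worst case it scales with the multiplier times the number of levels. In return, the small and bounded number of files a lookup must check is what gives LC its predictable read latency.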
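The level budgets are easy to work out by hand, but a short Python sketch makes the growth visible. It assumes static level sizing (targets derived from max_bytes_for_level_base, not from the size of the last level), and the function names are illustrative only.

```python
GIB = 1024 ** 3

def level_targets(base_bytes: int, multiplier: int, num_levels: int) -> dict[int, int]:
    """Target size of L1..L{num_levels}: base_bytes * multiplier ** (n - 1)."""
    return {n: base_bytes * multiplier ** (n - 1) for n in range(1, num_levels + 1)}

def levels_needed(total_bytes: int, base_bytes: int, multiplier: int) -> int:
    """Smallest level n whose target size can hold total_bytes."""
    level, target = 1, base_bytes
    while target < total_bytes:
        target *= multiplier
        level += 1
    return level

for level, size in level_targets(GIB, 10, 4).items():
    print(f"L{level}: {size // GIB} GiB")

# A 500 GiB dataset with a 1 GiB L1 target and multiplier 10:
print("levels needed:", levels_needed(500 * GIB, GIB, 10))
```

With these inputs the dataset fits once L4 exists, which is why adding data rarely adds a level.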

When to Choose Leveled

  • Hot key‑value lookups where latency matters more than raw write throughput.
  • Mixed read/write workloads (e.g., serving user profiles while ingesting updates).
  • Limited SSD capacity; LC keeps space overhead low (≈ 2× raw data).

Production Pitfalls

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Sudden latency spikes after a burst | Too many L0 files causing large compaction jobs | Lower level0_file_num_compaction_trigger or increase max_background_compactions |
| High CPU usage in background threads | Compaction threads competing with foreground writes | Allocate dedicated CPU cores; tune max_background_compactions and max_background_flushes |
| Disk space growth > 2× data | target_file_size_base too small → many tiny files | Raise target_file_size_base (e.g., 64 MiB → 128 MiB) |

Sample Configuration (YAML for options.yaml)

# Leveled compaction tuned for 500 GB dataset on NVMe SSDs
disable_auto_compactions: false
compaction_style: kCompactionStyleLevel
target_file_size_base: 134217728   # 128 MiB
max_bytes_for_level_base: 1073741824  # 1 GiB (Level‑1 size target)
max_bytes_for_level_multiplier: 10
level0_file_num_compaction_trigger: 4
max_background_compactions: 4
max_background_flushes: 2

Tiered Compaction (TC)

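Tiered compaction (implemented in RocksDB as Universal compaction) organizes data into sorted runs ordered by age rather than into size‑capped levels. The key ranges of different sorted runs may overlap, and RocksDB merges runs only when a trigger fires: the number of sorted runs, the size ratio between adjacent runs, or the estimated space amplification.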

How TC Works

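  1. Flush – Same as LC: the memtable becomes a new L0 file, which is also the newest sorted run.
  2. Tier Formation – Sorted runs of similar size behave like a tier. The size_ratio setting (in compaction_options_universal) is a percentage tolerance used when comparing a group of newer runs with the next older run; it is not a multiplier between tiers.
  3. Compaction Trigger – Once the number of sorted runs reaches level0_file_num_compaction_trigger, RocksDB picks runs to merge based on the estimated space amplification (max_size_amplification_percent) or on the size ratio between adjacent runs.
  4. Garbage Collection – Obsolete keys disappear only when a compaction rewrites the files that hold them; periodic_compaction_seconds can force old files to be rewritten. allow_trivial_move is a separate optimization that moves non‑overlapping files without rewriting them.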

Because TC postpones merges, write amplification can be as low as 1.5×, but read amplification grows, since a lookup may have to check every overlapping sorted run. This is ideal for append‑only logs, time‑series data, or batch ingestion pipelines where reads are infrequent.
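The size-ratio rule is the least intuitive of these triggers, so here is a simplified Python sketch of it. It is an approximation written for this post, not RocksDB's actual picker: it ignores max_merge_width, the space amplification check, and files already being compacted, and the function name is hypothetical.

```python
from typing import Optional, Sequence, Tuple

def pick_size_ratio_compaction(
    run_sizes: Sequence[int],
    size_ratio_percent: int = 1,
    min_merge_width: int = 2,
) -> Optional[Tuple[int, int]]:
    """run_sizes lists sorted-run sizes from newest to oldest.
    Returns (start, end) such that run_sizes[start:end] would be merged, or None."""
    for start in range(len(run_sizes)):
        candidate_size = run_sizes[start]
        count = 1
        for next_size in run_sizes[start + 1:]:
            # Stop when the next older run is larger than the accumulated
            # candidate size plus the percentage tolerance.
            if candidate_size * (100 + size_ratio_percent) / 100 < next_size:
                break
            candidate_size += next_size
            count += 1
        if count >= min_merge_width:
            return (start, start + count)
    return None

# Four similar-sized young runs get merged; the large old run is left alone.
print(pick_size_ratio_compaction([1, 1, 2, 4, 64]))
# Runs that differ by a large factor are not merged by this rule.
print(pick_size_ratio_compaction([1, 8, 64]))
```

The second call shows why tiered compaction writes so little: as long as each run is much larger than the ones above it, nothing gets rewritten.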

When to Choose Tiered

  • Write‑heavy ingestion (e.g., Kafka log segments, metric collectors).
  • Large immutable datasets where reads are batch‑oriented.
  • Environments with abundant SSD space (TC can temporarily double storage usage).

Production Pitfalls

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Reads scanning dozens of files | Tiered compaction left many overlapping sorted runs | Lower level0_file_num_compaction_trigger so merges start with fewer sorted runs, make sure Bloom filters are enabled, or move the column family to Leveled |
| Disk usage spikes during bulk load | Large merges temporarily need space for both their inputs and their outputs | Tune max_size_amplification_percent, keep free‑space headroom, or use kCompactionStyleLevel for the load |
| Long tail latency on point reads | Stale keys are not being compacted away | Set periodic_compaction_seconds, or ttl per column family where the compaction style supports it |

Sample Configuration (JSON for programmatic API)

{
  "disable_auto_compactions": false,
  "compaction_style": "kCompactionStyleUniversal",
  "target_file_size_base": 268435456,
  "max_size_amplification_percent": 200,
  "size_ratio": 2,
  "allow_trivial_move": true,
  "max_background_compactions": 6,
  "max_background_flushes": 3
}

Architecture: Mixing Compaction Strategies in Production

Many large‑scale services don’t commit to a single strategy. A log store built on RocksDB, for instance, can run Leveled for hot index column families and Tiered for the raw log segment column family. This hybrid approach exploits the strengths of each mode.

Column Families as Strategy Boundaries

RocksDB lets you create column families—independent key spaces with their own options. Example architecture:

+----------------+       +--------------------+
|   UserIndex CF | <---> | Leveled Compaction |
+----------------+       +--------------------+
|   EventLog CF  | <---> | Tiered Compaction  |
+----------------+       +--------------------+

  • UserIndex CF: Frequent point lookups (Get), low write volume → Leveled.
  • EventLog CF: High‑throughput appends, occasional range scans → Tiered.
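The mapping from access pattern to compaction style can be written down as a small rule. The Python sketch below is only a heuristic for illustration: the WorkloadProfile type, the 0.7 threshold, and the request rates are assumptions, not values taken from RocksDB or from a measured system.

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    name: str
    reads_per_sec: float
    writes_per_sec: float

def choose_compaction_style(profile: WorkloadProfile, write_heavy_threshold: float = 0.7) -> str:
    """Pick Universal (tiered) when writes dominate, Leveled otherwise.
    The threshold is a hypothetical starting point; benchmark before relying on it."""
    total = profile.reads_per_sec + profile.writes_per_sec
    if total == 0:
        return "kCompactionStyleLevel"  # no traffic observed: stay on the default
    write_fraction = profile.writes_per_sec / total
    if write_fraction >= write_heavy_threshold:
        return "kCompactionStyleUniversal"
    return "kCompactionStyleLevel"

for cf in (
    WorkloadProfile("UserIndex", reads_per_sec=9_000, writes_per_sec=1_000),
    WorkloadProfile("EventLog", reads_per_sec=100, writes_per_sec=50_000),
):
    print(cf.name, "->", choose_compaction_style(cf))
```

Keeping the decision in one function makes it easy to revisit the threshold once real traffic numbers are available.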

Deploying with Kubernetes

A typical deployment uses an entrypoint script (or an init container) to generate options.yaml per pod, based on environment variables that reflect the workload. Below is a Bash snippet that selects the compaction style at container start:

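#!/usr/bin/env bash
# Write /data/options.yaml for the column family named in CF_NAME,
# then hand control to the container's main command.
set -euo pipefail

if [[ "${CF_NAME:-}" == "eventlog" ]]; then
  # Write-heavy event log: universal (tiered) compaction.
  cat <<'EOF' > /data/options.yaml
disable_auto_compactions: false
compaction_style: kCompactionStyleUniversal
target_file_size_base: 268435456
max_size_amplification_percent: 200
size_ratio: 2
allow_trivial_move: true
max_background_compactions: 6
max_background_flushes: 3
EOF
else
  # Everything else (including an unset CF_NAME): leveled compaction.
  cat <<'EOF' > /data/options.yaml
disable_auto_compactions: false
compaction_style: kCompactionStyleLevel
target_file_size_base: 134217728
max_bytes_for_level_base: 1073741824
max_bytes_for_level_multiplier: 10
level0_file_num_compaction_trigger: 4
max_background_compactions: 4
max_background_flushes: 2
EOF
fi

# Replace this shell with the command passed to the container.
exec "$@"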

The pod spec passes CF_NAME via an env var, allowing a single Docker image to serve both families.

Tuning Tips & Benchmarks

Below are the most impactful knobs, ordered by typical ROI. All values are examples; always benchmark on your hardware.

| Parameter | Impact | Recommended Starting Point |
| --- | --- | --- |
| write_buffer_size | Controls memtable size; larger buffers reduce flush frequency. | 64 MiB – 256 MiB (per column family) |
| max_write_buffer_number | Number of memtables that can exist simultaneously. | 3 – 5 |
| target_file_size_base | Larger files → fewer compactions, higher write throughput. | 128 MiB (LC) / 256 MiB (TC) |
| max_background_compactions | Parallelism of compaction threads. | #CPU cores – 1 |
| rate_limiter_bytes_per_sec | Caps I/O; prevents compaction from starving foreground reads. | 500 MiB/s for NVMe, 100 MiB/s for SATA |
| compression | kLZ4Compression balances CPU and space; kNoCompression boosts write speed. | LZ4 for most, NoCompression for pure logs |
| optimize_filters_for_hits | Skips Bloom filters on the bottommost level to save memory; useful when most lookups are for keys that exist. | true (LC), if lookups mostly hit |
| ttl | Forces files older than the given age through compaction, so deletions and compaction‑filter expiry reach cold time‑series data. | Set per column family if applicable |
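Since write_buffer_size and max_write_buffer_number multiply, it helps to check the worst-case memtable footprint before raising either. The Python sketch below does that arithmetic; the function name and the two-column-family setup are assumptions for the example.

```python
MIB = 1024 ** 2

def memtable_budget_bytes(write_buffer_size: int, max_write_buffer_number: int,
                          num_column_families: int) -> int:
    """Upper bound on memtable memory if every column family fills all of its buffers."""
    return write_buffer_size * max_write_buffer_number * num_column_families

# Example: 128 MiB buffers, up to 4 memtables each, 2 column families.
budget = memtable_budget_bytes(128 * MIB, 4, 2)
print(f"worst-case memtable memory: {budget // MIB} MiB")
```

Compare the result against the RAM you can spare after the block cache and OS page cache have their share.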

Benchmark Snapshot (single‑node, 8‑core Xeon, 1 TB NVMe)

| Workload | Compaction | QPS (writes) | Avg Read Latency (µs) | Write Amplification |
| --- | --- | --- | --- | --- |
| 100 % inserts, 1 KB payload | Leveled | 150 k | 180 | 2.9× |
| 100 % inserts, 1 KB payload | Tiered | 320 k | 620 | 1.7× |
| 90 % reads, 10 % writes | Leveled | 75 k | 95 | 2.8× |
| 90 % reads, 10 % writes | Tiered | 68 k | 210 | 1.9× |

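Result: Tiered compaction delivers roughly 2.1× the write QPS on the insert‑only workload, but at about 3.4× the average read latency; on the read‑heavy mix, Leveled is ahead on both QPS and latency. Choose based on your SLAs.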
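The ratios can be recomputed directly from the benchmark rows. This Python sketch copies the QPS and latency figures from the table above; the dictionary layout and function name are just one convenient way to hold them.

```python
# (workload, compaction) -> (write QPS, average read latency in microseconds)
BENCHMARK = {
    ("inserts", "leveled"): (150_000, 180),
    ("inserts", "tiered"): (320_000, 620),
    ("mixed", "leveled"): (75_000, 95),
    ("mixed", "tiered"): (68_000, 210),
}

def tiered_vs_leveled(workload: str) -> tuple[float, float]:
    """Return (QPS ratio, read-latency ratio) of tiered relative to leveled."""
    tiered_qps, tiered_latency = BENCHMARK[(workload, "tiered")]
    leveled_qps, leveled_latency = BENCHMARK[(workload, "leveled")]
    return tiered_qps / leveled_qps, tiered_latency / leveled_latency

for workload in ("inserts", "mixed"):
    qps_ratio, latency_ratio = tiered_vs_leveled(workload)
    print(f"{workload}: {qps_ratio:.2f}x QPS, {latency_ratio:.2f}x read latency")
```

Running the same comparison on your own benchmark output is a quick way to see which side of the trade-off a workload falls on.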

Real‑World Checklist

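  1. Profile the workload – Use rocksdb.estimate-num-keys for dataset size, rocksdb.stats for compaction statistics, and the rocksdb.number.keys.read and rocksdb.number.keys.written tickers to see read/write ratios.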
  2. Pick a compaction style per column family – Align with access patterns.
  3. Size write_buffer_size × max_write_buffer_number across all column families to fit a fixed memory budget, leaving most of your RAM for the block cache and OS cache.
  4. Enable a rate limiter if you share the SSD with other services.
  5. Monitor CompactionTime and CompactionCPUTime via Prometheus exporter (see RocksDB Prometheus Exporter).
  6. Apply mutable options at runtime with DB::SetOptions; immutable options only take effect when the DB is reopened, so plan a rolling restart for those.
  7. Periodically run DB::CompactRange on cold column families to reclaim space.

Key Takeaways

  • Leveled compaction offers bounded read latency at the cost of higher write amplification; ideal for hot key‑value lookups.
  • Tiered compaction minimizes write amplification and maximizes ingest speed, but reads may scan many overlapping SST files.
  • Use column families to apply different compaction strategies within the same RocksDB instance.
  • Tune memtable size, target file size, and background thread counts before adjusting more exotic knobs.
  • Always measure: write throughput, read latency, and write amplification are the three axes that dictate which strategy wins for your SLA.
