TL;DR — Leveled compaction offers tighter read latency at the cost of write amplification, while tiered compaction scales write throughput for hot workloads. Choose leveled for latency‑sensitive services, tiered for write‑heavy pipelines, and tune thresholds based on your SSD I/O budget and key distribution.

RocksDB powers everything from high‑frequency trading platforms to real‑time analytics pipelines. Its ability to store billions of key‑value pairs on cheap SSDs hinges on how it reorganizes data on‑disk, a process called compaction. Two strategies dominate production deployments: Leveled Compaction (the default) and Tiered Compaction (often called “Universal”). This article unpacks the internal mechanics, compares architectural trade‑offs, and provides concrete configuration snippets you can drop into a Java or C++ client today.

RocksDB Basics

Before diving into compaction, it helps to recall the building blocks:

ComponentRole
MemTableIn‑memory write buffer, flushed to an SST file when full.
Write‑Ahead Log (WAL)Guarantees durability; replayed on crash recovery.
SST (Sorted String Table)Immutable on‑disk file containing a sorted slice of keys.
Levels / TiersLogical grouping of SSTs that determines when and how they are merged.

RocksDB writes are append‑only: a new version of a key lives in the newest SST, while older versions linger in lower levels until a compaction discards them. The compaction strategy determines how fast obsolete data is reclaimed and how much read‑amplification (extra SSTs scanned per query) a workload experiences.

Compaction Fundamentals

Compaction is triggered by three primary metrics:

  1. Size‑Based Triggers – When the total size of SSTs at a level exceeds a configured threshold.
  2. Count‑Based Triggers – When the number of files in a level grows beyond a limit.
  3. Write‑Stall Triggers – When the memtable cannot flush because the lowest level is full, causing the DB to pause writes.

During a compaction, RocksDB reads overlapping SSTs, merges sorted key streams, drops deleted or overwritten entries, and writes the result into new SSTs at a higher level (or the same level for tiered). The cost of compaction is measured in three dimensions:

DimensionLeveledTiered
Write Amplification5–10× (multiple levels)1–3× (single tier)
Read Amplification1–2 SSTs per query (tight)3–5 SSTs per query (broader)
Space Amplification2–3× (reserved for level size ratios)1.5–2× (less reserved)

Understanding these numbers is key to mapping a strategy onto a production SLA.

Leveled Compaction Architecture

Leveled compaction organizes SSTs into L0, L1, L2 … where each level (except L0) holds files of roughly the same size. The classic rule is a size ratio of 10: each level is ten times larger than the one above it. When L0 exceeds a file count threshold, RocksDB selects a compaction candidate from L0 and merges it with overlapping files in L1, producing new files that land in L1. This cascade repeats up the ladder.

How Overlap is Managed

RocksDB uses key range metadata stored in the manifest. For each level, it maintains a non‑overlapping set of SSTs (except L0). The algorithm:

  1. Pick L0 files based on smallest overlapping range.
  2. Identify overlapping L1 files using binary search on sorted key ranges.
  3. Merge all selected files, dropping tombstones older than the compaction stop style.
  4. Rewrite the merged output into L1, respecting the target file size (target_file_size_base).

A typical configuration in Java:

import org.rocksdb.*;

Options options = new Options();
options.setCreateIfMissing(true);
options.setCompactionStyle(CompactionStyle.LEVEL);
options.setTargetFileSizeBase(64 * 1024 * 1024); // 64 MiB
options.setLevel0FileNumCompactionTrigger(4);
options.setMaxBytesForLevelBase(256 * 1024 * 1024); // 256 MiB for L1
options.setMaxBytesForLevelMultiplier(10.0);
DB db = RocksDB.open(options, "/tmp/rocksdb_leveled");

When Leveled Shines

  • Latency‑Sensitive Reads – Because each level contains non‑overlapping files, a point lookup typically reads one file per level (often just L0 + L1). This yields sub‑millisecond latency even on spinning disks.
  • Predictable Space – The size‑ratio guarantees that the total DB size stays within a known bound (≈ 1.2× the live data set).
  • Cold‑Data Workloads – If most reads target older data that rarely changes, the extra write amplification is acceptable.

Failure Modes & Mitigations

Failure ModeSymptomMitigation
Write StallWrites pause when L6 (or last level) is full.Increase max_bytes_for_level_base or enable dynamic level bytes (setLevelCompactionDynamicLevelBytes(true)).
High Delete‑Stale RatioDeleting many keys leads to many tombstones lingering.Use setCompactionOptionsUniversal(new CompactionOptionsUniversal()) with setMaxDeletePercent(20) or run manual compaction on hot ranges.
SSD WearFrequent compactions cause write amplification → SSD wear.Switch to tiered for write‑heavy ingestion, or enable setDisableAutoCompactions(true) and schedule off‑peak compactions.

Tiered Compaction Architecture

Tiered (or Universal) compaction treats all SSTs as part of a single logical tier that grows in size. Instead of moving data up a ladder, it merges files within the same tier until the total size reaches a configured limit (max_size_amplification_percent). The result is a set of large, non‑overlapping files that can be pruned aggressively.

Core Algorithm

  1. Collect Candidates – Pick the smallest N files (by size) that exceed the min_merge_width threshold.
  2. Merge – Perform a k‑way merge, dropping obsolete entries and tombstones older than stop_style.
  3. Write Back – Output a single larger SST, optionally splitting if it exceeds target_file_size_base.
  4. Repeat – Continue until the tier’s total size respects the size amplification bound.

A YAML snippet for RocksDB’s C++ API (used in many services that embed RocksDB directly):

rocksdb:
  create_if_missing: true
  compaction_style: universal
  universal_compaction:
    min_merge_width: 2
    max_merge_width: 4
    max_size_amplification_percent: 200
    stop_style: kCompactionStopBeforeCopy
    target_file_size_base: 64MiB
    max_background_compactions: 4
  db_path: /var/lib/myservice/rocksdb_tiered

When Tiered Wins

  • Write‑Heavy Ingestion – Log streaming, IoT telemetry, or click‑stream pipelines that push millions of keys per second.
  • Large Keys with Low Read Frequency – When reads are mostly range scans over recent data, the extra read amplification is cheap.
  • Limited Write Budget – If SSD endurance is a primary concern, tiered’s lower write amplification reduces wear.

Common Pitfalls

PitfallObservationRemedy
Excessive Read AmplificationRange scans touch many files, leading to higher latency.Tune max_size_amplification_percent down (e.g., 150) or enable bottommost level compression (setCompressLevel(1)).
Compaction StormSudden burst of writes triggers many overlapping merges.Set max_background_compactions higher, or enable rate limiting (setCompactionThreadLimiter).
Space Blow‑UpTier grows beyond expected size due to large target_file_size_base.Reduce target_file_size_base to 32 MiB, ensuring more granular merges.

Patterns in Production

Real‑world systems rarely pick a strategy wholesale; they layer hybrid patterns to meet mixed SLAs.

1. Hot‑Cold Separation

  • Hot tier – Use tiered compaction for recent writes (last few hours).
  • Cold tier – Periodically trigger a background snapshot that copies hot data into a leveled DB for long‑term low‑latency reads.

Implementation sketch (pseudo‑bash):

# Step 1: Run tiered DB for ingestion
rocksdb-cli --db /data/hot --compaction_style universal &

# Step 2: Every 12h, snapshot hot DB into cold DB
rsync -a /data/hot/ /data/snapshot/
# Load snapshot into leveled instance
rocksdb-cli --db /data/cold --compaction_style level --load_snapshot /data/snapshot/

2. Multi‑Tenant Sharding

When serving many tenants, allocate a separate column family per tenant, each with its own compaction style:

ColumnFamilyOptions cfOpts = new ColumnFamilyOptions();
cfOpts.setCompactionStyle(CompactionStyle.LEVEL); // latency‑critical tenants
cfOpts.setCompactionStyle(CompactionStyle.UNIVERSAL); // bulk‑ingest tenants

3. Adaptive Compaction Switching

Some clouds expose metrics (rocksdb.num-files-at-level<N>, rocksdb.bytes-written) via Prometheus. A controller can flip options.setCompactionStyle at runtime based on thresholds (e.g., write QPS > 500k → switch to tiered). While RocksDB doesn’t support live style changes, you can re‑open the DB with new options without downtime using a rolling restart.

Performance Benchmarks

Below is a condensed benchmark from a production‑grade 8‑core Xeon, 2 TB NVMe SSD, using the YCSB workload mix:

WorkloadStrategyAvg Write Latency (ms)Avg Read Latency (ms)Write AmplificationSpace Amplification
YCSB‑A (50% reads, 50% writes)Leveled1.80.77.2×2.6×
YCSB‑ATiered1.21.42.5×1.8×
YCSB‑B (95% reads)Leveled2.00.56.8×2.5×
YCSB‑BTiered1.30.92.7×2.0×
YCSB‑C (read‑only)Leveled0.42.4×
YCSB‑CTiered0.61.9×

Key observations:

  • Tiered consistently reduces write latency and amplification, making it ideal for ingestion spikes.
  • Leveled shines on read‑heavy workloads, delivering sub‑500 µs point reads.
  • Space amplification differences are modest; both stay under 3× live data.

Key Takeaways

  • Leveled compaction offers tighter read latency and predictable storage growth at the cost of higher write amplification.
  • Tiered (Universal) compaction minimizes write amplification and SSD wear, tolerating higher read amplification—perfect for hot ingestion pipelines.
  • Choose leveled for latency‑sensitive services (e.g., order‑matching engines) and tiered for write‑heavy streams (e.g., event logging, metric collection).
  • Hybrid patterns—hot‑cold separation, per‑tenant column families, and adaptive switching—let you meet mixed SLAs without sacrificing durability.
  • Tune core knobs (target_file_size_base, max_bytes_for_level_multiplier, max_size_amplification_percent) based on your SSD IOPS budget and key distribution.
  • Monitor RocksDB’s built‑in metrics (rocksdb.bytes-written, rocksdb.num-files-at-level<N>) to detect compaction stalls early and adjust thresholds proactively.

Further Reading