TL;DR: Leveled compaction gives predictable read latency by keeping each level size‑bounded, while tiered compaction maximizes write throughput at the cost of higher read amplification. Choose leveled for latency‑sensitive workloads (e.g., user‑facing services) and tiered for ingestion‑heavy pipelines (e.g., log aggregation), then fine‑tune the relevant knobs (target_file_size_base, max_bytes_for_level_base, and level0_file_num_compaction_trigger) to hit your SLA.
RocksDB powers many high‑scale systems, from Facebook’s social graph to the state stores behind Kafka Streams. Its performance hinges on how it reorganizes immutable SST files on disk, a process called compaction. While the default “leveled” mode works well for many OLTP workloads, the “tiered” mode (also known as “universal”) shines when write volume overwhelms the system. This article dissects both strategies, walks through the internal data paths, and provides production‑ready tuning patterns you can apply today.
1. RocksDB Architecture Primer
Before diving into compaction, it helps to visualize the storage stack:
- MemTable – an in‑memory sorted map (usually a skiplist) that receives writes.
- Write‑Ahead Log (WAL) – an append‑only file ensuring durability.
- SSTables – immutable, sorted string tables flushed from the MemTable.
- Levels / Tiers – logical groupings of SSTables that dictate when and how files are merged.
When the MemTable fills, RocksDB writes a new SST file to Level‑0. From there, compaction policies decide when to merge files into deeper levels (or tiers) to reclaim space, eliminate duplicate keys, and keep read paths short.
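The flush path described above can be modeled with a toy sketch. This is not RocksDB code: `ToyLsm`, `Put`, and the entry-count limit are hypothetical names chosen for illustration, and a real MemTable flushes based on its size in bytes rather than on an entry count.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model of the MemTable -> Level-0 flush path (illustrative only).
// A flushed file is a sorted, immutable list of key/value pairs.
using SstFile = std::vector<std::pair<std::string, std::string>>;

class ToyLsm {
 public:
  explicit ToyLsm(std::size_t memtable_limit) : memtable_limit_(memtable_limit) {
    assert(memtable_limit > 0);
  }

  // Insert or overwrite a key; flush to a new L0 file once the MemTable is full.
  void Put(const std::string& key, const std::string& value) {
    memtable_[key] = value;
    if (memtable_.size() >= memtable_limit_) Flush();
  }

  std::size_t NumLevel0Files() const { return level0_.size(); }
  std::size_t MemTableEntries() const { return memtable_.size(); }

 private:
  // std::map iterates in key order, so the flushed file is already sorted.
  void Flush() {
    level0_.emplace_back(memtable_.begin(), memtable_.end());
    memtable_.clear();
  }

  std::size_t memtable_limit_;
  std::map<std::string, std::string> memtable_;
  std::vector<SstFile> level0_;
};
```

Because each flush produces an independent file, two L0 files in this model can hold the same key, which is exactly the overlap that compaction later has to resolve.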
1.1 Why Compaction Matters
- Read Amplification – the number of SST files a read must scan. Higher amplification means more disk I/O and higher latency.
- Write Amplification – the total amount of data rewritten during compactions. Excessive rewrite can saturate I/O and increase SSD wear.
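- Space Amplification – the ratio of bytes on disk to bytes of live data. Obsolete versions and tombstones that have not yet been compacted away, plus the temporary copies held while files are being merged, all count toward it.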
Balancing these three amplifications is the essence of any compaction strategy.
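Two of the three amplifications reduce to simple ratios, and writing them down makes it easier to compare measurements across configurations. The sketch below is a minimal illustration; the function and parameter names are hypothetical, and the byte counts are assumed to come from your own instrumentation.

```cpp
#include <cassert>
#include <cstdint>

// Bytes written to storage (flushes plus compactions) per byte written by the application.
double WriteAmplification(std::uint64_t bytes_written_to_storage,
                          std::uint64_t bytes_written_by_user) {
  assert(bytes_written_by_user > 0);
  return static_cast<double>(bytes_written_to_storage) /
         static_cast<double>(bytes_written_by_user);
}

// Bytes occupied on disk per byte of live (latest-version) data.
double SpaceAmplification(std::uint64_t bytes_on_disk, std::uint64_t live_data_bytes) {
  assert(live_data_bytes > 0);
  return static_cast<double>(bytes_on_disk) / static_cast<double>(live_data_bytes);
}
```

Read amplification is a count rather than a ratio: the number of files a lookup has to consult.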
2. Leveled Compaction (LC)
Leveled compaction organizes data into a series of levels (L0, L1, …, LN) where each level after L0 has a strict size bound, typically 10× the size of the previous level. Files within L1 and deeper levels never overlap; they cover disjoint key ranges. This property lets a point read touch at most one file per level below L0, yielding predictable latency.
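To make the size bounds concrete, the following sketch computes the target size of each level and the number of levels a data set needs. It assumes static level targets (that is, `level_compaction_dynamic_level_bytes` disabled) and uses an integer multiplier for simplicity; `LevelTargetBytes` and `LevelsNeeded` are hypothetical helper names, not RocksDB functions.

```cpp
#include <cassert>
#include <cstdint>

// Target size of level `level` (>= 1): base_bytes * multiplier^(level - 1).
// base_bytes plays the role of max_bytes_for_level_base, multiplier that of
// max_bytes_for_level_multiplier.
std::uint64_t LevelTargetBytes(int level, std::uint64_t base_bytes,
                               std::uint64_t multiplier) {
  assert(level >= 1);
  std::uint64_t target = base_bytes;
  for (int i = 1; i < level; ++i) target *= multiplier;
  return target;
}

// Smallest N such that the targets of L1..LN together can hold data_bytes.
int LevelsNeeded(std::uint64_t data_bytes, std::uint64_t base_bytes,
                 std::uint64_t multiplier) {
  assert(base_bytes > 0 && multiplier > 1);
  int levels = 1;
  std::uint64_t target = base_bytes;
  std::uint64_t capacity = base_bytes;
  while (capacity < data_bytes) {
    target *= multiplier;
    capacity += target;
    ++levels;
  }
  return levels;
}
```

With a 256 MiB base and a multiplier of 10, the targets are 256 MiB, 2.5 GiB, 25 GiB, and 250 GiB, so a 100 GiB data set fits within four levels below L0.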
2.1 How Leveled Works
- L0 Overlap – L0 files can overlap arbitrarily. When the number of L0 files reaches level0_file_num_compaction_trigger (default 4), RocksDB compacts them into L1 together with the L1 files they overlap.
- Size‑Bounded Levels – L1 is limited to max_bytes_for_level_base (default 256 MiB). If L1 exceeds this, RocksDB picks an L1 file and merges it with the overlapping L2 files, writing the result into L2. L2 has its own bound (max_bytes_for_level_multiplier * max_bytes_for_level_base), and the same rule repeats for deeper levels.
- Compaction Trigger – Output files are cut at a target size of target_file_size_base, scaled per level by target_file_size_multiplier (default 1). When a level’s total size exceeds its bound, a compaction is scheduled.
The result is a log‑structured merge tree (LSM) where each key lives in at most one file per level below L0, keeping read amplification close to the number of non‑empty levels plus the number of L0 files.
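That bound can be turned into a small estimator. The sketch below is a simplification that ignores MemTables and Bloom filters (which usually let a read skip most candidate files); `WorstCasePointReadFiles` is a hypothetical helper, and its input is assumed to be the per-level file counts, for example as reported by rocksdb.num-files-at-level<N>.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// files_per_level[0] is the number of L0 files; files_per_level[i] is the
// file count at Li. Returns the worst-case number of files a point read checks.
std::size_t WorstCasePointReadFiles(const std::vector<std::size_t>& files_per_level) {
  assert(!files_per_level.empty());
  std::size_t files = files_per_level[0];  // every L0 file may contain the key
  for (std::size_t level = 1; level < files_per_level.size(); ++level) {
    // Key ranges are disjoint below L0, so at most one file per level qualifies.
    if (files_per_level[level] > 0) ++files;
  }
  return files;
}
```

The estimate shows why L0 dominates: deeper levels contribute at most one file each no matter how large they grow, while every extra L0 file adds one more lookup.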
2.2 Tuning Leveled for Low Latency
| Parameter | Typical Range | Effect |
|---|---|---|
| target_file_size_base | 4 MiB – 64 MiB | Smaller files make each compaction finer‑grained but increase the file count and manifest overhead. |
| max_bytes_for_level_base | 64 MiB – 512 MiB | Sets the L1 bound and, through the multiplier, every deeper bound. Shrinking it tightens level sizes at the cost of more frequent compactions and possibly an extra level for the same data volume. |
| level0_file_num_compaction_trigger | 2 – 8 | Lower values trigger earlier compactions, reducing L0 read spikes. |
| soft_pending_compaction_bytes_limit / hard_pending_compaction_bytes_limit | 1 GiB / 2 GiB (example) | Writes are slowed at the soft limit and stopped at the hard limit, which keeps the compaction backlog from growing without bound. |
Example – A micro‑service handling 10 k ops/s with 95th‑percentile read latency < 5 ms:
target_file_size_base: 8388608 # 8 MiB
max_bytes_for_level_base: 134217728 # 128 MiB
level0_file_num_compaction_trigger: 3
soft_pending_compaction_bytes_limit: 1073741824 # 1 GiB
hard_pending_compaction_bytes_limit: 2147483648 # 2 GiB
These settings keep L1–L3 small enough that a point read touches at most one file in each of them, plus any files still waiting in L0, while the background compaction threads (usually 2–4) can keep up with the write rate.
2.3 Failure Modes to Watch
- L0 Flood – If write bursts exceed compaction throughput, L0 can accumulate dozens of files, causing read latency spikes. Mitigate by increasing max_background_compactions (superseded by max_background_jobs in recent releases) or lowering level0_file_num_compaction_trigger.
- Write Stalls – When the total pending compaction bytes hit the hard limit, RocksDB blocks writes. Adjust the limits or provision faster storage (e.g., NVMe) to avoid stalls.
- Space Blow‑up – Aggressive target_file_size_base with a small max_bytes_for_level_base can cause temporary space usage > 2× the data set during massive compactions.
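The stall conditions above can be expressed as a small classifier, which is handy for dashboards or pre-deployment sanity checks. This is a simplified sketch, not RocksDB's internal logic: it ignores stalls caused by too many MemTables, and the exact comparisons are an assumption. `level0_slowdown_writes_trigger` and `level0_stop_writes_trigger` are RocksDB options that are not discussed in this article; `ClassifyWriteState` and `StallLimits` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

enum class WriteState { kNormal, kDelayed, kStopped };

// Thresholds mirroring the RocksDB options of the same names.
struct StallLimits {
  int level0_slowdown_writes_trigger;
  int level0_stop_writes_trigger;
  std::uint64_t soft_pending_compaction_bytes_limit;
  std::uint64_t hard_pending_compaction_bytes_limit;
};

// Approximates whether writes proceed, are throttled, or are blocked.
WriteState ClassifyWriteState(int level0_files, std::uint64_t pending_compaction_bytes,
                              const StallLimits& limits) {
  assert(limits.level0_slowdown_writes_trigger <= limits.level0_stop_writes_trigger);
  assert(limits.soft_pending_compaction_bytes_limit <=
         limits.hard_pending_compaction_bytes_limit);
  if (level0_files >= limits.level0_stop_writes_trigger ||
      pending_compaction_bytes >= limits.hard_pending_compaction_bytes_limit) {
    return WriteState::kStopped;
  }
  if (level0_files >= limits.level0_slowdown_writes_trigger ||
      pending_compaction_bytes >= limits.soft_pending_compaction_bytes_limit) {
    return WriteState::kDelayed;
  }
  return WriteState::kNormal;
}
```

Feeding it the current L0 file count and rocksdb.estimate-pending-compaction-bytes gives an early warning before the hard limit is reached.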
3. Tiered (Universal) Compaction
Tiered compaction, also called universal compaction, abandons the strict size hierarchy. Instead, it groups SST files into tiers based on overlap and age. Files within a tier can overlap, and compaction merges a set of files into a larger tier when certain thresholds are met. This model is optimized for write‑heavy workloads because it reduces the number of times a key is rewritten.
3.1 How Tiered Works
- File Age Buckets – New SSTs start in Tier 0 as individual sorted runs. As they are merged with neighbouring runs of similar age, the results move to higher tiers.
- Trigger – A merge is considered once the number of sorted runs reaches level0_file_num_compaction_trigger.
- Size Ratio – compaction_options_universal.size_ratio (default 1) is a percentage, not a multiplier. Starting from the newest run, RocksDB keeps adding the next older run to the merge as long as that run is no larger than (100 + size_ratio)% of the runs already picked.
- Size Amplification – compaction_options_universal.max_size_amplification_percent (default 200) forces a full merge once the newer runs together reach that percentage of the oldest run. This also caps how many overlapping runs a read must consult.
- Parallelism – max_background_compactions still caps parallel compaction threads, but tiered compaction often requires fewer because merges are larger and less frequent.
The net effect is high write throughput (often > 100 k ops/s on a single node) with higher read amplification because overlapping files must be consulted during reads.
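The merge decisions can be sketched in a few lines. RocksDB exposes the relevant knobs as fields of `compaction_options_universal` (`size_ratio` as a percentage, `min_merge_width`, and `max_size_amplification_percent`). The code below is a simplified model of the documented rules rather than RocksDB's picker: it only considers merges that start at the newest run, and `RunSizes`, `ExceedsSizeAmplification`, and `PickBySizeRatio` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorted-run sizes in bytes, ordered newest first (index 0 is the newest run).
using RunSizes = std::vector<std::uint64_t>;

// True when the newer runs together reach max_size_amplification_percent of
// the oldest run, which calls for merging everything into one run.
bool ExceedsSizeAmplification(const RunSizes& runs,
                              unsigned max_size_amplification_percent) {
  if (runs.size() < 2) return false;
  std::uint64_t newer = 0;
  for (std::size_t i = 0; i + 1 < runs.size(); ++i) newer += runs[i];
  return newer * 100 >=
         static_cast<std::uint64_t>(max_size_amplification_percent) * runs.back();
}

// Number of newest runs to merge under the size-ratio rule, or 0 when fewer
// than min_merge_width runs qualify.
std::size_t PickBySizeRatio(const RunSizes& runs, unsigned size_ratio,
                            std::size_t min_merge_width) {
  assert(min_merge_width >= 2);
  if (runs.empty()) return 0;
  std::uint64_t accumulated = runs[0];
  std::size_t picked = 1;
  // Add the next older run while it is at most (100 + size_ratio)% of the
  // total size picked so far.
  while (picked < runs.size() &&
         runs[picked] * 100 <= accumulated * (100 + size_ratio)) {
    accumulated += runs[picked];
    ++picked;
  }
  return picked >= min_merge_width ? picked : 0;
}
```

For run sizes 1, 1, 1, 1, 100 (newest first) and a size_ratio of 1, the four small runs are merged and the large one is left alone, which is how tiered compaction avoids rewriting old data on every merge.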
3.2 Tuning Tiered for Ingestion Pipelines
| Parameter | Typical Range | Effect |
|---|---|---|
| compaction_options_universal.size_ratio | 1 – 10 (percent) | Larger values let bigger, older runs join a merge: fewer sorted runs remain, which lowers read amplification, but more data is rewritten. |
| compaction_options_universal.max_size_amplification_percent | 200 (default) | Lowering this forces full merges sooner, reducing space and read amplification at the cost of write performance. |
| max_bytes_for_level_base (unused) | N/A | Tiered ignores level size limits, simplifying configuration. |
| allow_ingest_behind | true/false | DB‑wide option that reserves the bottommost level so external SST files can be ingested beneath existing data; useful for bulk backfills. |
Example – A log‑aggregation service ingesting 200 MB/s:
compaction_options_universal.size_ratio: 5
compaction_options_universal.max_size_amplification_percent: 200 # default
allow_ingest_behind: true
max_background_compactions: 2
Here, a size_ratio of 5 lets a run up to 5% larger than the runs already picked join the same merge. max_size_amplification_percent stays at its default, so a full merge runs once the newer runs add up to twice the size of the oldest one.
3.3 Failure Modes to Watch
- Read Amplification Blow‑up – When many sorted runs accumulate, reads may need to scan dozens of overlapping SSTs. Monitor rocksdb.num-files-at-level<N> to detect this: the L0 file count plus the number of non‑empty deeper levels approximates the number of sorted runs.
- Compaction Storm – If size_ratio is too high and a sudden surge of small files arrives, the background compaction thread may become saturated, leading to a backlog. Reduce size_ratio or increase max_background_compactions.
- Space Exhaustion – Tiered compaction can temporarily allocate space equal to the sum of all pending files. Ensure the underlying volume has headroom (≥ 2× expected data size).
4. Patterns in Production
Real‑world systems rarely stick to a single compaction mode; they blend configurations to meet both latency and throughput goals.
4.1 Hybrid Approach: Leveled for Hot Keys, Tiered for Cold
- Hot Partition – Use a column family with compaction_style: kCompactionStyleLevel for user‑profile reads that demand < 2 ms latency.
- Cold Partition – Store event logs in a separate column family with compaction_style: kCompactionStyleUniversal to maximize ingestion speed.
- Cross‑CF Queries – RocksDB supports multi‑CF reads; keep the number of overlapping CFs low to avoid compounded read amplification.
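#include <rocksdb/options.h>

// Hot column family: leveled compaction for predictable point reads.
rocksdb::ColumnFamilyOptions hot_opts;
hot_opts.compaction_style = rocksdb::kCompactionStyleLevel;
hot_opts.target_file_size_base = 8 << 20; // 8 MiB
// Cold column family: universal (tiered) compaction for ingestion speed.
rocksdb::ColumnFamilyOptions cold_opts;
cold_opts.compaction_style = rocksdb::kCompactionStyleUniversal;
cold_opts.compaction_options_universal.size_ratio = 4; // percent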
4.2 Tiered as a Staging Layer
Many pipelines ingest raw events via Kafka → RocksDB → Batch Export. A common pattern:
- Ingest – Write to a tiered column family in a DB opened with allow_ingest_behind = true. Bulk files (e.g., Parquet → SST) are added without immediate compaction.
- Compact on Schedule – Trigger a manual CompactRange during off‑peak windows to merge overlapping files, reducing read amplification for downstream analytics.
- Archive – After compaction, move the SST files to cold storage (e.g., GCS) using RocksDB’s Checkpoint::ExportColumnFamily API.
# Trigger manual compaction for the "events" CF with RocksDB's ldb tool
ldb --db=/data/rocksdb --column_family=events compact
4.3 Monitoring Metrics
| Metric | Why It Matters | Typical Alert Threshold |
|---|---|---|
| rocksdb.num-files-at-level<N> | Tracks file count per level; high L0 indicates compaction lag. | L0 > 50 |
| rocksdb.compaction-pending | Flag (0 or 1) showing whether a compaction is pending. | 1 for > 30 seconds |
| rocksdb.estimate-pending-compaction-bytes | Estimated bytes compaction must rewrite to bring every level back under its target size. | > 2 GiB |
| Read amplification (derived, not a built‑in property) | Number of files a read must check. | > 8 for latency‑critical CFs |
| rocksdb.is-write-stopped | Indicates RocksDB has stopped accepting writes. | 1 for > 10 seconds |
Integrate these metrics into Prometheus and set alerts via Alertmanager. The Grafana dashboard shared by the RocksDB community (see the official repo) visualizes these metrics nicely.
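As a rough illustration of how the instantaneous thresholds in the table translate into checks, consider the sketch below. `CompactionSnapshot` and `EvaluateAlerts` are hypothetical names, the snapshot is assumed to be filled from the properties listed above, and the duration-based alerts (pending compaction, stopped writes) are left to the alerting system because they need a time window.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One sample of the compaction-related metrics for a column family.
struct CompactionSnapshot {
  int level0_files;                        // from rocksdb.num-files-at-level0
  std::uint64_t pending_compaction_bytes;  // from rocksdb.estimate-pending-compaction-bytes
  int files_per_point_read;                // derived read amplification
  bool latency_critical;                   // whether the CF serves latency-critical reads
};

// Returns a human-readable message for every threshold the snapshot exceeds.
std::vector<std::string> EvaluateAlerts(const CompactionSnapshot& s) {
  assert(s.level0_files >= 0 && s.files_per_point_read >= 0);
  std::vector<std::string> alerts;
  if (s.level0_files > 50) {
    alerts.push_back("L0 file count above 50");
  }
  if (s.pending_compaction_bytes > (2ULL << 30)) {
    alerts.push_back("pending compaction bytes above 2 GiB");
  }
  if (s.latency_critical && s.files_per_point_read > 8) {
    alerts.push_back("read amplification above 8");
  }
  return alerts;
}
```

Keeping the thresholds in one function makes it straightforward to unit-test alert rules before wiring them into Alertmanager.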
5. Architecture Considerations
When choosing a compaction strategy, think beyond the DB settings and ask:
- Hardware Profile – NVMe SSDs with high IOPS favor tiered compaction (writes dominate); SATA drives with limited IOPS benefit from leveled compaction to keep read paths short.
- Workload Mix – A 70/30 read/write split usually leans to leveled; > 80% writes pushes you toward tiered.
- SLA Priorities – If 99.9th‑percentile latency is a hard contract, enforce a tight max_bytes_for_level_base and allocate extra compaction threads.
- Multi‑Tenant Isolation – Use separate column families per tenant, each with its own compaction style, to avoid noisy neighbor effects.
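The workload and SLA rules of thumb above can be condensed into a first-pass heuristic. The sketch below is only a starting point: `SuggestCompactionStyle` is a hypothetical helper, the 30% and 80% cut-offs restate the figures from the list, and the "benchmark both" outcome for the range in between is an assumption rather than a rule from RocksDB.

```cpp
#include <cassert>

enum class CompactionChoice { kLeveled, kUniversal, kBenchmarkBoth };

// write_fraction is the share of operations that are writes, in [0, 1].
CompactionChoice SuggestCompactionStyle(double write_fraction,
                                        bool strict_tail_latency_sla) {
  assert(write_fraction >= 0.0 && write_fraction <= 1.0);
  // A hard tail-latency contract favors the bounded read path of leveled.
  if (strict_tail_latency_sla) return CompactionChoice::kLeveled;
  // More than 80% writes pushes toward tiered (universal) compaction.
  if (write_fraction > 0.8) return CompactionChoice::kUniversal;
  // A 70/30 read/write split or more read-heavy leans to leveled.
  if (write_fraction <= 0.3) return CompactionChoice::kLeveled;
  // In between, measure both styles on representative traffic.
  return CompactionChoice::kBenchmarkBoth;
}
```

Hardware and multi-tenancy are deliberately left out of the function; they are better handled by choosing a style per column family after measuring on the target storage.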
5.1 Example Deployment Diagram
+-------------------+ +-------------------+ +-------------------+
| Front‑End API | --> | RocksDB Instance | --> | Analytics Cluster|
| (Node.js/Go) | | (Leveled CF: Users| | (Batch Export) |
| | | Tiered CF: Logs) | | |
+-------------------+ +-------------------+ +-------------------+
| | |
| Write Path (WAL+Mem) | Compaction Threads |
+---------------------------+--------------------------+
The diagram highlights that read‑heavy API traffic hits the leveled column family, while write‑heavy log streams funnel into the tiered column family.
6. Key Takeaways
- Leveled compaction offers predictable read latency by bounding each level’s size; ideal for user‑facing services.
- Tiered (universal) compaction maximizes write throughput and is suited for ingestion pipelines or bulk data loads.
- Tune target_file_size_base, max_bytes_for_level_base, and level0_file_num_compaction_trigger for low‑latency scenarios; adjust compaction_options_universal.size_ratio and max_size_amplification_percent for high‑throughput pipelines.
- Monitor L0 file count, pending compaction bytes, and read‑amplification metrics to catch compaction stalls early.
- Consider a hybrid column‑family layout: hot data in leveled, cold data in tiered, and use manual compaction windows for batch cleanup.