TL;DR — A B‑Tree keeps data in fixed‑size pages on disk, cached in a buffer pool, which gives predictable read latency but costly random in‑place writes. An LSM Tree batches writes in memory and flushes them sequentially to immutable files, a write‑friendly I/O pattern that costs extra RAM for MemTables, Bloom filters, and compaction buffers.
Modern storage engines rarely rely on a single indexing structure. Instead, they pick the data structure that best matches their workload, hardware, and operational constraints. Two of the most common choices are the classic B‑Tree and the log‑structured merge (LSM) Tree. While both aim to provide ordered key‑value access, they make very different compromises in how they allocate, use, and reclaim memory. Understanding those compromises is essential for anyone who designs, tunes, or operates a database system.
## Foundations of B‑Tree Memory Management
### Node layout and in‑memory caching
A B‑Tree stores keys and pointers in pages (or nodes) that are typically the same size as the underlying storage block—commonly 4 KB or 8 KB. Each page is loaded into the buffer pool (a region of RAM managed by the engine) before it can be examined or modified. The buffer pool acts as a cache; the most frequently accessed pages stay resident, while cold pages are evicted according to a replacement policy such as LRU or CLOCK.
Because B‑Tree pages are fixed‑size, the memory footprint of the buffer pool is easy to predict:
buffer_pool_size = number_of_pages * page_size
If a database allocates a 2 GB buffer pool with 8 KB pages, it can hold roughly 262,000 pages. The engine can therefore put a hard cap on the memory dedicated to index caching, leaving the rest of the process space for other purposes (e.g., query execution buffers).
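To make the eviction behavior concrete, here is a minimal buffer‑pool sketch with LRU eviction (the class and the read_from_disk callback are illustrative, not any particular engine's API):

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool: a fixed number of fixed-size pages with LRU eviction."""

    def __init__(self, pool_bytes, page_bytes=8192):
        self.capacity = pool_bytes // page_bytes   # 2 GB / 8 KB ~= 262,000 pages
        self.pages = OrderedDict()                 # page_id -> bytes, in LRU order

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)        # hit: mark most recently used
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)         # full: evict least recently used
        page = read_from_disk(page_id)             # miss: exactly one disk read
        self.pages[page_id] = page
        return page

pool = BufferPool(pool_bytes=2 * 1024**3)
page = pool.get(42, read_from_disk=lambda pid: b"\x00" * 8192)
```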
### Write path and page splits
When inserting a new key, the engine first fetches the leaf page that will contain the key. If that page has free slots, the key is placed directly. If the page is full, a split occurs: the page is divided into two, and a new entry is propagated up the tree. Splits are expensive because they involve:
- Reading the full page into memory (if not already cached).
- Allocating a new page in the buffer pool.
- Writing both the original and the new page back to disk.
The memory cost of a split is transient—a few extra pages occupy the buffer pool for the duration of the operation. However, a high insert rate can lead to frequent splits, causing the buffer pool to churn and reducing cache hit ratios for read‑heavy workloads.
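The split mechanics are easiest to see in code. Here is a minimal sketch of the leaf‑level insert path (the four‑slot page and the function name are illustrative; real pages hold hundreds of keys):

```python
import bisect

MAX_SLOTS = 4  # real 8 KB pages hold hundreds of keys

def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf page; split if the page overflows.

    Returns (new_right_page, separator_key), or (None, None) if no split.
    """
    bisect.insort(leaf, key)
    if len(leaf) <= MAX_SLOTS:
        return None, None           # fast path: a free slot, no split
    mid = len(leaf) // 2
    right = leaf[mid:]              # transient cost: a second page in the pool
    del leaf[mid:]                  # original page keeps the lower half
    return right, right[0]          # separator key propagates to the parent

leaf = [10, 20, 30, 40]
right, sep = insert_into_leaf(leaf, 25)
print(leaf, right, sep)             # [10, 20] [25, 30, 40] 25
```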
## LSM Tree Memory Management Basics
### MemTable, immutable segments, and Bloom filters
An LSM Tree separates writes from reads by first inserting new entries into an in‑memory structure called a MemTable. Most implementations use a sorted data structure such as a skip list or a balanced tree (often a B‑Tree variant). The MemTable resides entirely in RAM until it reaches a configurable size (e.g., 64 MB). At that point it is flushed to disk as an immutable sorted string table (SSTable).
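A toy version of that write path might look as follows (the MemTable class, the JSON‑lines file format, and the threshold handling are simplifications for illustration; real engines keep entries sorted as they arrive and write a binary SSTable format):

```python
import json

MEMTABLE_LIMIT = 64 * 1024 * 1024  # flush threshold, e.g. 64 MB

class MemTable:
    """Toy MemTable: buffer writes in RAM, flush as a sorted immutable file."""

    def __init__(self):
        self.data, self.bytes_used = {}, 0

    def put(self, key, value):
        self.bytes_used += len(key) + len(value)
        self.data[key] = value                     # last write wins in memory
        return self.bytes_used >= MEMTABLE_LIMIT   # caller flushes when full

    def flush(self, path):
        with open(path, "w") as f:
            for key in sorted(self.data):          # SSTables are key-ordered
                f.write(json.dumps([key, self.data[key]]) + "\n")

mt = MemTable()
if mt.put("user:42", "alice"):
    mt.flush("sstable-000001.jsonl")
```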
Each SSTable is accompanied by a Bloom filter—a compact probabilistic data structure that lives in memory and quickly tells the engine whether a key might be present in the file. Bloom filters dramatically reduce unnecessary disk reads, but they consume extra RAM:
bloom_memory ≈ (bits_per_key * number_of_keys) / 8
If an LSM system stores 1 billion keys with 10 bits per key, the Bloom filter alone requires ~1.25 GB of RAM.
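The same arithmetic also predicts the false‑positive rate of an optimally configured filter; a small helper (the name is illustrative) makes the RAM/accuracy trade‑off explicit:

```python
import math

def bloom_sizing(num_keys, bits_per_key):
    """Memory cost and approximate false-positive rate of a Bloom filter."""
    mem_bytes = num_keys * bits_per_key / 8
    k = max(1, round(bits_per_key * math.log(2)))   # optimal number of hashes
    fpr = (1 - math.exp(-k / bits_per_key)) ** k    # standard approximation
    return mem_bytes, k, fpr

mem, k, fpr = bloom_sizing(num_keys=1_000_000_000, bits_per_key=10)
print(f"{mem / 1e9:.2f} GB, {k} hash functions, ~{fpr:.2%} false positives")
# -> 1.25 GB, 7 hash functions, ~0.82% false positives
```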
### Compaction and write amplification
Over time, the LSM hierarchy accumulates many SSTables across multiple levels. To keep read latency bounded, the engine periodically compacts overlapping files: it merges them, discards deleted or overwritten keys, and writes a new, larger SSTable. Compaction is I/O‑intensive and also memory‑intensive because the engine must buffer portions of each input file while performing the merge.
During a compaction, memory usage spikes:
- Input buffers for each participating SSTable (often 1 MB per file).
- Output buffer for the new SSTable.
- Temporary structures for duplicate elimination.
If a level contains dozens of files, the cumulative input buffer size can exceed hundreds of megabytes. Consequently, LSM engines typically allocate a separate compaction memory budget that is added on top of the MemTable and Bloom filter allocations.
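The merge itself is a streaming k‑way merge. A sketch (run handling and tombstones simplified for illustration) shows where the per‑file buffers come from:

```python
import heapq

TOMBSTONE = None  # deletion marker written by deletes

def _tag(run, age):
    # Tag each entry with its run's age so equal keys sort newest-first.
    for key, value in run:
        yield (key, age), value

def compact(runs):
    """Merge key-sorted runs (ordered newest-first) into one run.

    heapq.merge streams its inputs, holding roughly one buffered entry per
    run: these are the per-SSTable input buffers described above. Only the
    newest version of each key survives; tombstones are dropped.
    """
    prev = object()
    for (key, _age), value in heapq.merge(*(_tag(r, a) for a, r in enumerate(runs))):
        if key == prev:
            continue                   # older, overwritten version: discard
        prev = key
        if value is not TOMBSTONE:
            yield key, value

newer = [("a", "2"), ("c", TOMBSTONE)]
older = [("a", "1"), ("b", "9"), ("c", "8")]
print(list(compact([newer, older])))   # [('a', '2'), ('b', '9')]
```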
## Comparative Analysis
### Read latency and cache efficiency
| Aspect | B‑Tree | LSM Tree |
|---|---|---|
| Primary cache | Buffer pool holds leaf and interior pages; a cache hit directly yields the record. | RAM holds the active MemTable(s), the Bloom filters, and a block cache for hot SSTable blocks. |
| Cache predictability | Fixed‑size pages → easy to size the pool. | Bloom filters are fixed, but the number of SSTable blocks in cache depends on read patterns and compaction activity. |
| Worst‑case read path | May traverse ~log_B(N) pages, where B is the fanout, each potentially a cache miss. | May need to probe multiple Bloom filters and read several SSTable blocks if the key resides in an older level. |
In practice, B‑Trees often provide lower average read latency for point lookups because interior pages are almost always cached and a single leaf‑page read usually suffices. LSM Trees shine when the Bloom filters eliminate most disk accesses; however, false positives force extra reads, and range scans must touch many SSTables, increasing latency.
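A back‑of‑envelope comparison of the two worst‑case read paths (the fanout and level ratio below are illustrative assumptions):

```python
import math

N = 1_000_000_000                      # one billion keys

# B-Tree: height = ceil(log_B N); B ~ 300 separators per 8 KB page.
btree_pages = math.ceil(math.log(N, 300))

# LSM: one MemTable check plus up to one probe per level;
# levels = ceil(log_ratio(total_keys / memtable_keys)) with a 10x ratio.
lsm_levels = math.ceil(math.log(N / 1_000_000, 10))

print(f"B-Tree worst case: {btree_pages} page reads")   # 4
print(f"LSM worst case: {lsm_levels} levels to probe")  # 3
```

The counts look similar, but the costs differ: the B‑Tree's interior pages are almost always cached, while each LSM level probe may involve a Bloom‑filter check plus a block read.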
### Write amplification and memory overhead
| Metric | B‑Tree | LSM Tree |
|---|---|---|
| Write amplification | Typically 1–2× (page write + possible split). | Typically 5–10× (MemTable flush + multiple compaction passes). |
| Memory for writes | Buffer pool page for the leaf plus a new page on split. | MemTable (configurable) + Bloom filter growth + compaction buffers. |
| Impact of larger memory | Larger buffer pool reduces page evictions, improving read performance but does not affect write amplification. | More MemTable space reduces flush frequency, lowering write amplification; larger Bloom filters cut false positives, improving reads. |
Thus, if an operator can afford a generous RAM budget, an LSM engine can be tuned to behave almost like a B‑Tree for writes (by increasing MemTable size). Conversely, a B‑Tree’s write amplification is largely invariant to RAM size.
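The interaction is easy to quantify with rough arithmetic (the write rate, data size, and 10× level ratio below are illustrative assumptions):

```python
import math

def lsm_shape(write_mb_per_s, memtable_mb, total_data_gb, level_ratio=10):
    """How MemTable size shapes flush frequency and tree depth (rough model)."""
    flush_interval_s = memtable_mb / write_mb_per_s
    levels = math.ceil(math.log(total_data_gb * 1024 / memtable_mb, level_ratio))
    return flush_interval_s, levels

for memtable_mb in (64, 256, 1024):
    secs, levels = lsm_shape(write_mb_per_s=20, memtable_mb=memtable_mb,
                             total_data_gb=500)
    print(f"{memtable_mb:>5} MB MemTable -> flush every {secs:6.1f} s, {levels} levels")
```

Fewer flushes and fewer levels mean each byte is rewritten by compaction fewer times, which is exactly the write‑amplification lever the table above describes.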
### Impact on SSD vs. HDD storage
- SSD: Random reads are cheap, so B‑Tree’s page‑level random I/O is less penalizing. However, SSDs also benefit from sequential writes; LSM’s write‑once pattern aligns well with SSD endurance models.
- HDD: Random reads are expensive; LSM’s sequential flushes and compactions are favorable. Yet, compaction churn can cause additional seeks, partially offsetting the advantage.
A hybrid approach, choosing the structure per table or per workload, appears in practice: MariaDB and Percona Server, for example, ship the LSM‑based MyRocks engine alongside InnoDB's B‑Trees in the same server, illustrating that the trade‑off is not binary.
## Operational Considerations
### Tuning parameters (cache size, block size, level size)
| Parameter | B‑Tree effect | LSM effect |
|---|---|---|
| shared_buffers (PostgreSQL) | Directly expands the page cache, improving hit rate. | N/A; stock PostgreSQL has no LSM storage engine. |
| innodb_buffer_pool_size (MySQL) | Same effect for InnoDB's B‑Tree indexes. | No effect on MyRocks, the LSM engine available in MySQL forks, which uses RocksDB's own caches. |
| write_buffer_size (RocksDB) | N/A | Sets the MemTable size; larger values reduce flush frequency but increase memory pressure. |
| target_file_size_base (RocksDB) | N/A | Controls SSTable size; larger files mean fewer files per level and less per‑file metadata overhead. |
| block_cache_size (RocksDB) | N/A | Allocates memory for caching SSTable blocks; sizing it appropriately is crucial for read latency. |
| bloom_filter_bits_per_key | N/A | Higher bits per key reduce false positives but increase RAM usage. |
Tuning is iterative: increase a memory budget, observe cache hit ratios (e.g., via pg_statio_user_indexes or RocksDB’s stats_dump), then adjust. Over‑allocating memory to Bloom filters can starve the block cache, causing more disk reads despite lower false positive rates.
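For RocksDB, these knobs map onto the engine's options. A sketch using the python-rocksdb bindings (the values are illustrative starting points, not tuning recommendations):

```python
import rocksdb  # python-rocksdb; the C++ API exposes the same options

opts = rocksdb.Options()
opts.create_if_missing = True
opts.write_buffer_size = 256 * 1024 * 1024         # MemTable: fewer flushes
opts.target_file_size_base = 64 * 1024 * 1024      # SSTable size at the base level
opts.table_factory = rocksdb.BlockBasedTableFactory(
    filter_policy=rocksdb.BloomFilterPolicy(10),   # bits/key: RAM vs. false positives
    block_cache=rocksdb.LRUCache(512 * 1024 * 1024))  # cache for hot SSTable blocks

db = rocksdb.DB("example.db", opts)
```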
### Monitoring memory pressure
Real‑time metrics help avoid out‑of‑memory (OOM) crashes, especially during large compactions. Example Bash snippet that watches RocksDB memory usage:
```bash
#!/usr/bin/env bash
# Poll RocksDB memory metrics every 5 s
# (assumes the process exposes metrics over HTTP; metric names are illustrative).
while true; do
  curl -s http://localhost:8080/metrics \
    | grep -E 'rocksdb_(memtable|block_cache|bloom_filter)_bytes'
  sleep 5
done
```
For PostgreSQL B‑Tree buffers:
```sql
SELECT
  sum(idx_blks_hit)  AS cache_hits,
  sum(idx_blks_read) AS disk_reads,
  sum(idx_blks_hit)::float
    / nullif(sum(idx_blks_hit) + sum(idx_blks_read), 0) AS hit_ratio
FROM pg_statio_user_indexes;
```
When the hit ratio drops below ~90 % on a B‑Tree‑heavy workload, consider expanding shared_buffers. For LSM, a rising memtable_flush_pending or compaction_pending metric indicates that the current memory allocation cannot keep up with write volume.
## Key Takeaways
- Predictable caching: B‑Trees use a fixed‑size page cache that is easy to size; LSM Trees rely on a mix of MemTable, Bloom filters, and block cache, making memory budgeting more complex.
- Write amplification: B‑Trees incur modest amplification (1–2×) regardless of RAM; LSM Trees can amplify writes 5–10×, but larger MemTables and smarter compaction can mitigate the effect.
- Read patterns: Point lookups benefit from B‑Tree’s single‑page access; LSM Trees shine when Bloom filters eliminate most disk reads, especially on SSDs.
- Hardware alignment: HDD‑heavy environments often favor LSM due to sequential writes, while SSDs can accommodate either structure with proper tuning.
- Operational tuning: Adjust buffer pool sizes for B‑Trees, and balance MemTable size, Bloom filter bits, and block cache for LSM Trees; always monitor hit ratios and compaction queues.
- Hybrid solutions: Some modern systems combine both structures to exploit the strengths of each, suggesting that the “B‑Tree vs. LSM” decision is increasingly nuanced.