Blocks, cache, and the CPU/disk dial

Build an LSM-tree storage engine (LevelDB / RocksDB style) (11 scenes)

Scene 10 · Blocks, cache, and the CPU/disk dial

SSTables are blocks; the block cache holds hot decompressed blocks; compression trades CPU for disk bytes with no effect on correctness.

Previously

Reads, writes, deletes, and bounded levels are all sorted out. But every read so far has touched a whole SSTable. Real systems read BLOCKS, cache them, and squeeze them with compression.

Scene 10

Blocks, cache, and the CPU/disk dial

Watch

Diagram

The selected SSTable (scene 3) on the shelf zooms into a row of BLOCKS — fixed-size chunks (default 4 KB) that are the unit of disk I/O for an LSM, NOT individual keys. A tiny block-index above the strip lists the first key of each block → its byte offset, so the read path binary-searches the index and reads ONE block per get. In the RAM column, the new term lands: a BLOCK CACHE — an in-RAM map holding recently-read DECOMPRESSED blocks, so a hot working set never touches disk again after the first read. A 'recent hits/misses' strip shows cache effectiveness; a hit-rate readout updates live. Compression knob (none / lz4 / zstd) trades CPU for disk bytes per block.

Sources

Watch one GET land on one block (not the whole file), and a second GET in the same block hit the cache. Hot reads stay in RAM.

Implementation

Block-aware read

block index → 1 block → cache

1def read_value(sst, key):
2    block_offset = sst.block_index.lookup(key)  # binary search in RAM
3    if (sst.id, block_offset) in block_cache:
4        block = block_cache.get(sst.id, block_offset)  # RAM hit
5    else:
6        raw = disk.pread(sst.id, block_offset, block_size)
7        block = decompress(raw)                  # CPU work
8        block_cache.put(sst.id, block_offset, block)
9    return scan_block(block, key)

Compression: pure CPU↔disk

correctness unchanged; bytes and CPU move

1# compression presets:
2#   none → 1.0× disk, 1.0× CPU   (fastest reads, biggest disk)
3#   lz4  → 0.55× disk, 1.4× CPU  (LSM default)
4#   zstd → 0.40× disk, 2.6× CPU  (best ratio, slowest decode)
5# the choice is a workload knob, not a correctness knob.

PreviousTombstones — deletes that can come back NextDesign your LSM — and feel its trades