Build an LSM-tree storage engine (LevelDB / RocksDB style) (11 scenes)
Scene 10 · Blocks, cache, and the CPU/disk dial
SSTables are blocks; the block cache holds hot decompressed blocks; compression trades CPU for disk bytes with no effect on correctness.
Previously
Reads, writes, deletes, and bounded levels are all sorted out. But every read so far has touched a whole SSTable. Real systems read BLOCKS, cache them, and squeeze them with compression.
Scene 10
Blocks, cache, and the CPU/disk dial
Diagram
The selected SSTable (scene 3) on the shelf zooms into a row of BLOCKS — fixed-size chunks (default 4 KB) that are the unit of disk I/O for an LSM, NOT individual keys. A tiny block-index above the strip lists the first key of each block → its byte offset, so the read path binary-searches the index and reads ONE block per get. In the RAM column, the new term lands: a BLOCK CACHE — an in-RAM map holding recently-read DECOMPRESSED blocks, so a hot working set never touches disk again after the first read. A 'recent hits/misses' strip shows cache effectiveness; a hit-rate readout updates live. Compression knob (none / lz4 / zstd) trades CPU for disk bytes per block.
Watch one GET land on one block (not the whole file), and a second GET in the same block hit the cache. Hot reads stay in RAM.
Implementation
Block-aware read
block index → 1 block → cache
1def read_value(sst, key):2 block_offset = sst.block_index.lookup(key) # binary search in RAM3 if (sst.id, block_offset) in block_cache:4 block = block_cache.get(sst.id, block_offset) # RAM hit5 else:6 raw = disk.pread(sst.id, block_offset, block_size)7 block = decompress(raw) # CPU work8 block_cache.put(sst.id, block_offset, block)9 return scan_block(block, key)
Compression: pure CPU↔disk
correctness unchanged; bytes and CPU move
1# compression presets:2# none → 1.0× disk, 1.0× CPU (fastest reads, biggest disk)3# lz4 → 0.55× disk, 1.4× CPU (LSM default)4# zstd → 0.40× disk, 2.6× CPU (best ratio, slowest decode)5# the choice is a workload knob, not a correctness knob.