Back to catalog
Build an LSM-tree storage engine (LevelDB / RocksDB style)
11 scenes · ~77 min · build the primitive
Build your own Build an LSM-tree storage engine (LevelDB / RocksDB style)
The simplest possible storage engine that gives you BOTH ordered reads AND more keys than fit in RAM, by accepting a deal: write to RAM at memory speed, log to disk for safety, then merge sorted files in the background forever.
- 01Why Bitcask cannot give you rangesBitcask died on cardinality, the B-tree died on write rate. The brief LSM is built to satisfy: ordered reads, more keys than fit in RAM, no seek per write.~7 min
- 02Sort in RAM, log to diskEvery put fans out: a sorted in-RAM memtable AND an append-only WAL whose only job is to rebuild the memtable on a crash.~7 min
- 03Flush — pay the IOU onceFrozen memtable streams sequentially to disk as a single immutable sorted file (an SSTable); the WAL is then deleted.~7 min
- 04Newest wins — and the reads get slowerA get walks memtable → SSTables newest-first; first hit wins. Cost grows linearly with file count, especially for misses.~7 min
- 05Definitely-not-here, in one bitmapEach SSTable carries a bloom filter that says 'definitely no' or 'maybe' — eliminating disk reads on misses, not on hits.~7 min
- 06Compaction — the GC of sorted filesA background process merges N immutable SSTables into 1, drops shadowed/deleted records, atomically swaps files. Foreground writes never block.~7 min
- 07Levels — a staircase that bounds readsL0 may overlap; L1+ enforce non-overlap per level; each level is ~10× the size of the one above. A get touches at most one file per tier.~7 min
- 08Every byte, written ten timesLeveled compaction pays 10–30× write amplification; tiered/universal trades that down to 3–9× at the cost of read-amp and space-amp.~7 min
- 09Tombstones — deletes that can come backDeletes are records. Drop a tombstone before all older live records are gone, and the deleted key resurrects. Same trap as Bitcask, restated for levels.~7 min
- 10Blocks, cache, and the CPU/disk dialSSTables are blocks; the block cache holds hot decompressed blocks; compression trades CPU for disk bytes with no effect on correctness.~7 min
- 11Design your LSM — and feel its tradesCapstone: pick a workload, set knobs, watch the verifier trace each choice back to the scene that taught it. The amp triangle is the load-bearing trade.~7 min