Build an LSM-tree storage engine (LevelDB / RocksDB style)
11 scenes · ~77 min · build the primitive

Build your own Build an LSM-tree storage engine (LevelDB / RocksDB style)

The simplest possible storage engine that gives you BOTH ordered reads AND more keys than fit in RAM, by accepting a deal: write to RAM at memory speed, log to disk for safety, then merge sorted files in the background forever.

  1. 01
  2. 02
  3. 03
  4. 04
  5. 05
  6. 06
  7. 07
  8. 08
  9. 09
  10. 10
  11. 11
  1. 01
    Why Bitcask cannot give you ranges
    Bitcask died on cardinality, the B-tree died on write rate. The brief LSM is built to satisfy: ordered reads, more keys than fit in RAM, no seek per write.
    ~7 min
  2. 02
    Sort in RAM, log to disk
    Every put fans out: a sorted in-RAM memtable AND an append-only WAL whose only job is to rebuild the memtable on a crash.
    ~7 min
  3. 03
    Flush — pay the IOU once
    Frozen memtable streams sequentially to disk as a single immutable sorted file (an SSTable); the WAL is then deleted.
    ~7 min
  4. 04
    Newest wins — and the reads get slower
    A get walks memtable → SSTables newest-first; first hit wins. Cost grows linearly with file count, especially for misses.
    ~7 min
  5. 05
    Definitely-not-here, in one bitmap
    Each SSTable carries a bloom filter that says 'definitely no' or 'maybe' — eliminating disk reads on misses, not on hits.
    ~7 min
  6. 06
    Compaction — the GC of sorted files
    A background process merges N immutable SSTables into 1, drops shadowed/deleted records, atomically swaps files. Foreground writes never block.
    ~7 min
  7. 07
    Levels — a staircase that bounds reads
    L0 may overlap; L1+ enforce non-overlap per level; each level is ~10× the size of the one above. A get touches at most one file per tier.
    ~7 min
  8. 08
    Every byte, written ten times
    Leveled compaction pays 10–30× write amplification; tiered/universal trades that down to 3–9× at the cost of read-amp and space-amp.
    ~7 min
  9. 09
    Tombstones — deletes that can come back
    Deletes are records. Drop a tombstone before all older live records are gone, and the deleted key resurrects. Same trap as Bitcask, restated for levels.
    ~7 min
  10. 10
    Blocks, cache, and the CPU/disk dial
    SSTables are blocks; the block cache holds hot decompressed blocks; compression trades CPU for disk bytes with no effect on correctness.
    ~7 min
  11. 11
    Design your LSM — and feel its trades
    Capstone: pick a workload, set knobs, watch the verifier trace each choice back to the scene that taught it. The amp triangle is the load-bearing trade.
    ~7 min