Build an LSM-tree storage engine (LevelDB / RocksDB style) (11 scenes)
Scene 04 · Newest wins — and the reads get slower
A get walks memtable → SSTables newest-first; first hit wins. Cost grows linearly with file count, especially for misses.
Previously
Disk now holds one or more sorted SSTables — but a single get has to know which one to read. The rule is simple, but its cost grows with every flush.
Scene 04
Newest wins — and the reads get slower
Diagram
RAM column on the left holds the memtable (the sorted in-RAM table from scene 2; often empty here). The SSTable shelf on the right (immutable sorted files from scene 3) is sorted newest-at-top — newer files sit above older ones. A READ CURSOR (an animated marker tracking where a get is currently looking) descends memtable → newest SSTable → next → … and stops on the first place that has the key. A 'files consulted' counter (how many places the cursor visited) and a latency readout sit beneath the cursor.
Three SSTables on the shelf. Two GETs run: one whose key is in the newest file, one whose key is only in the oldest. Watch the cursor descend; watch the counter climb.
Implementation
lsm.get (newest-first walk)
memtable, then SSTables newest-to-oldest; first hit wins
1def get(key):2 if key in memtable:3 return memtable[key]4 for sst in sstables_newest_first(): # sorted by recency5 if key not in sst.key_range: continue6 v = sst.read(key) # block index + 1 block7 if v is FOUND:8 return v # may be an absence marker9 return not_found # missed every file
Why first-hit-wins is correct
newer writes ALWAYS sit above older — cursor stops fresh
1# invariant the flush + WAL pipeline maintains:2# sst[i].max_seqno < sst[j].max_seqno for newer j3# so:4# if a key has a live record in sst[i] AND sst[j] (j newer),5# the cursor sees sst[j]'s record first and returns IT.6# older shadowed records are correct to ignore.7# they are NOT correct to delete — see scene 9.