Build an LSM-tree storage engine (LevelDB / RocksDB style) (11 scenes)
Scene 09 · Tombstones — deletes that can come back
Deletes are records. Drop a tombstone before all older live records are gone, and the deleted key resurrects. Same trap as Bitcask, restated for levels.
Previously
We've only put records in. Deletes are records too — and where they LIVE on the staircase, and how long, decides whether your delete actually sticks.
Scene 09
Tombstones — deletes that can come back
Diagram
Top: a six-step storyboard — PUT k=v1, DELETE k, compaction, the older file beneath, GET k, verdict. The new term lands here: a TOMBSTONE is a record whose value is a special 'absent' marker (rendered as ⊥) — it doesn't remove the key from disk, it RECORDS that the key has been deleted, so the read path returns not_found instead of the older live value beneath. Middle: a mini level staircase (scene 7) showing files + their records (live in blue, tombstone in red, the spotlight record outlined). Bottom: a verdict panel reporting the final GET k result (v1 resurrected = WRONG — the delete was undone; not_found = correct).
Six storyboard steps with the NAIVE rule (the last is the verdict). Watch the tombstone get dropped at step 3 and the older live record resurrect at step 5.
Implementation
Compaction tombstone rule
the line between 'safe to drop' and 'resurrection bug'
1def can_drop_tombstone(t, compaction_inputs, level):2 if level == BOTTOM_LEVEL: return True # nothing deeper3 for deeper_level in levels_below(level):4 if deeper_level.has_overlapping(t.key):5 return False # older live may exist6 return True78# the WRONG rule (naive — used in scene's storyboard):9def naive_can_drop(t, inputs):10 return not any(r.key == t.key for r in inputs)
Range tombstone (variant)
delete a key range without writing per-key tombstones
1# RocksDB DeleteRange — one record covers many keys:2range_tombstone = (begin_key, end_key, seqno)3# stored in a separate meta-block per SSTable4# read path: check both per-key tombstones AND range tombstones5# subject to the SAME retention rule across levels.67# LevelDB has no DeleteRange; it must DELETE each key individually.