Build an LSM-tree storage engine (LevelDB / RocksDB style) (11 scenes)
Scene 09 · Tombstones — deletes that can come back
Deletes are records. Drop a tombstone before all older live records are gone, and the deleted key resurrects. Same trap as Bitcask, restated for levels.
Previously

We've only put records in. Deletes are records too — and where they LIVE on the staircase, and how long, decides whether your delete actually sticks.

Scene 09
Tombstones — deletes that can come back
Diagram
Top: a six-step storyboard — PUT k=v1, DELETE k, compaction, the older file beneath, GET k, verdict. The new term lands here: a TOMBSTONE is a record whose value is a special 'absent' marker (rendered as ⊥) — it doesn't remove the key from disk, it RECORDS that the key has been deleted, so the read path returns not_found instead of the older live value beneath. Middle: a mini level staircase (scene 7) showing files + their records (live in blue, tombstone in red, the spotlight record outlined). Bottom: a verdict panel reporting the final GET k result (v1 resurrected = WRONG — the delete was undone; not_found = correct).
step 1PUT k=v1flushed to L1step 2DELETE ktombstone in L0step 3L0→L1 compaction (na…tombstone dropped — no …step 4L1 still has k=v1older file beneath was …step 5GET kno tombstone, hits v1 —…step 6verdictnaive rule resurrected …L1L1/000003.s…k=v1a=5Step 1/6 — flushed to L1
Six storyboard steps with the NAIVE rule (the last is the verdict). Watch the tombstone get dropped at step 3 and the older live record resurrect at step 5.
Implementation
Compaction tombstone rule
the line between 'safe to drop' and 'resurrection bug'
1def can_drop_tombstone(t, compaction_inputs, level):
2 if level == BOTTOM_LEVEL: return True # nothing deeper
3 for deeper_level in levels_below(level):
4 if deeper_level.has_overlapping(t.key):
5 return False # older live may exist
6 return True
7
8# the WRONG rule (naive — used in scene's storyboard):
9def naive_can_drop(t, inputs):
10 return not any(r.key == t.key for r in inputs)
Range tombstone (variant)
delete a key range without writing per-key tombstones
1# RocksDB DeleteRange — one record covers many keys:
2range_tombstone = (begin_key, end_key, seqno)
3# stored in a separate meta-block per SSTable
4# read path: check both per-key tombstones AND range tombstones
5# subject to the SAME retention rule across levels.
6
7# LevelDB has no DeleteRange; it must DELETE each key individually.