Build a Bitcask-style KV store (9 scenes)
Scene 07 · Tombstones — deletes that come back
If merge GCs a tombstone before every older segment containing the key has been merged away, restart resurrects the deleted key. The riak_kv #925 fix.
Previously

fsync chose how much of the tail to keep. Merge chose how much of the dead to reclaim. The collision between those two policies is where deleted keys come back from the dead.

Scene 07
Tombstones — deletes that come back
Diagram
Five frames left to right: PUT, DELETE, merge, crash, restart. Slot picks the rule; final GET shows v1 (wrong) or not_found (ok).
Bitcask tombstone — step 1/5: PUT k=v1Storyboard of the tombstone resurrection bug. Merge rule is naive; delete mode is immediate; gap between delete and merge is immediate.1PUT k=v1appends to active file …data 003k=v1KEYDIRk003@642DELETE ktombstone in 007 · k re…data 003k=v1data 005a=alpb=betdata 007a=alpk=⊥KEYDIRa007@0b005@323Merge 005-007naive: no live k in win…data 003 (untouched)k=v1data 005reclaimeda=alpb=betdata 007reclaimeda=alpk=⊥merged M-1a=alpb=betactive 008c=gamKEYDIRaM-1@0bM-1@24rule: naive (drops tomb)4Crashkeydir is RAM only · di…Step 4 introduces the resurrection bugUH OHdata 003k=v1merged M-1a=alpb=betKEYDIR(empty)5Restartscan replays k=v1 from …Step 5 introduces the resurrection bugUH OHdata 003 (replayed)k=v1merged M-1 (replayed)a=alpb=betKEYDIRk003@64aM-1@0bM-1@24MERGE BEHAVIOURrule: naiverulenaivetombstone2delete_mode: immediatedelete_modeimmediatekeepgap: immediategapimmediatedelayednaive merge drops tombstones; delete_mode=immediate forgets the tomb fast;…naive merge drops tombstones; delete_mode=immediate forgets the tomb fast; immediate merge runs before the tomb can land.FINAL GET k(run all 5 steps to see the result)Slot 0 — naive merge drops the tombstone and resurrects k · step 1/5 — PUT k=v1
Naive merge rule. Five steps animate left to right, ~2s each: PUT, DELETE, merge, crash, restart. Step 5 is the punchline — restart rebuilds the keydir from disk and the deleted key reappears with a red WRONG badge.
Implementation
Bitcask.delete
delete is just another append — a tombstone record. Highlight indices are 0-based.
1def delete(key):
2 tomb = Record(
3 key = key,
4 value = TOMBSTONE_MARKER, # vsz = 0
5 tstamp = now(),
6 )
7 active_file.append(tomb)
8 keydir.pop(key, None) # RAM only — disk still holds older PUT
Merge.naive
scan a window of immutable files, keep only what keydir points at. Highlight indices are 0-based.
1def merge(window_files, keydir):
2 out = new_data_file()
3 for f in window_files:
4 for rec in f.scan():
5 if rec.is_tombstone:
6 # key already gone from keydir → assume redundant
7 if rec.key not in keydir:
8 continue # drop the tombstone
9 elif keydir[rec.key] != (f.id, rec.offset):
10 continue # stale PUT, shadowed by newer record
11 out.append(rec)
12 atomic_swap(window_files, out)
Merge.tombstone2
riak_kv #925 — re-append the tombstone into the oldest file still mentioning the key. Highlight indices are 0-based.
1def merge(window_files, keydir, all_files):
2 out = new_data_file()
3 for f in window_files:
4 for rec in f.scan():
5 if rec.is_tombstone:
6 older = oldest_file_with(rec.key, all_files)
7 if older and older not in window_files:
8 older.append(rec) # carry tombstone backwards
9 continue
10 # safe to drop: no older file mentions this key
11 continue
12 elif keydir[rec.key] != (f.id, rec.offset):
13 continue
14 out.append(rec)
15 atomic_swap(window_files, out)
Bitcask.startup_replay
rebuild the keydir by scanning every immutable file in tstamp order. Highlight indices are 0-based.
1def open(dir):
2 keydir = {}
3 for f in sorted(dir.data_files(), key=lambda f: f.tstamp):
4 for rec in f.scan():
5 if rec.is_tombstone:
6 keydir.pop(rec.key, None)
7 else:
8 keydir[rec.key] = (f.id, rec.offset, rec.tstamp)
9 return keydir # whichever record wins by tstamp wins on disk