Build a Bitcask-style KV store (9 scenes)
Scene 07 · Tombstones — deletes that come back
If merge GCs a tombstone before every older segment containing the key has been merged away, restart resurrects the deleted key. The riak_kv #925 fix.
Previously
fsync chose how much of the tail to keep. Merge chose how much of the dead to reclaim. The collision between those two policies is where deleted keys come back from the dead.
Scene 07
Tombstones — deletes that come back
Diagram
Five frames left to right: PUT, DELETE, merge, crash, restart. Slot picks the rule; final GET shows v1 (wrong) or not_found (ok).
Naive merge rule. Five steps animate left to right, ~2s each: PUT, DELETE, merge, crash, restart. Step 5 is the punchline — restart rebuilds the keydir from disk and the deleted key reappears with a red WRONG badge.
Implementation
Bitcask.delete
delete is just another append — a tombstone record. Highlight indices are 0-based.
1def delete(key):2 tomb = Record(3 key = key,4 value = TOMBSTONE_MARKER, # vsz = 05 tstamp = now(),6 )7 active_file.append(tomb)8 keydir.pop(key, None) # RAM only — disk still holds older PUT
Merge.naive
scan a window of immutable files, keep only what keydir points at. Highlight indices are 0-based.
1def merge(window_files, keydir):2 out = new_data_file()3 for f in window_files:4 for rec in f.scan():5 if rec.is_tombstone:6 # key already gone from keydir → assume redundant7 if rec.key not in keydir:8 continue # drop the tombstone9 elif keydir[rec.key] != (f.id, rec.offset):10 continue # stale PUT, shadowed by newer record11 out.append(rec)12 atomic_swap(window_files, out)
Merge.tombstone2
riak_kv #925 — re-append the tombstone into the oldest file still mentioning the key. Highlight indices are 0-based.
1def merge(window_files, keydir, all_files):2 out = new_data_file()3 for f in window_files:4 for rec in f.scan():5 if rec.is_tombstone:6 older = oldest_file_with(rec.key, all_files)7 if older and older not in window_files:8 older.append(rec) # carry tombstone backwards9 continue10 # safe to drop: no older file mentions this key11 continue12 elif keydir[rec.key] != (f.id, rec.offset):13 continue14 out.append(rec)15 atomic_swap(window_files, out)
Bitcask.startup_replay
rebuild the keydir by scanning every immutable file in tstamp order. Highlight indices are 0-based.
1def open(dir):2 keydir = {}3 for f in sorted(dir.data_files(), key=lambda f: f.tstamp):4 for rec in f.scan():5 if rec.is_tombstone:6 keydir.pop(rec.key, None)7 else:8 keydir[rec.key] = (f.id, rec.offset, rec.tstamp)9 return keydir # whichever record wins by tstamp wins on disk