Build a Bitcask-style KV store (9 scenes)
Scene 06 · Crash recovery — scan or hint
On startup the keydir is gone (RAM is gone). Hint files (a merge byproduct) drop cold-start time ~10x; per-record CRCs make a torn tail self-truncating.
Previously

Merge produced a hint file as a byproduct. That byproduct is the difference between a 30-minute restart and a 3-minute one — and the per-record CRC is what makes the surviving log parseable in either case.

Scene 06
Crash recovery — scan or hint
Diagram
Left: data files (.hint siblings drop scan cost ~100×). Right: keydir rebuilds row-by-row. Bottom: time-to-ready meter with no-hints baseline comparator.
Bitcask recovery — scanning data files. 0 B of 5.40 GB scanned. Elapsed 0 ms.
DATA FILES ON DISK001.data · 2.00 GB · 524,288 records · no hint (full crawl) · full crawl001.data2.00 GB · 524,288 recNo .hint sidecar — the recovery scanner must crawl every record in this file to rebuild keydir entries.no hint · full crawlfull crawl002.data · 2.00 GB · 524,288 records · .hint present (metadata-only) · pending002.data2.00 GB · 524,288 rec.hint sidecar present — keydir is rebuilt from metadata only, no value scan..hint · metadata onlypending003.data (active) · 1.40 GB · 320,000 records · no hint (full crawl) · pending003.data …1.40 GB · 320,000 recNo .hint sidecar — the recovery scanner must crawl every record in this file to rebuild keydir entries.no hint · full crawlCRC mismatch — last record's checksum didn't match, tail is torn and truncatedCRC ✗ TRUNCATEDpendingKEYDIR (rebuilding)keyfileoffset0 rowsTIME TO READY0 msgreen = w/ hints (7.2 s)red = no hints (11.3 s)BYTES SCANNED0 B/ 5.40 GBdataset: 5.40 GBScanning 001 (full crawl)
The process was killed mid-write. Watch the startup scanner sweep each file: 001 has no hint sibling so it crawls record-by-record, 002 has a .hint file from merge so it flies through metadata only, and 003 (the active file) hits a CRC mismatch on its torn tail and truncates to the last valid offset. The keydir rebuilds on the right, with later writes overwriting earlier ones for the same key. Every record carries a per-record CRC32 — that's why the surviving log is always parseable even after a hard crash.
Implementation
Bitcask.rebuild_keydir
the open() path that runs before the first GET can be served
1def rebuild_keydir(data_dir):
2 keydir = {} # empty — RAM is gone
3 files = sorted(data_dir.list('*.data'))
4 for f in files:
5 hint = data_dir.path(f.id + '.hint')
6 if hint.exists():
7 entries = read_hints(hint) # metadata only
8 else:
9 entries = scan_data_file(f) # full crawl
10 for e in entries:
11 # later tstamp wins on overwrite
12 if e.key not in keydir or e.tstamp > keydir[e.key].tstamp:
13 keydir[e.key] = (f.id, e.vsz, e.vpos, e.tstamp)
14 return keydir
Bitcask.scan_data_file
full crawl with torn-tail truncation via per-record CRC
1def scan_data_file(f):
2 pos = 0
3 while pos < f.size:
4 header = f.pread(pos, HEADER_SIZE)
5 crc, tstamp, ksz, vsz = unpack(header)
6 kv = f.pread(pos + HEADER_SIZE, ksz + vsz)
7 if crc32(header[4:] + kv) != crc:
8 f.truncate(pos) # drop torn tail
9 return # stop, log ends here
10 key = kv[:ksz]
11 vpos = pos + HEADER_SIZE + ksz
12 yield Entry(key, tstamp, vsz, vpos)
13 pos += HEADER_SIZE + ksz + vsz
OnDisk.record_formats
data records carry values; hint records omit them
1# Data record — written by every put() to the active file:
2data_record = struct(
3 crc: u32, # checksum over (tstamp..value)
4 tstamp: u64,
5 ksz: u32,
6 vsz: u32,
7 key: bytes[ksz],
8 value: bytes[vsz], # the value bytes themselves
9)
10
11# Hint record — written by merge alongside each merged data file:
12hint_record = struct(
13 tstamp: u64,
14 ksz: u32,
15 vsz: u32,
16 vpos: u64, # where the value lives in the .data file
17 key: bytes[ksz],
18 # no value bytes — that is the entire point
19)