Build an LSM-tree storage engine (LevelDB / RocksDB style) (11 scenes)
Scene 06 · Compaction — the GC of sorted files
A background process merges N immutable SSTables into 1, drops shadowed/deleted records, atomically swaps files. Foreground writes never block.
Previously
Bloom bounded the cost of LOOKING at files. It didn't bound how many there are. Every flush adds another. Sooner or later, we have to remove files — not just skip them.
Scene 06
Compaction — the GC of sorted files
Diagram
Top: memtable (in-RAM sorted table, scene 2) + WAL (write-ahead log, scene 2) keep accepting writes — a 'writes during compaction' counter ticks the whole time. The SSTable shelf shows immutable sorted files (scene 3); a COMPACTOR badge (the new word's worker — a background process that picks several files and merges them) selects N input files (highlighted outline), runs a merge-sort cursor across them (a sorted-merge that emits each key once at its newest value), and emits one new SSTable. Old inputs fade and disappear. The 'atomic file swap' (when compaction's output replaces its inputs in one tick — no half-state where some readers see old and some see new) is the load-bearing safety property.
Five SSTables on the shelf, three of them holding the same key. Watch the compactor select them, run a merge-sort, and emit a single new file. Foreground writes keep going; the writes-during-compaction counter never stops.
Implementation
lsm.compact
merge-sort N inputs → 1 output, drop shadowed/tombstones, atomic swap
1def compact(inputs: list[SSTable]) -> SSTable:2 out = open_new_sstable(next_id())3 cursors = [s.iter_sorted() for s in inputs]4 while any(cursors):5 (key, val, seqno) = min_record(cursors)6 if newer_record_exists(key, val, seqno):7 continue # shadowed — drop8 if val.is_tombstone and at_bottom_level:9 continue # tombstone GC (scene 9)10 out.append(key, val)11 out.fsync()12 atomic_swap_files(remove=inputs, add=[out]) # one tick13 return out
Foreground writes are unaffected
puts always land in the memtable + WAL
1def put(key, val):2 # NOTE: no awareness of compaction whatsoever3 wal.append(key, val); fsync_if_policy(wal)4 memtable.insert_sorted(key, val)5 return ok6# inputs are immutable; outputs land via atomic_swap.7# readers see EITHER the old file list OR the new — never a mix.