Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)
Scene 03 · Segments — many small immutable indexes
Lucene appends a fresh tiny inverted index per write batch and merges in the background — readers never lock, deletes are bit flips, mutability is an asynchronous receipt.
Previously
The inverted index is beautiful when frozen and miserable when mutated. Every PUT would touch the term dictionary at random terms; every DELETE would shift posting lists. The fix is to never mutate.
Diagram
Top-left: the IndexWriter BUFFER — an in-memory tray that catches PUTs as they arrive (not yet searchable). Center: the IMMUTABLE SEGMENT STACK — newest segment on top, each tile a tiny inverted index of its own (a few terms with their posting lists, a doc count, and — when the .liv overlay is on — a row of liveness bits, one per doc). Right: the MERGE controller, which only appears when a merge is in flight; it pulls 2 small segments out of the stack and emits one larger segment in their place. The three cadence tiles up top are placeholders for scene 4 — for now read them as 'when the buffer seals' and 'when the merge fires.'
New term this scene: SEGMENT — a small, immutable, self-contained inverted index. Watch the five canonical books arrive: books 1-3 fill the in-memory buffer and seal into seg_001, then books 4-5 arrive into a fresh buffer and seal into seg_002, then a merge controller fuses both into one larger segment. seg_001 and seg_002 are never edited — they fade out and a brand-new segment takes their place.
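To make the segment idea concrete, here is a minimal sketch of two read-only segments and a search that fans out across them. The names `Segment`, `build_segment`, and `search` are hypothetical, and the book texts are placeholders because the scene does not say what the five books contain. It assumes whitespace tokenization and is not Lucene's on-disk format.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Segment:
    """A small, self-contained inverted index that is never edited after creation."""
    name: str
    postings: Mapping[str, Tuple[int, ...]]  # term -> sorted doc ids
    doc_count: int


def build_segment(name, docs):
    """Build a read-only segment from a {doc_id: text} mapping (hypothetical helper)."""
    postings = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            postings.setdefault(term, []).append(doc_id)
    # MappingProxyType gives a read-only view; tuples freeze each posting list.
    frozen = MappingProxyType({t: tuple(ids) for t, ids in postings.items()})
    return Segment(name, frozen, len(docs))


def search(segments, term):
    """Query every segment in the snapshot and combine the hits."""
    hits = []
    for seg in segments:
        hits.extend(seg.postings.get(term, ()))
    return sorted(hits)


# Placeholder texts standing in for the five books.
seg_001 = build_segment("seg_001", {1: "red fox", 2: "blue fox", 3: "red owl"})
seg_002 = build_segment("seg_002", {4: "blue owl", 5: "red hen"})

snapshot = (seg_001, seg_002)  # a reader's fixed view of the segment stack
red_hits = search(snapshot, "red")
```

Because each segment is complete on its own, a query is just the union of per-segment lookups, and a reader holding `snapshot` is unaffected by segments that appear later.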
Implementation
indexer.seal_buffer
buffer fills → freeze → emit a new immutable segment
def index(doc):
    global buffer                      # the tray is rebound below
    buffer.append(doc)                 # in-memory only, not yet searchable
    if buffer.full():
        seg = open_new_segment(next_id())
        for d in buffer.iter():
            seg.add_to_inverted_index(d)
        seg.seal()                     # immutable forever
        live_segments.append(seg)      # readers pick it up on next snapshot
        buffer = Buffer()              # fresh tray for the next batch
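The panel above is pseudocode. A runnable version might look like the sketch below, assuming a buffer capacity of three documents and an explicit `flush()` for the partially filled buffer (the scene defers the real seal cadence to scene 4). `IndexWriter` here is a hypothetical toy class, not Lucene's class of the same name, and the document texts are placeholders.

```python
from collections import defaultdict


class Segment:
    """Toy segment: accepts documents until sealed, then refuses all writes."""

    def __init__(self, seg_id):
        self.seg_id = seg_id
        self.postings = defaultdict(list)  # term -> doc ids, in insertion order
        self.sealed = False

    def add_to_inverted_index(self, doc_id, text):
        if self.sealed:
            raise RuntimeError("segment is sealed")
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)

    def seal(self):
        # Freeze posting lists into tuples so they cannot be edited in place.
        self.postings = {t: tuple(ids) for t, ids in self.postings.items()}
        self.sealed = True


class IndexWriter:
    """Hypothetical writer: buffers docs in memory, seals a segment when full."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.buffer = []          # not searchable yet
        self.live_segments = []   # what readers snapshot
        self._next_id = 1

    def index(self, doc_id, text):
        self.buffer.append((doc_id, text))
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        seg = Segment(f"seg_{self._next_id:03d}")
        self._next_id += 1
        for doc_id, text in self.buffer:
            seg.add_to_inverted_index(doc_id, text)
        seg.seal()
        # Build a new list rather than appending, so older snapshots stay intact.
        self.live_segments = self.live_segments + [seg]
        self.buffer = []


writer = IndexWriter(capacity=3)
books = [(1, "red fox"), (2, "blue fox"), (3, "red owl"), (4, "blue owl"), (5, "red hen")]
for doc_id, text in books:
    writer.index(doc_id, text)

pending = len(writer.buffer)  # books 4 and 5 are still waiting in the buffer
writer.flush()                # seal them into a second segment
segment_names = [s.seg_id for s in writer.live_segments]
```

Books 1-3 fill the buffer and seal automatically; books 4-5 sit unsearchable until the explicit `flush()`, which mirrors the seg_001 / seg_002 sequence in the diagram.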
indexer.delete
delete = flip a .liv bit; the segment is never rewritten
def delete(doc_id):
    seg = segment_holding(doc_id)
    seg.liv.clear_bit(doc_id)   # one bit, out-of-band
    # The inverted-index posting list is UNCHANGED.
    # search() will filter doc_id out via seg.liv on read.
    # Bytes are reclaimed only when seg is merged.
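The delete path can be sketched in runnable form as follows. This is an illustration under assumptions: the liveness overlay is modeled as a `bytearray` with one byte per document rather than a packed bitset, and `Segment`, `delete`, and `search` are hypothetical names with placeholder document texts.

```python
class Segment:
    """Toy sealed segment with a separate, mutable liveness overlay."""

    def __init__(self, name, docs):
        self.name = name
        self.doc_ids = tuple(sorted(docs))
        # Map each doc id to its local position in the liveness array.
        self._ordinal = {d: i for i, d in enumerate(self.doc_ids)}
        postings = {}
        for doc_id in self.doc_ids:
            for term in set(docs[doc_id].lower().split()):
                postings.setdefault(term, []).append(doc_id)
        self.postings = {t: tuple(ids) for t, ids in postings.items()}
        # 1 = live, 0 = deleted; one byte per doc keeps the sketch simple.
        self.liv = bytearray(b"\x01" * len(self.doc_ids))

    def holds(self, doc_id):
        return doc_id in self._ordinal

    def mark_deleted(self, doc_id):
        self.liv[self._ordinal[doc_id]] = 0

    def is_live(self, doc_id):
        return self.liv[self._ordinal[doc_id]] == 1


def delete(segments, doc_id):
    """Clear the liveness flag in whichever segment holds doc_id."""
    for seg in segments:
        if seg.holds(doc_id):
            seg.mark_deleted(doc_id)
            return True
    return False


def search(segments, term):
    """Read posting lists as-is, then filter out non-live docs."""
    hits = []
    for seg in segments:
        for doc_id in seg.postings.get(term, ()):
            if seg.is_live(doc_id):
                hits.append(doc_id)
    return hits


seg = Segment("seg_001", {1: "red fox", 2: "blue fox", 3: "red owl"})
before = search([seg], "fox")
deleted = delete([seg], 2)
after = search([seg], "fox")
postings_after_delete = seg.postings["fox"]  # still lists doc 2
```

The posting list for `fox` still contains doc 2 after the delete; only the overlay changed, which is why the space is not reclaimed until a merge rewrites the segment.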
merge_controller.tick
background: pick small segments, emit one bigger one, atomic swap
def maybe_merge(live_segments):
    candidates = pick_small_adjacent(live_segments)
    if not candidates:
        return
    out = open_new_segment(next_id())
    for seg in candidates:
        for doc in seg.iter_live_docs():   # skips .liv-cleared docs
            out.add_to_inverted_index(doc)
    out.seal()
    atomic_swap(remove=candidates, add=[out])   # readers never see a mix
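A runnable sketch of the merge-and-swap step is shown below. It is a simplification under stated assumptions: the toy `Segment` keeps the original text so the merged segment can be rebuilt by re-indexing, deletions are a plain set standing in for cleared `.liv` bits, and a single tuple reassignment stands in for the atomic swap. `SegmentStack` and its methods are hypothetical names.

```python
class Segment:
    """Toy segment that keeps source text so a merge can rebuild postings."""

    def __init__(self, name, docs, deleted=()):
        self.name = name
        self.docs = dict(docs)        # doc_id -> text
        self.deleted = set(deleted)   # stand-in for cleared .liv bits
        self.postings = {}
        for doc_id in sorted(self.docs):
            for term in set(self.docs[doc_id].lower().split()):
                self.postings.setdefault(term, []).append(doc_id)

    def iter_live_docs(self):
        for doc_id in sorted(self.docs):
            if doc_id not in self.deleted:
                yield doc_id, self.docs[doc_id]


class SegmentStack:
    """Holds the live segment list as an immutable tuple."""

    def __init__(self, segments):
        self._segments = tuple(segments)

    def snapshot(self):
        # Readers keep this tuple; later swaps never modify it.
        return self._segments

    def merge(self, candidates, new_name):
        merged_docs = {}
        for seg in candidates:
            for doc_id, text in seg.iter_live_docs():  # deleted docs are dropped here
                merged_docs[doc_id] = text
        out = Segment(new_name, merged_docs)
        kept = tuple(s for s in self._segments if s not in candidates)
        # One reference assignment: a reader sees the old tuple or the new one.
        self._segments = kept + (out,)
        return out


stack = SegmentStack([
    Segment("seg_001", {1: "red fox", 2: "blue fox", 3: "red owl"}, deleted={2}),
    Segment("seg_002", {4: "blue owl", 5: "red hen"}),
])

old_view = stack.snapshot()                  # a reader that started before the merge
merged = stack.merge(old_view, "seg_003")
new_view = stack.snapshot()                  # a reader that started after the merge
```

A reader holding `old_view` keeps searching seg_001 and seg_002 untouched, while new readers see only seg_003, in which deleted doc 2 no longer occupies any posting list.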