Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)
Scene 02 · The inverted index — term to docs
Flip the relation: a precomputed map from each word to the doc IDs it appears in turns O(documents) Boolean queries into O(matches) sorted-list merges.
Previously
Asking 'which rows contain this term?' needs a precomputed map from term to rows. Build it once, query it many times.
Scene 02
The inverted index — term to docs
Diagram
Left: the five canonical books with their bodies tokenised into pills. Right: the term dictionary, sorted alphabetically, with each term's posting list shown as a row of sorted doc-id chips. Bottom (when an AND query is active): a merge lane where two posting lists slide in and a cursor walks them in lock-step, emitting matches into a third row.
Five books, six terms each. Watch the term dictionary populate one term at a time, with each posting list showing exactly which doc ids contain that word. The walk ends on the scene-1 query 'distributed systems' — same query, two hash lookups, one merge.
Implementation
tokenize(book)
split body, lowercase, dedupe — runs once per doc, at index time
1def tokenize(book):2 raw = split_on_whitespace(book.body)3 terms = [lowercase(t) for t in raw]4 return set(terms) # dedupe within a doc
lookup(term)
one hash into the term dictionary returns a sorted posting list
1def lookup(term):2 if term not in dictionary:3 return []4 return dictionary[term] # sorted list of doc ids
and_merge(a, b)
two-pointer walk over sorted posting lists
1def and_merge(a, b):2 i, j, out = 0, 0, []3 while i < len(a) and j < len(b):4 if a[i] == b[j]: # match — emit and step both5 out.append(a[i])6 i += 1; j += 17 elif a[i] < b[j]:8 i += 1 # advance the smaller cursor9 else:10 j += 111 return out # bounded by min(|a|, |b|)