Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)
Scene 02 · The inverted index — term to docs
Flip the relation: a precomputed map from each word to the doc IDs it appears in turns O(documents) Boolean queries into O(matches) sorted-list merges.
Previously

Asking 'which rows contain this term?' needs a precomputed map from term to rows. Build it once, query it many times.

Scene 02
The inverted index — term to docs
Diagram
Left: the five canonical books with their bodies tokenised into pills. Right: the term dictionary, sorted alphabetically, with each term's posting list shown as a row of sorted doc-id chips. Bottom (when an AND query is active): a merge lane where two posting lists slide in and a cursor walks them in lock-step, emitting matches into a third row.
QUERYcatDOCUMENTS · tokenised#1The Cat in the Hatthecatinthehatclassic#2Distributed Systemsdistribut…systems#3Designing Data-Intensive Applicationsdesigningdataintensiveapplicati…distribut…#4Lucene in Actionluceneinactionsearchjava#5The Old Man and the SeatheoldmanseafictionclassicTERM DICTIONARY · sortedtermposting list (sorted doc ids)applications3cat1classic15data3designing3distributed23fiction5hat1lucene4sea5search4systems23Term 'cat' → posting list [1] (sorted doc ids)
Five books, six terms each. Watch the term dictionary populate one term at a time, with each posting list showing exactly which doc ids contain that word. The walk ends on the scene-1 query 'distributed systems' — same query, two hash lookups, one merge.
Implementation
tokenize(book)
split body, lowercase, dedupe — runs once per doc, at index time
1def tokenize(book):
2 raw = split_on_whitespace(book.body)
3 terms = [lowercase(t) for t in raw]
4 return set(terms) # dedupe within a doc
lookup(term)
one hash into the term dictionary returns a sorted posting list
1def lookup(term):
2 if term not in dictionary:
3 return []
4 return dictionary[term] # sorted list of doc ids
and_merge(a, b)
two-pointer walk over sorted posting lists
1def and_merge(a, b):
2 i, j, out = 0, 0, []
3 while i < len(a) and j < len(b):
4 if a[i] == b[j]: # match — emit and step both
5 out.append(a[i])
6 i += 1; j += 1
7 elif a[i] < b[j]:
8 i += 1 # advance the smaller cursor
9 else:
10 j += 1
11 return out # bounded by min(|a|, |b|)