The inverted index — term to docs — Build a distributed search engine (Elasticsearch / OpenSearch style)

Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)

Scene 02 · The inverted index — term to docs

Flip the relation: a precomputed map from each word to the doc IDs it appears in turns O(documents) Boolean queries into O(matches) sorted-list merges.

Previously

Asking 'which rows contain this term?' needs a precomputed map from term to rows. Build it once, query it many times.

Scene 02

The inverted index — term to docs

Watch

Diagram

Left: the five canonical books with their bodies tokenised into pills. Right: the term dictionary, sorted alphabetically, with each term's posting list shown as a row of sorted doc-id chips. Bottom (when an AND query is active): a merge lane where two posting lists slide in and a cursor walks them in lock-step, emitting matches into a third row.

Five books, six terms each. Watch the term dictionary populate one term at a time, with each posting list showing exactly which doc ids contain that word. The walk ends on the scene-1 query 'distributed systems' — same query, two hash lookups, one merge.

Implementation

tokenize(book)

split body, lowercase, dedupe — runs once per doc, at index time

1def tokenize(book):
2    raw = split_on_whitespace(book.body)
3    terms = [lowercase(t) for t in raw]
4    return set(terms)              # dedupe within a doc

lookup(term)

one hash into the term dictionary returns a sorted posting list

1def lookup(term):
2    if term not in dictionary:
3        return []
4    return dictionary[term]        # sorted list of doc ids

and_merge(a, b)

two-pointer walk over sorted posting lists

1def and_merge(a, b):
2    i, j, out = 0, 0, []
3    while i < len(a) and j < len(b):
4        if a[i] == b[j]:           # match — emit and step both
5            out.append(a[i])
6            i += 1; j += 1
7        elif a[i] < b[j]:
8            i += 1                 # advance the smaller cursor
9        else:
10            j += 1
11    return out                     # bounded by min(|a|, |b|)

PreviousFive million books, under 100 ms NextSegments — many small immutable indexes