Five million books, under 100 ms

Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)

Scene 01 · Five million books, under 100 ms

5M books, a search box, and a 100ms budget — the naive scan-every-row plan never crosses the line, no matter the cores or the disk.

Scene 01

Watch

Diagram

A SQL bar at the top runs LIKE '%distributed systems%' against five million books. The vertical stack shows the five canonical rows; below them, a dimmed strip stands in for the other 4,999,995. A blue read head sweeps top-to-bottom, examining each row in turn. On the right, three meters track the cost in real time: the wall clock against the 100 ms target, rows scanned out of 5,000,000, and how many matches have actually been found.

The query 'distributed systems' starts a row-by-row sweep over 5,000,000 books. Watch the wall-clock meter and the rows-scanned meter climb together — and watch the 100 ms line fly by.

Implementation

naive_scan(query)

row-by-row LIKE '%...%' — every byte of every body is examined

1def naive_scan(query):
2    results = []
3    for book in books:                  # 5,000,000 rows
4        if not contains(book.body, query.terms):
5            continue                    # most rows go here
6        results.append(book)
7    return results                      # cost: O(rows), not O(matches)

NextThe inverted index — term to docs