Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)
Scene 01 · Five million books, under 100 ms
5M books, a search box, and a 100ms budget — the naive scan-every-row plan never crosses the line, no matter the cores or the disk.
Scene 01
Five million books, under 100 ms
Diagram
A SQL bar at the top runs LIKE '%distributed systems%' against five million books. The vertical stack shows the five canonical rows; below them, a dimmed strip stands in for the other 4,999,995. A blue read head sweeps top-to-bottom, examining each row in turn. On the right, three meters track the cost in real time: the wall clock against the 100 ms target, rows scanned out of 5,000,000, and how many matches have actually been found.
The query 'distributed systems' starts a row-by-row sweep over 5,000,000 books. Watch the wall-clock meter and the rows-scanned meter climb together — and watch the 100 ms line fly by.
Implementation
naive_scan(query)
row-by-row LIKE '%...%' — every byte of every body is examined
1def naive_scan(query):2 results = []3 for book in books: # 5,000,000 rows4 if not contains(book.body, query.terms):5 continue # most rows go here6 results.append(book)7 return results # cost: O(rows), not O(matches)