Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)
Scene 01 · Five million books, under 100 ms
5M books, a search box, and a 100ms budget — the naive scan-every-row plan never crosses the line, no matter the cores or the disk.
Scene 01
Five million books, under 100 ms
Diagram
A SQL bar at the top runs LIKE '%distributed systems%' against five million books. The vertical stack shows the five canonical rows; below them, a dimmed strip stands in for the other 4,999,995. A blue read head sweeps top-to-bottom, examining each row in turn. On the right, three meters track the cost in real time: the wall clock against the 100 ms target, rows scanned out of 5,000,000, and how many matches have actually been found.
SQLSELECT * FROM books WHERE body LIKE '%distributed systems%'#1The Cat in the HatDr. Seuss#2Distributed SystemsMaarten van Steen#3Designing Data-Intensive ApplicationsMartin Kleppmann#4Lucene in ActionOtis Gospodnetic#5The Old Man and the SeaErnest Hemingway… 4,999,995 more rows(each one will be examined character by character)read headWALL CLOCK0 mselapsed100 ms targetROWS SCANNED0of 5,000,000MATCHES0rows where body LIKE matchedLEVERS (visible only — outcome unchanged)b-treecores: 1limit: all …NVMeScanning… one row at a time.
The query 'distributed systems' starts a row-by-row sweep over 5,000,000 books. Watch the wall-clock meter and the rows-scanned meter climb together — and watch the 100 ms line fly by.
Implementation
naive_scan(query)
row-by-row LIKE '%...%' — every byte of every body is examined
1def naive_scan(query):
2 results = []
3 for book in books: # 5,000,000 rows
4 if not contains(book.body, query.terms):
5 continue # most rows go here
6 results.append(book)
7 return results # cost: O(rows), not O(matches)