Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)
Scene 09 · Distributed scoring is approximate
Every shard computes IDF from its own local corpus, so the same document scores differently depending on which shard it lives on. dfs_query_then_fetch trades one extra round trip for IDF computed from cluster-wide statistics.
Previously
Phase 1 of every search asked each shard to BM25-rank its local matches. The IDF those scores depend on is also LOCAL — and that's where 'same data, same query, same answer' quietly stops being true.
Diagram
A coordinator at the top fans the query 'distributed' out to three shards below. Each shard carries a local-stats badge — N (docs in shard) and df (docs in shard containing the term). The coordinator merges the per-shard candidate lists into the global top-N. When dfs_query_then_fetch is on, an extra Phase-0 ribbon appears: the coordinator first asks every shard for its (N, df), sums them globally, and ships those unified stats with the query.
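The Phase-0 ribbon and the merge step in the diagram can be sketched in a few lines. This is a minimal illustration, not how Elasticsearch or OpenSearch implement it: `ShardStats`, `gather_global_stats`, and `merge_top_n` are hypothetical names invented here, and the per-shard N of 100,000 in the usage lines is an assumption (the scene only shows df).

```python
import heapq
from dataclasses import dataclass


@dataclass
class ShardStats:
    """Hypothetical per-shard term statistics (not a real Elasticsearch type)."""
    N: int    # docs in this shard
    df: dict  # term -> number of docs in this shard containing the term


def gather_global_stats(shards, term):
    """Phase 0: sum every shard's (N, df) into cluster-wide totals."""
    total_n = sum(s.N for s in shards)
    total_df = sum(s.df.get(term, 0) for s in shards)
    return total_n, total_df


def merge_top_n(per_shard_hits, n):
    """Coordinator merge: keep the n highest-scoring (score, doc_id) pairs."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(n, all_hits)


# df values are the scene's badges; N = 100,000 per shard is ASSUMED.
shards = [
    ShardStats(N=100_000, df={"distributed": 80_000}),
    ShardStats(N=100_000, df={"distributed": 1_000}),
    ShardStats(N=100_000, df={"distributed": 50_000}),
]
global_n, global_df = gather_global_stats(shards, "distributed")
```

With these inputs the coordinator would ship (N = 300,000, df = 131,000) alongside the query, so every shard scores against the same statistics before `merge_top_n` combines the candidate lists.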
Three shards. Each one BM25-ranks its local matches for 'distributed' using its OWN df. Look at the per-shard df badges — shard 0 has 80k matches, shard 1 has just 1k, shard 2 has 50k. Same query, same term, three different IDFs.
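To put numbers on "three different IDFs", here is a small sketch that plugs the df badges into the BM25 IDF formula this scene uses. The scene does not state N per shard, so the 100,000 below is purely an assumption for illustration, and `bm25_idf` is a helper name introduced here.

```python
import math


def bm25_idf(n_docs, df):
    """BM25 IDF in the same form as the scene's local_score."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)


# df values are the shard badges from the scene.
# ASSUMPTION: every shard holds 100,000 docs (N is not given in the scene).
ASSUMED_N = 100_000
shard_df = {"shard 0": 80_000, "shard 1": 1_000, "shard 2": 50_000}

local_idf = {name: bm25_idf(ASSUMED_N, df) for name, df in shard_df.items()}

for name, idf in local_idf.items():
    print(f"{name}: df={shard_df[name]:>6}  idf={idf:.3f}")
```

Under that assumption the IDFs come out at roughly 0.22, 4.6, and 0.69, so the same term is weighted about twenty times more heavily on shard 1 than on shard 0.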
Implementation
Shard.localScore
BM25 with PER-SHARD (N, df) — the source of drift
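import math

def local_score(doc, query, shard_stats):
    # PER SHARD: N_local, df_local; never cluster-wide by default
    N_local = shard_stats.N                # docs in this shard
    df_local = shard_stats.df[query.term]  # docs in this shard containing the term
    # BM25 IDF, using the natural log
    idf = math.log((N_local - df_local + 0.5) / (df_local + 0.5) + 1)
    # Simplified TF saturation with k1 = 1.2 (no length normalization)
    tf_norm = doc.tf / (doc.tf + 1.2)
    return tf_norm * idf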
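Why the drift matters at merge time: a sketch comparing two hypothetical documents under local and global statistics. It reuses the scene's formula with the stats passed in explicitly. The two documents, their tf values, and N = 100,000 per shard are all assumptions made for the example; only the df values come from the scene.

```python
import math

K1 = 1.2  # same saturation constant as local_score


def score(tf, n_docs, df):
    """Same formula as local_score, with (N, df) passed in explicitly."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return tf / (tf + K1) * idf


# ASSUMPTION: 100,000 docs per shard; df values are the scene's badges.
N_SHARD = 100_000
strong = {"tf": 10, "df_local": 80_000}  # hypothetical doc on shard 0
weak = {"tf": 1, "df_local": 1_000}      # hypothetical doc on shard 1

# Local stats: each doc is scored with its own shard's (N, df).
local_strong = score(strong["tf"], N_SHARD, strong["df_local"])
local_weak = score(weak["tf"], N_SHARD, weak["df_local"])

# Global stats: sums over all three shards, as Phase 0 would supply them.
N_GLOBAL = 3 * N_SHARD
DF_GLOBAL = 80_000 + 1_000 + 50_000
global_strong = score(strong["tf"], N_GLOBAL, DF_GLOBAL)
global_weak = score(weak["tf"], N_GLOBAL, DF_GLOBAL)

print(local_weak > local_strong)    # True: the weak match wins on local stats
print(global_strong > global_weak)  # True: the order flips with global stats
```

With local stats, the document that mentions the term once outranks the one that mentions it ten times, simply because it sits on the shard where the term is rare. With shared stats, only tf separates them.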