Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)
Scene 09 · Distributed scoring is approximate
Every shard computes IDF from its own local corpus, so the same document scores differently depending on where it lives — dfs_query_then_fetch buys correctness for one extra round trip.
Previously

Phase 1 of every search asked each shard to BM25-rank its local matches. The IDF those scores depend on is also LOCAL — and that's where 'same data, same query, same answer' quietly stops being true.

Scene 09
Distributed scoring is approximate
Diagram
A coordinator at the top fans the query 'distributed' out to three shards below. Each shard carries a local-stats badge — N (docs in shard) and df (docs in shard containing the term). The coordinator merges the per-shard candidate lists into the global top-N. When dfs_query_then_fetch is on, an extra Phase-0 ribbon appears: the coordinator first asks every shard for its (N, df), sums them globally, and ships those unified stats with the query.
COORDINATORCoordinatorphase 2 · searchfetchfetchfetchshard-0top-K candsLOCAL STATSn=1000000 df=80000doc 10219.4doc 10315.8shard-1top-K candsLOCAL STATSn=1000000 df=1000doc 10253.1doc 20749.3shard-2top-K candsLOCAL STATSn=1000000 df=50000doc 31124.2doc 31813.6GLOBAL TOP-N · merged at coordinatorscore#1 doc 10253.1 shard-1#2 doc 20749.3 shard-1#3 doc 31124.2 shard-2#4 doc 10219.4 shard-0#5 doc 10315.8 shard-0(no pagination meter)Each shard scores with LOCAL (N, df). Same doc text, different shards → different scores.
Three shards. Each one BM25-ranks its local matches for 'distributed' using its OWN df. Look at the per-shard df badges — shard 0 has 80k matches, shard 1 has just 1k, shard 2 has 50k. Same query, same term, three different IDFs.
Implementation
Shard.localScore
BM25 with PER-SHARD (N, df) — the source of drift
1def local_score(doc, query, shard_stats):
2 # PER SHARD: N_local, df_local — never cluster-wide by default
3 N_local = shard_stats.N
4 df_local = shard_stats.df[query.term]
5 idf = ln((N_local - df_local + 0.5) / (df_local + 0.5) + 1)
6 tf_norm = doc.tf / (doc.tf + 1.2)
7 return tf_norm * idf