Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)
Scene 08 · BM25 — TF saturation × IDF × length norm
BM25 ranks each match by IDF × saturating-TF / length-norm — and the saturation knob k1 is exactly what stops a keyword-stuffed doc from winning the page.
Previously

Phase 1 of every search asked each shard to RANK its local matches. The function each shard runs is BM25 — and its three constants (k1, b, IDF) decide whose match wins.

Scene 09
BM25 — TF saturation × IDF × length-norm
Diagram
Each row is one book scored against the query at top. Stacked bars decompose its BM25 score into three components: IDF (rarity), TF (saturating term frequency), and length-norm (a long doc loses points). Books with no query-term overlap stay at score 0. The right-side panel shows the two tunables k1 and b with their saturation/length curves.
QUERYdistributed systemsscore = Σ IDF(q) · tf·(k1+1) / (tf + k1·(1 − b + b·|D|/avgdl))PER-DOC SCORE BREAKDOWN · rankedIDFTF (sat)len-normDistributed Systemsbook-21.81.12.9Designing Data-Intensive …book-31.80.91.8The Cat in the Hatbook-10.0Lucene in Actionbook-40.0The Old Man and the Seabook-50.0TUNABLESk1 (TF saturation)1.20b (length norm)0.75default 1.2 / 0.75N=5 · df(distributed)=2 · df(systems)=2 · avgdl≈62
IDF(distributed) ≈ ln((5 − 2 + 0.5)/(2 + 0.5) + 1) ≈ 0.85
Book 2 wins: short doc (length-norm boost) + both terms in title.
Score the query 'distributed systems' against the five-book catalog. IDF is computed across the corpus (rare terms win); TF saturates so the 100th match barely beats the 10th; length-norm pushes long docs down. Books 1, 4, and 5 don't contain either query term — they sit at score 0.
Implementation
BM25.score
sum per-term: idf x saturating_tf
1def score(doc, query):
2 s = 0.0
3 for term in query.terms:
4 tf = doc.term_freq(term)
5 if tf == 0: continue
6 df = corpus.doc_freq(term)
7 s += idf(term, N=corpus.size, df=df)
8 * saturating_tf(tf, k1, b,
9 doc_len=doc.length, avgdl=corpus.avgdl)
10 return s
BM25.idf
rarity weight — ln((N - df + 0.5)/(df + 0.5) + 1)
1def idf(term, N, df):
2 return ln((N - df + 0.5) / (df + 0.5) + 1)
BM25.saturating_tf
k1 caps repeats; b scales length penalty
1def saturating_tf(tf, k1, b, doc_len, avgdl):
2 len_norm = (1 - b) + b * (doc_len / avgdl)
3 return tf * (k1 + 1) / (tf + k1 * len_norm)