Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)
Scene 08 · BM25 — TF saturation × IDF × length norm
BM25 ranks each match by IDF × saturating-TF / length-norm — and the saturation knob k1 is exactly what stops a keyword-stuffed doc from winning the page.
Previously
Phase 1 of every search asked each shard to RANK its local matches. The function each shard runs is BM25 — and its three constants (k1, b, IDF) decide whose match wins.
Scene 09
BM25 — TF saturation × IDF × length-norm
Diagram
Each row is one book scored against the query at top. Stacked bars decompose its BM25 score into three components: IDF (rarity), TF (saturating term frequency), and length-norm (a long doc loses points). Books with no query-term overlap stay at score 0. The right-side panel shows the two tunables k1 and b with their saturation/length curves.
IDF(distributed) ≈ ln((5 − 2 + 0.5)/(2 + 0.5) + 1) ≈ 0.85
Book 2 wins: short doc (length-norm boost) + both terms in title.
Score the query 'distributed systems' against the five-book catalog. IDF is computed across the corpus (rare terms win); TF saturates so the 100th match barely beats the 10th; length-norm pushes long docs down. Books 1, 4, and 5 don't contain either query term — they sit at score 0.
Implementation
BM25.score
sum per-term: idf x saturating_tf
1def score(doc, query):2 s = 0.03 for term in query.terms:4 tf = doc.term_freq(term)5 if tf == 0: continue6 df = corpus.doc_freq(term)7 s += idf(term, N=corpus.size, df=df)8 * saturating_tf(tf, k1, b,9 doc_len=doc.length, avgdl=corpus.avgdl)10 return s
BM25.idf
rarity weight — ln((N - df + 0.5)/(df + 0.5) + 1)
1def idf(term, N, df):2 return ln((N - df + 0.5) / (df + 0.5) + 1)
BM25.saturating_tf
k1 caps repeats; b scales length penalty
1def saturating_tf(tf, k1, b, doc_len, avgdl):2 len_norm = (1 - b) + b * (doc_len / avgdl)3 return tf * (k1 + 1) / (tf + k1 * len_norm)