BM25 — TF saturation × IDF × length norm — Build a distributed search engine (Elasticsearch / OpenSearch style)

Build a distributed search engine (Elasticsearch / OpenSearch style) (12 scenes)

Scene 08 · BM25 — TF saturation × IDF × length norm

BM25 ranks each match by IDF × saturating-TF / length-norm — and the saturation knob k1 is exactly what stops a keyword-stuffed doc from winning the page.

Previously

Phase 1 of every search asked each shard to RANK its local matches. The function each shard runs is BM25 — and its three constants (k1, b, IDF) decide whose match wins.

Scene 09

BM25 — TF saturation × IDF × length-norm

Watch

Diagram

Each row is one book scored against the query at top. Stacked bars decompose its BM25 score into three components: IDF (rarity), TF (saturating term frequency), and length-norm (a long doc loses points). Books with no query-term overlap stay at score 0. The right-side panel shows the two tunables k1 and b with their saturation/length curves.

IDF(distributed) ≈ ln((5 − 2 + 0.5)/(2 + 0.5) + 1) ≈ 0.85

Book 2 wins: short doc (length-norm boost) + both terms in title.

Score the query 'distributed systems' against the five-book catalog. IDF is computed across the corpus (rare terms win); TF saturates so the 100th match barely beats the 10th; length-norm pushes long docs down. Books 1, 4, and 5 don't contain either query term — they sit at score 0.

Implementation

BM25.score

sum per-term: idf x saturating_tf

1def score(doc, query):
2    s = 0.0
3    for term in query.terms:
4        tf = doc.term_freq(term)
5        if tf == 0: continue
6        df = corpus.doc_freq(term)
7        s += idf(term, N=corpus.size, df=df)
8             * saturating_tf(tf, k1, b,
9                 doc_len=doc.length, avgdl=corpus.avgdl)
10    return s

BM25.idf

rarity weight — ln((N - df + 0.5)/(df + 0.5) + 1)

1def idf(term, N, df):
2    return ln((N - df + 0.5) / (df + 0.5) + 1)

BM25.saturating_tf

k1 caps repeats; b scales length penalty

1def saturating_tf(tf, k1, b, doc_len, avgdl):
2    len_norm = (1 - b) + b * (doc_len / avgdl)
3    return tf * (k1 + 1) / (tf + k1 * len_norm)

PreviousScatter, reduce, fetch NextDistributed scoring is approximate