Hybrid search — fuse dense and sparse with RRF — Build a vector database (Pinecone / Weaviate / pgvector style)

Build a vector database (Pinecone / Weaviate / pgvector style) (15 scenes)

Scene 13 · Hybrid search — fuse dense and sparse with RRF

Dense vectors capture meaning but miss exact tokens like SKUs and error codes; sparse keyword search nails them. Run both and fuse the results by rank with Reciprocal Rank Fusion (RRF), which is scale-free and needs no score calibration.

Previously

Filters constrain WHICH items are eligible; they don't fix similarity's own blind spot. Dense vectors understand meaning but miss exact tokens, while keyword search is the mirror image — so hybrid search runs both and fuses by rank with RRF, getting paraphrase and exact-token matches at once. We now have a feature-complete search on one machine. The last wall is the one from scene 1: one machine can't hold a billion vectors.


Diagram

Two ranked lists flow into one. The LEFT lane is dense (vector) search — it ranks by meaning, so a paraphrase of the query scores well even without the exact word. The RIGHT lane is sparse (keyword/BM25) search — it ranks by literal token overlap. The CENTER column is the fused list: every document earns 1/(k+rank) from each list it appears in (k=60), those are summed, and the documents are re-sorted by that total — so a document both retrievers liked beats a document that topped only one list.
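The arithmetic behind that last claim can be checked by hand. The sketch below assumes nothing beyond the 1/(k+rank) rule and k=60 from the description; the helper name `rrf_contribution` and the example ranks are made up for illustration.

```python
K = 60  # the constant used in the diagram description


def rrf_contribution(rank, k=K):
    """Score a document earns from one list where it sits at `rank` (1-based)."""
    return 1.0 / (k + rank)


# Hypothetical document A: ranked 2nd by dense and 3rd by sparse.
both_lists = rrf_contribution(2) + rrf_contribution(3)

# Hypothetical document B: ranked 1st by one retriever, absent from the other.
top_of_one = rrf_contribution(1)

print(both_lists > top_of_one)  # the document in both lists wins
```

The effect has a limit: a document ranked r in both lists outscores a single first place only while 2/(60+r) > 1/61, which works out to r ≤ 61. Deep in both lists, agreement stops being enough.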

Diagram labels: dense ranks by meaning (left lane); sparse ranks by exact tokens (right lane); each list has a blind spot.

One search box has to handle two kinds of query: plain English ('comfy gaming chair') and exact strings ('SKU-44871'). Watch what happens to the query 'cybersport desk' under two different retrievers. On the LEFT, *dense* search turns the query into a vector and ranks by MEANING: it returns 'Gaming desk' and 'Esports table' because they mean the same thing, even though neither contains the word 'cybersport'. On the RIGHT, *sparse* search ranks by the literal TOKENS: it returns 'Standing desk', 'Office desk', 'Desk lamp' because they all contain the word 'desk'. Notice the blind spots: dense never returns the rows that match only on the literal token 'desk', and sparse never returns the paraphrase 'Esports table'. Neither list alone is right.
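The sparse blind spot is easy to reproduce with a toy scorer. The sketch below is not BM25; `token_overlap` is a hypothetical stand-in that only counts shared words, and the catalog simply reuses the product names from the example above.

```python
def token_overlap(query, title):
    """Count query tokens that literally appear in the title (toy stand-in for BM25)."""
    query_tokens = set(query.lower().split())
    title_tokens = set(title.lower().split())
    return len(query_tokens & title_tokens)


catalog = ["Gaming desk", "Esports table", "Standing desk", "Office desk", "Desk lamp"]
query = "cybersport desk"

# Keep every title that shares at least one literal token with the query.
sparse_hits = [title for title in catalog if token_overlap(query, title) > 0]
print(sparse_hits)
```

Every title containing 'desk' is a hit, and 'Esports table' cannot be one because it shares no token with the query, however close its meaning. In this toy version 'Gaming desk' also matches on 'desk'; a real BM25 index would weight and order the hits differently.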

Implementation

rrf_fuse

merge ranked lists by reciprocal rank — scale-free

from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc ids by summed reciprocal rank."""
    score = defaultdict(float)
    for ranking in ranked_lists:        # dense, sparse, ...
        for rank, doc in enumerate(ranking, start=1):   # ranks are 1-based
            score[doc] += 1.0 / (k + rank)   # rank, not score
    return sorted(score, key=score.get, reverse=True)
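A short usage sketch may help make the fusion concrete. The two ranked lists below are hypothetical (they are not the exact lists from the diagram), and `rrf_fuse` is repeated so the block runs on its own.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    score = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            score[doc] += 1.0 / (k + rank)
    return sorted(score, key=score.get, reverse=True)


# Hypothetical retriever outputs, best match first.
dense_ranking = ["Gaming desk", "Esports table", "Standing desk"]
sparse_ranking = ["Standing desk", "Gaming desk", "Office desk", "Desk lamp"]

fused = rrf_fuse([dense_ranking, sparse_ranking])
print(fused)
# ['Gaming desk', 'Standing desk', 'Esports table', 'Office desk', 'Desk lamp']
```

'Gaming desk' (ranks 1 and 2) and 'Standing desk' (ranks 3 and 1) appear in both lists and rise to the top. The documents found by only one retriever follow, ordered by their single rank.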

naive_score_add

the broken fix — adds incomparable scales

from collections import defaultdict

def naive_score_add(dense, sparse):
    """Deliberately broken: sums raw scores that live on different scales."""
    score = defaultdict(float)
    for doc, s in dense.items():   # cosine ~0..1
        score[doc] += s
    for doc, s in sparse.items():  # BM25 ~0..30  <-- swamps cosine
        score[doc] += s
    return sorted(score, key=score.get, reverse=True)
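To see the swamping, it may help to run the naive version on made-up scores. The numbers below are hypothetical, chosen only to sit in the rough ranges noted in the comments (cosine near 0..1, BM25 in the tens); the function is repeated so the block is self-contained.

```python
from collections import defaultdict


def naive_score_add(dense, sparse):
    score = defaultdict(float)
    for doc, s in dense.items():
        score[doc] += s
    for doc, s in sparse.items():
        score[doc] += s
    return sorted(score, key=score.get, reverse=True)


# Hypothetical raw scores: cosine similarity vs. BM25.
dense_scores = {"Gaming desk": 0.92, "Esports table": 0.90, "Desk lamp": 0.10}
sparse_scores = {"Desk lamp": 14.0, "Office desk": 11.5, "Gaming desk": 9.0}

naive = naive_score_add(dense_scores, sparse_scores)
sparse_only = sorted(sparse_scores, key=sparse_scores.get, reverse=True)

print(naive)
# ['Desk lamp', 'Office desk', 'Gaming desk', 'Esports table']
print(naive[:3] == sparse_only)  # True: the sum just reproduces the BM25 order
```

With these inputs the dense scores never change the ordering: 'Esports table', dense's second-best match, lands last, and 'Desk lamp', dense's weakest, lands first. Fusing by rank avoids this because each list contributes on the same 1/(k+rank) scale.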
