Build a vector database (Pinecone / Weaviate / pgvector style) (15 scenes)
Scene 13 · Hybrid search — fuse dense and sparse with RRF
Dense vectors capture meaning but miss exact tokens like SKUs and error codes; sparse keyword search nails those tokens but misses paraphrases. Run both and fuse by rank with Reciprocal Rank Fusion (RRF), which is scale-free and needs no score calibration.
Previously
Filters constrain WHICH items are eligible; they don't fix similarity's own blind spot. Dense vectors understand meaning but miss exact tokens, while keyword search is the mirror image — so hybrid search runs both and fuses by rank with RRF, getting paraphrase and exact-token matches at once. We now have a feature-complete search on one machine. The last wall is the one from scene 1: one machine can't hold a billion vectors.
Scene 13
Hybrid search — fuse dense and sparse with RRF
Diagram
Two ranked lists flow into one. The LEFT lane is dense (vector) search — it ranks by meaning, so a paraphrase of the query scores well even without the exact word. The RIGHT lane is sparse (keyword/BM25) search — it ranks by literal token overlap. The CENTER column is the fused list: every document earns 1/(k+rank) from each list it appears in (k=60), those are summed, and the documents are re-sorted by that total — so a document both retrievers liked beats a document that topped only one list.
Diagram labels: dense = ranks by meaning (left lane); sparse = ranks by exact tokens (right lane); each list has a blind spot.
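The scoring rule in the diagram description can be written out as a formula. This is a sketch in notation chosen here for illustration: L_i for the i-th ranked list and rank_i(d) for the 1-based position of document d in that list are assumed symbols, not names from the diagram. The second line works the arithmetic for a document ranked 2nd in both lists against a document ranked 1st in only one.

```latex
\[
\mathrm{RRF}(d) \;=\; \sum_{i \,:\, d \in L_i} \frac{1}{k + \mathrm{rank}_i(d)}, \qquad k = 60
\]
\[
\underbrace{\frac{1}{60+2} + \frac{1}{60+2}}_{\text{2nd in both lists}} \approx 0.0323
\;>\;
\underbrace{\frac{1}{60+1}}_{\text{1st in one list only}} \approx 0.0164
\]
```

With k=60 the gap between neighboring ranks is small, so appearing in both lists counts for more than a slightly better position in one.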
One search box has to handle two kinds of query: plain English ('comfy gaming chair') and exact strings ('SKU-44871'). Watch what happens to the query 'cybersport desk' under two different retrievers. On the LEFT, *dense* search turns the query into a vector and ranks by MEANING: it returns 'Gaming desk' and 'Esports table' because they mean the same thing, even though neither contains the word 'cybersport'. On the RIGHT, *sparse* search ranks by the literal TOKENS: it returns 'Standing desk', 'Office desk', 'Desk lamp' because they all contain the word 'desk'. Notice the blind spots: dense never returns the rows that match only on the literal token, and sparse never returns the paraphrase 'Esports table'. Neither list alone is right.
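The sparse blind spot can be reproduced with a toy scorer. This is a minimal sketch, assuming a plain token-overlap count as a stand-in for BM25 (it is not BM25, and `token_overlap` is a hypothetical helper). The titles are the ones from the example above.

```python
def token_overlap(query, title):
    """Count lowercase whitespace tokens shared by query and title."""
    query_tokens = set(query.lower().split())
    title_tokens = set(title.lower().split())
    return len(query_tokens & title_tokens)


titles = ["Gaming desk", "Esports table", "Standing desk", "Office desk", "Desk lamp"]

# Score every title against the query by literal token overlap only.
scores = {title: token_overlap("cybersport desk", title) for title in titles}

# Titles with no shared token are invisible to a purely lexical retriever.
missed = [title for title, s in scores.items() if s == 0]
print(scores)
print("missed by token overlap:", missed)
```

Every title containing 'desk' gets a nonzero score, while 'Esports table' shares no token with the query and drops out entirely, no matter how close its meaning is. A real BM25 scorer would also weight terms by rarity and document length, which this sketch ignores.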
Implementation
rrf_fuse
merge ranked lists by reciprocal rank — scale-free
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    score = defaultdict(float)
    for ranking in ranked_lists:                       # dense, sparse, ...
        for rank, doc in enumerate(ranking, start=1):  # ranks are 1-based
            score[doc] += 1.0 / (k + rank)             # rank, not score
    return sorted(score, key=score.get, reverse=True)
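A short usage sketch may help here. The function is repeated so the block runs on its own, and the document ids and both rankings are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    score = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            score[doc] += 1.0 / (k + rank)
    return sorted(score, key=score.get, reverse=True)


# Hypothetical retriever outputs, best match first.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]

fused = rrf_fuse([dense, sparse])
print(fused)
```

Here doc_a (1st and 2nd) and doc_c (3rd and 1st) appear in both lists and land ahead of doc_b and doc_d, which each appear in only one. Documents with identical totals keep their first-seen order, because Python's sort is stable.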
naive_score_add
the broken fix — adds incomparable scales
from collections import defaultdict

def naive_score_add(dense, sparse):
    score = defaultdict(float)
    for doc, s in dense.items():     # cosine ~0..1
        score[doc] += s
    for doc, s in sparse.items():    # BM25 ~0..30  <-- swamps cosine
        score[doc] += s
    return sorted(score, key=score.get, reverse=True)
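To see the swamping concretely, the sketch below runs both fusers on the same inputs. All scores and document names are invented for illustration; the two functions are repeated so the block is self-contained.

```python
from collections import defaultdict


def naive_score_add(dense, sparse):
    score = defaultdict(float)
    for doc, s in dense.items():
        score[doc] += s
    for doc, s in sparse.items():
        score[doc] += s
    return sorted(score, key=score.get, reverse=True)


def rrf_fuse(ranked_lists, k=60):
    score = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            score[doc] += 1.0 / (k + rank)
    return sorted(score, key=score.get, reverse=True)


# Made-up scores: cosine similarity for dense, BM25-like values for sparse.
dense_scores = {"both": 0.91, "paraphrase": 0.88, "keyword_only": 0.20}
sparse_scores = {"keyword_only": 14.0, "both": 12.5}

naive = naive_score_add(dense_scores, sparse_scores)


def to_ranking(scores):
    # Convert each score dict to a ranking so only positions are fused.
    return sorted(scores, key=scores.get, reverse=True)


fused = rrf_fuse([to_ranking(dense_scores), to_ranking(sparse_scores)])

print("naive:", naive)
print("rrf:  ", fused)
```

In the naive sum, 'keyword_only' wins on the strength of its BM25 score alone, even though dense ranked it last; the cosine values are too small to move the result. RRF sees only positions, so 'both' (1st in dense, 2nd in sparse) comes out on top.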