All scenes
Build a vector database (Pinecone / Weaviate / pgvector style)
15 scenes · ~105 min · build the primitive
Build your own vector database (Pinecone / Weaviate / pgvector style)
Approximate nearest-neighbor over a billion 1536-dim vectors in 10 ms. Build the index from scratch (HNSW, IVF, PQ), pay the recall-vs-latency tax explicitly, support filtered + hybrid search, and feel why every LLM stack in 2026 has a vector store next to its KV store.
- 01Find the most similar thing — fastTurn items into lists of numbers and 'most similar' becomes 'closest list' — but checking every one is exact and hopeless at a billion. That wall is why this whole system exists.~7 min
- 02The vector and the distance metric'Closest' isn't one thing — L2 is the gap between arrowtips, cosine is the angle — and a metric that mismatches the embedding silently returns garbage. Normalize once and they agree.~7 min
- 03The Flat index — the 100%-recall baselineWrap the brute-force scan as an index called Flat: recall 1.0 by definition. It's the yardstick every faster index is scored against — and it defines the two axes, recall@k and latency.~7 min
- 04IVF — carve the space into cellsk-means splits the space into Voronoi cells; at query time scan only the nprobe cells nearest the query — turning an all-N scan into a fraction of it. The first explicit recall-for-speed trade.~7 min
- 05The boundary miss — one lonely pointA cell wall is hard: a true neighbor a hair across it is invisible at low nprobe, even though it's closer than points you do return. Proximity in space ≠ membership in a probed cell.~7 min
- 06HNSW — hop along a proximity graphLink each vector to its M nearest neighbors and search by greedy hops — no hard cell walls, so the lonely boundary point is just one more link away.~7 min
- 07Express lanes — layers and ef_searchSkip-list layers add express-lane nodes that cross huge distances in one hop, then descend for local refinement — roughly log search, with ef_search trading beam width for recall until it plateaus.~7 min
- 08Product Quantization — 6 KB to 96 bytesSplit each vector into subvectors and store the 1-byte id of each one's nearest prototype: 6 KB → ~96 bytes (~64×). The memory axis finally becomes a knob — at the cost of lossy distances.~7 min
- 09IVFPQ — skip most cells, compress the restStack IVF (skip most cells) and PQ (compress the residual): the FAISS billion-scale workhorse — tiny memory and a fraction-of-N scan, paid for with compounded approximation.~7 min
- 10The trilemma — pick twoRecall, latency, memory: Flat maxes recall, HNSW speed-at-recall, IVFPQ memory — no index wins all three. Every deployment is a chosen point on the Pareto frontier.~7 min
- 11Filtered search — pre vs postReal queries combine a metadata filter with similarity, and both naive orders break: post-filter can return fewer than k results; pre-filter sets up a subtler failure.~7 min
- 12When the filter disconnects the graphPre-filtering an HNSW search makes only matching nodes walkable — and if the path to a valid neighbor ran through a filtered-out node, the greedy walk dead-ends and recall craters.~7 min
- 13Hybrid search — fuse dense and sparse with RRFDense vectors capture meaning but miss exact tokens like SKUs and error codes; sparse keyword search nails them. Run both and fuse by rank with RRF — scale-free, no score calibration.~7 min
- 14Distribute it — shards, scatter-gather, the LLM stackShard vectors across nodes, scatter-gather every query, merge per-shard top-k — exact only if each shard over-fetches. This is the box that sits next to the KV store in every 2026 LLM app.~7 min
- 15Design your vector databaseCapstone: pick the index, filter, hybrid, and sharding for RAG vs recommendation vs semantic cache vs billion-on-a-budget — each knob traceable to the scene that justified it.~7 min