Build your own vector database (Pinecone / Weaviate / pgvector style)

Approximate nearest-neighbor over a billion 1536-dim vectors in 10 ms. Build the index from scratch (HNSW, IVF, PQ), pay the recall-vs-latency tax explicitly, support filtered + hybrid search, and feel why every LLM stack in 2026 has a vector store next to its KV store.

01
Find the most similar thing — fast
Turn items into lists of numbers and 'most similar' becomes 'closest list' — but checking every one is exact and hopeless at a billion. That wall is why this whole system exists.
~7 min
02
The vector and the distance metric
'Closest' isn't one thing — L2 is the gap between arrowtips, cosine is the angle — and a metric that mismatches the embedding silently returns garbage. Normalize once and they agree.
~7 min
03
The Flat index — the 100%-recall baseline
Wrap the brute-force scan as an index called Flat: recall 1.0 by definition. It's the yardstick every faster index is scored against — and it defines the two axes, recall@k and latency.
~7 min
04
IVF — carve the space into cells
k-means splits the space into Voronoi cells; at query time scan only the nprobe cells nearest the query — turning an all-N scan into a fraction of it. The first explicit recall-for-speed trade.
~7 min
05
The boundary miss — one lonely point
A cell wall is hard: a true neighbor a hair across it is invisible at low nprobe, even though it's closer than points you do return. Proximity in space ≠ membership in a probed cell.
~7 min
06
HNSW — hop along a proximity graph
Link each vector to its M nearest neighbors and search by greedy hops — no hard cell walls, so the lonely boundary point is just one more link away.
~7 min
07
Express lanes — layers and ef_search
Skip-list layers add express-lane nodes that cross huge distances in one hop, then descend for local refinement — roughly log search, with ef_search trading beam width for recall until it plateaus.
~7 min
08
Product Quantization — 6 KB to 96 bytes
Split each vector into subvectors and store the 1-byte id of each one's nearest prototype: 6 KB → ~96 bytes (~64×). The memory axis finally becomes a knob — at the cost of lossy distances.
~7 min
09
IVFPQ — skip most cells, compress the rest
Stack IVF (skip most cells) and PQ (compress the residual): the FAISS billion-scale workhorse — tiny memory and a fraction-of-N scan, paid for with compounded approximation.
~7 min
10
The trilemma — pick two
Recall, latency, memory: Flat maxes recall, HNSW speed-at-recall, IVFPQ memory — no index wins all three. Every deployment is a chosen point on the Pareto frontier.
~7 min
11
Filtered search — pre vs post
Real queries combine a metadata filter with similarity, and both naive orders break: post-filter can return fewer than k results; pre-filter sets up a subtler failure.
~7 min
12
When the filter disconnects the graph
Pre-filtering an HNSW search makes only matching nodes walkable — and if the path to a valid neighbor ran through a filtered-out node, the greedy walk dead-ends and recall craters.
~7 min
13
Hybrid search — fuse dense and sparse with RRF
Dense vectors capture meaning but miss exact tokens like SKUs and error codes; sparse keyword search nails them. Run both and fuse by rank with RRF — scale-free, no score calibration.
~7 min
14
Distribute it — shards, scatter-gather, the LLM stack
Shard vectors across nodes, scatter-gather every query, merge per-shard top-k — exact only if each shard over-fetches. This is the box that sits next to the KV store in every 2026 LLM app.
~7 min
15
Design your vector database
Capstone: pick the index, filter, hybrid, and sharding for RAG vs recommendation vs semantic cache vs billion-on-a-budget — each knob traceable to the scene that justified it.
~7 min