Build a vector database (Pinecone / Weaviate / pgvector style)
15 scenes · ~105 min · build the primitive

Build your own vector database (Pinecone / Weaviate / pgvector style)

Approximate nearest-neighbor over a billion 1536-dim vectors in 10 ms. Build the index from scratch (HNSW, IVF, PQ), pay the recall-vs-latency tax explicitly, support filtered + hybrid search, and feel why every LLM stack in 2026 has a vector store next to its KV store.

  1. 01
  2. 02
  3. 03
  4. 04
  5. 05
  6. 06
  7. 07
  8. 08
  9. 09
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  1. 01
    Find the most similar thing — fast
    Turn items into lists of numbers and 'most similar' becomes 'closest list' — but checking every one is exact and hopeless at a billion. That wall is why this whole system exists.
    ~7 min
  2. 02
    The vector and the distance metric
    'Closest' isn't one thing — L2 is the gap between arrowtips, cosine is the angle — and a metric that mismatches the embedding silently returns garbage. Normalize once and they agree.
    ~7 min
  3. 03
    The Flat index — the 100%-recall baseline
    Wrap the brute-force scan as an index called Flat: recall 1.0 by definition. It's the yardstick every faster index is scored against — and it defines the two axes, recall@k and latency.
    ~7 min
  4. 04
    IVF — carve the space into cells
    k-means splits the space into Voronoi cells; at query time scan only the nprobe cells nearest the query — turning an all-N scan into a fraction of it. The first explicit recall-for-speed trade.
    ~7 min
  5. 05
    The boundary miss — one lonely point
    A cell wall is hard: a true neighbor a hair across it is invisible at low nprobe, even though it's closer than points you do return. Proximity in space ≠ membership in a probed cell.
    ~7 min
  6. 06
    HNSW — hop along a proximity graph
    Link each vector to its M nearest neighbors and search by greedy hops — no hard cell walls, so the lonely boundary point is just one more link away.
    ~7 min
  7. 07
    Express lanes — layers and ef_search
    Skip-list layers add express-lane nodes that cross huge distances in one hop, then descend for local refinement — roughly log search, with ef_search trading beam width for recall until it plateaus.
    ~7 min
  8. 08
    Product Quantization — 6 KB to 96 bytes
    Split each vector into subvectors and store the 1-byte id of each one's nearest prototype: 6 KB → ~96 bytes (~64×). The memory axis finally becomes a knob — at the cost of lossy distances.
    ~7 min
  9. 09
    IVFPQ — skip most cells, compress the rest
    Stack IVF (skip most cells) and PQ (compress the residual): the FAISS billion-scale workhorse — tiny memory and a fraction-of-N scan, paid for with compounded approximation.
    ~7 min
  10. 10
    The trilemma — pick two
    Recall, latency, memory: Flat maxes recall, HNSW speed-at-recall, IVFPQ memory — no index wins all three. Every deployment is a chosen point on the Pareto frontier.
    ~7 min
  11. 11
    Filtered search — pre vs post
    Real queries combine a metadata filter with similarity, and both naive orders break: post-filter can return fewer than k results; pre-filter sets up a subtler failure.
    ~7 min
  12. 12
    When the filter disconnects the graph
    Pre-filtering an HNSW search makes only matching nodes walkable — and if the path to a valid neighbor ran through a filtered-out node, the greedy walk dead-ends and recall craters.
    ~7 min
  13. 13
    Hybrid search — fuse dense and sparse with RRF
    Dense vectors capture meaning but miss exact tokens like SKUs and error codes; sparse keyword search nails them. Run both and fuse by rank with RRF — scale-free, no score calibration.
    ~7 min
  14. 14
    Distribute it — shards, scatter-gather, the LLM stack
    Shard vectors across nodes, scatter-gather every query, merge per-shard top-k — exact only if each shard over-fetches. This is the box that sits next to the KV store in every 2026 LLM app.
    ~7 min
  15. 15
    Design your vector database
    Capstone: pick the index, filter, hybrid, and sharding for RAG vs recommendation vs semantic cache vs billion-on-a-budget — each knob traceable to the scene that justified it.
    ~7 min