Build a vector database (Pinecone / Weaviate / pgvector style) (15 scenes)
Scene 15 · Design your vector database
Capstone: pick the index, filter, hybrid, and sharding for RAG vs recommendation vs semantic cache vs billion-on-a-budget — each knob traceable to the scene that justified it.
Previously

The same primitives — vector, metric, index, filter, hybrid, shards — configure radically different systems. The capstone is choosing them deliberately for a real workload, with every knob traceable to the scene that justified it, and the trilemma as the compass.

Diagram
The capstone canvas. Four workload cards are docked across the top — each carries a one-line constraints summary (corpus size, latency budget, RAM, how much recall it can forgive). One card is live at a time. Down the side sits the palette of every knob the arc earned: index type, metric, hybrid, sharding, re-rank. As you choose knobs, the verifier panel grades each one against the LIVE workload — green 'fits', amber 'wasteful' (you're paying on an axis this workload doesn't constrain), red 'violation' (it breaks a hard limit) — and every verdict cites the scene that justifies it.
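One way to picture the verifier's three verdicts is a small grading function. This is only a sketch under stated assumptions, not the course's actual verifier: the names `Workload`, `IndexChoice`, and `grade` are hypothetical, the per-vector footprints assume 768-dimensional float32 embeddings, and the 256 GB figure for the RAG workload is invented for illustration (the card only says it fits in RAM). The 50M and 1B corpus sizes and the 128 GB cap come from the workload cards.

```python
from dataclasses import dataclass

GB = 10 ** 9  # decimal gigabytes, to keep the arithmetic easy to check


@dataclass(frozen=True)
class Workload:
    name: str
    n_vectors: int
    ram_cap_gb: float   # hard limit: exceeding it is a violation
    memory_bound: bool  # True when RAM is an axis this workload is squeezed on


@dataclass(frozen=True)
class IndexChoice:
    name: str
    bytes_per_vector: int  # rough footprint per stored vector (assumed)
    lossy: bool            # True when the index trades recall for memory


def grade(workload: Workload, choice: IndexChoice) -> str:
    """Return 'violation', 'wasteful', or 'fits' for one index knob."""
    footprint_gb = workload.n_vectors * choice.bytes_per_vector / GB
    if footprint_gb > workload.ram_cap_gb:
        return "violation"  # breaks the hard RAM limit
    if choice.lossy and not workload.memory_bound:
        return "wasteful"   # gives up recall to save RAM nobody asked to save
    return "fits"


# Assumed footprints: 768 float32 values = 3072 bytes, plus ~256 bytes of
# graph links for HNSW; 64-byte product-quantization codes for IVFPQ.
hnsw = IndexChoice("HNSW", bytes_per_vector=3072 + 256, lossy=False)
ivfpq = IndexChoice("IVFPQ", bytes_per_vector=64, lossy=True)

# 256 GB for RAG is a hypothetical figure; 128 GB is the card's hard cap.
rag = Workload("RAG over docs", 50_000_000, ram_cap_gb=256, memory_bound=False)
billion = Workload("Billion-on-a-budget", 1_000_000_000, ram_cap_gb=128, memory_bound=True)

for w in (rag, billion):
    for c in (hnsw, ivfpq):
        print(f"{w.name:20s} {c.name:6s} -> {grade(w, c)}")
```

Under these assumptions the same index earns different verdicts on different cards: HNSW fits the RAG workload but violates the billion-vector RAM cap, while IVFPQ fits the billion-vector workload but is wasteful for RAG, where memory is not the constrained axis.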
[Diagram contents: the capstone canvas]
Workload cards:
  RAG over docs (active): 50M docs · recall matters · fit…
  Recommendation: 500M items · latency-critical ·…
  Semantic cache: 2M Q&As · sub-2 ms · recall for…
  Billion-on-a-budget: 1B vectors · 128 GB RAM hard ca…
Design palette:
  Index: Flat (exact) | IVF + nprobe | HNSW + M + ef_search | IVFPQ (compressed)
  Metric: cosine, normalized at … | dot product, un-normal…
  Hybrid: dense + sparse, fused …
  Sharding: clustered shards + ove… | single node (no shardi…
  Re-rank: re-rank top-N with exa…
Verifier:
  FITS · HNSW + M + ef_search: 50M fits in RAM and recall matters →…
  FITS · cosine, normalized at ingest: Normalize once at ingest, then the c…
  FITS · single node (no sharding): 50M fits one node, so skip the distr…
Configuring: RAG over docs (50M docs · recall matters · fits in RAM · moderate latency)
four workloads, one toolkit
every knob traces back to a scene →
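The "cosine, normalized at ingest" knob in the palette rests on a standard identity: once both vectors are scaled to unit length, their dot product equals their cosine similarity, so the norm only has to be computed once per vector. A minimal pure-Python check of that identity follows; the helper names and sample vectors are illustrative and not taken from the course.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length; done once, at ingest time."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed the long way, with both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


doc = [3.0, 4.0, 0.0]
query = [1.0, 2.0, 2.0]

# Normalize each vector once, then a bare dot product is enough at query time.
doc_unit = normalize(doc)
query_unit = normalize(query)

print(cosine(doc, query))         # cosine on the raw vectors
print(dot(doc_unit, query_unit))  # dot product on the unit vectors
```

Both printed values agree up to floating-point rounding, which is why a store that normalizes at ingest can serve cosine queries with the cheaper dot-product kernel.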
Here is everything the arc built, in one place. Four workload cards are docked across the top — *RAG over docs*, *Recommendation*, *Semantic cache*, *Billion-on-a-budget* — and each card states its real constraints: how many vectors, how tight the latency budget, how much RAM, and how much being wrong actually costs. Down the side is the palette: the *index* choices (Flat, IVF, HNSW, IVFPQ), the *metric* choice, *hybrid* (dense + sparse fused by RRF), *sharding*, and *re-rank*. Every one of those knobs was earned in an earlier scene. The job now is not to learn anything new — it's to read each workload's constraints off the *trilemma* (recall, latency, memory) and pick the honest configuration. The same toolkit; four different right answers.
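For the *hybrid* knob, reciprocal rank fusion (RRF) can be sketched in a few lines: each document scores the sum of 1 / (k + rank) over every ranked list it appears in, so only ranks matter and the dense and sparse scores never need to share a scale. The function name `rrf_fuse`, the document ids, and the constant k = 60 (a commonly used default, not a value stated here) are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a dense (vector) search and a sparse (keyword) search.
dense_hits = ["d3", "d1", "d2"]
sparse_hits = ["d1", "d4", "d3"]

print(rrf_fuse([dense_hits, sparse_hits]))
```

Documents that both retrievers rank highly (here `d1` and `d3`) rise above documents that only one retriever found, which is the behavior the hybrid knob is paying for.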