Build a vector database (Pinecone / Weaviate / pgvector style) (15 scenes)
Scene 14 · Distribute it — shards, scatter-gather, the LLM stack
Shard vectors across nodes, scatter-gather every query, merge per-shard top-k — exact only if each shard over-fetches. This is the box that sits next to the KV store in every 2026 LLM app.
Previously
Hybrid search gave us a feature-complete engine on one machine — but scene 1's wall returns: a billion vectors don't fit on one box. So we shard across nodes and scatter-gather: every shard returns its local best, a coordinator merges the global best. And once it's distributed, it's worth stepping back to see where this entire engine sits in a real 2026 system.
Scene 14
Distribute it — shards, scatter-gather, and the LLM stack
Diagram
A coordinator on top fans the query 'Now Playing' out to N shards below, each a mini vector index (HNSW or IVFPQ) holding a slice of the data. Every shard returns its own local closest results; the coordinator merges them into the global top-3. A framing panel can flip on to show where this whole box lives in 2026: a vector store (fuzzy lookups by meaning) right beside a KV store (exact lookups by id).
A billion vectors can't live on one machine, so we split — *shard* — them across several nodes, and copy (*replicate*) each shard so a node failure doesn't lose data. Each shard is a complete mini vector index from earlier scenes (an HNSW graph or an IVFPQ index). When a query arrives, the coordinator fans it out to every shard at once; each shard searches its own slice and returns its local closest songs. Watch the three shards report back, one by one. Notice the asymmetry that makes this affordable: building those indexes is slow and expensive, but it's done once and amortized — queries are cheap and run hot, millions of times over the same built index.
Implementation
Coordinator.search
fan the query to every shard, then merge their answers
1def search(query, k, candidates):2 targets = route(query)3 # scatter: ask all shards in parallel4 replies = parallel_map(5 targets,6 lambda s: s.local_search(query, candidates),7 )8 # gather: pool every returned candidate9 pooled = flatten(replies)10 pooled.sort(key=lambda c: dist(query, c))11 return pooled[:k]
Shard.local_search
each shard searches its own slice and returns its closest few
1def local_search(query, candidates):2 # self.index is a full mini vector index:3 # an HNSW graph or an IVFPQ index over this slice4 hits = self.index.query(query)5 hits.sort(key=lambda c: dist(query, c))6 # only the closest `candidates` leave the shard7 return hits[:candidates]
Coordinator.route
which shards a query has to touch
1def route(query):2 if sharding == 'clustered':3 # similar vectors share a shard, so a query4 # only needs the nearest buckets5 return nearest_shards(query)6 # random spread: every query must hit all shards7 return all_shards
Not sure what to ask? Tap a question — the staff engineer answers in the chat panel.