Distribute it — shards, scatter-gather, the LLM stack — Build a vector database (Pinecone / Weaviate / pgvector style)

Build a vector database (Pinecone / Weaviate / pgvector style) (15 scenes)

Scene 14 · Distribute it — shards, scatter-gather, the LLM stack

Shard vectors across nodes, scatter-gather every query, merge per-shard top-k — exact only if each shard over-fetches. This is the box that sits next to the KV store in every 2026 LLM app.

Previously

Hybrid search gave us a feature-complete engine on one machine — but scene 1's wall returns: a billion vectors don't fit on one box. So we shard across nodes and scatter-gather: every shard returns its local best, a coordinator merges the global best. And once it's distributed, it's worth stepping back to see where this entire engine sits in a real 2026 system.

Scene 14

Distribute it — shards, scatter-gather, and the LLM stack

Watch

Diagram

A coordinator on top fans the query 'Now Playing' out to N shards below, each a mini vector index (HNSW or IVFPQ) holding a slice of the data. Every shard returns its own local closest results; the coordinator merges them into the global top-3. A framing panel can flip on to show where this whole box lives in 2026: a vector store (fuzzy lookups by meaning) right beside a KV store (exact lookups by id).

Sources

A billion vectors can't live on one machine, so we split — *shard* — them across several nodes, and copy (*replicate*) each shard so a node failure doesn't lose data. Each shard is a complete mini vector index from earlier scenes (an HNSW graph or an IVFPQ index). When a query arrives, the coordinator fans it out to every shard at once; each shard searches its own slice and returns its local closest songs. Watch the three shards report back, one by one. Notice the asymmetry that makes this affordable: building those indexes is slow and expensive, but it's done once and amortized — queries are cheap and run hot, millions of times over the same built index.

Implementation

Coordinator.search

fan the query to every shard, then merge their answers

1def search(query, k, candidates):
2    targets = route(query)
3    # scatter: ask all shards in parallel
4    replies = parallel_map(
5        targets,
6        lambda s: s.local_search(query, candidates),
7    )
8    # gather: pool every returned candidate
9    pooled = flatten(replies)
10    pooled.sort(key=lambda c: dist(query, c))
11    return pooled[:k]

Shard.local_search

each shard searches its own slice and returns its closest few

1def local_search(query, candidates):
2    # self.index is a full mini vector index:
3    # an HNSW graph or an IVFPQ index over this slice
4    hits = self.index.query(query)
5    hits.sort(key=lambda c: dist(query, c))
6    # only the closest `candidates` leave the shard
7    return hits[:candidates]

Coordinator.route

which shards a query has to touch

1def route(query):
2    if sharding == 'clustered':
3        # similar vectors share a shard, so a query
4        # only needs the nearest buckets
5        return nearest_shards(query)
6    # random spread: every query must hit all shards
7    return all_shards

Not sure what to ask? Tap a question — the staff engineer answers in the chat panel.

PreviousHybrid search — fuse dense and sparse with RRF NextDesign your vector database