Find the most similar thing — fast — Build a vector database (Pinecone / Weaviate / pgvector style)

Build a vector database (Pinecone / Weaviate / pgvector style) (15 scenes)

Scene 01 · Find the most similar thing — fast

Turn items into lists of numbers and 'most similar' becomes 'closest list' — but checking every one is exact and hopeless at a billion. That wall is why this whole system exists.

Scene 01

Find the most similar thing — fast

Watch

Diagram

A flat 2D map where every song is plotted as a point — and that point IS the song's embedding (its vector): the list of numbers we turned the song into. Left-to-right is acoustic→electronic, bottom-to-top is calm→energetic. The blue 'Now Playing' marker is the query; the faint lines fanning out from it are similarity search by brute force — measuring the straight-line gap from the query's vector to each song's vector, one at a time, to rank them by closeness. The side panel scales the same idea from 14 songs to a billion and shows the wall-clock blowing past the 10 ms budget.

Sources

An app needs to answer one question over and over: 'find the items most similar to this one.' To make 'similar' something a computer can compute, we turn every item into a short list of numbers — here, each song becomes two numbers: how acoustic-vs-electronic it is, and how calm-vs-energetic. Now 'the most similar song' just means 'the song whose list of numbers is closest to mine.' Watch: 'Now Playing' lights up, and the search measures the straight-line gap to every single song, one at a time, before it can rank them. For 14 songs that's instant. Hold onto one fact: to be SURE of the closest, it had to touch every point.

Implementation

FlatIndex.search

the brute-force scan: rank every stored vector, keep the best k

1def search(query, k):
2    heap = []  # k smallest distances so far
3    for vec in all_vectors:        # touches all N
4        d = distance(query, vec)
5        heap.push((d, vec.id))
6        if len(heap) > k:
7            heap.pop_largest()
8    return argsort(heap)[:k]       # the true top-k

distance

one gap = a sum over EVERY dimension of the vector

1def distance(query, vec):
2    total = 0.0
3    # every dimension counts — closeness lives in all D
4    for i in range(D):             # D = 1536
5        diff = query[i] - vec[i]
6        total += diff * diff       # one multiply-add
7    return total

FlatIndex.search_parallel

split the vectors across cores, then merge the partial top-k

1def search_parallel(query, k, cores):
2    shards = split(all_vectors, cores)
3    partials = parallel_map(
4        lambda shard: scan(shard, query, k),
5        shards,
6    )
7    # each core still scans its whole shard:
8    # sum of shard sizes == N, work is unchanged
9    return merge_top_k(partials, k)

Not sure what to ask? Tap a question — the staff engineer answers in the chat panel.

NextThe vector and the distance metric