Build a vector database (Pinecone / Weaviate / pgvector style) (15 scenes)
Scene 01 · Find the most similar thing — fast
Turn items into lists of numbers and 'most similar' becomes 'closest list' — but checking every one is exact and hopeless at a billion. That wall is why this whole system exists.
Scene 01
Find the most similar thing — fast
Diagram
A flat 2D map where every song is plotted as a point — and that point IS the song's embedding (its vector): the list of numbers we turned the song into. Left-to-right is acoustic→electronic, bottom-to-top is calm→energetic. The blue 'Now Playing' marker is the query; the faint lines fanning out from it are similarity search by brute force — measuring the straight-line gap from the query's vector to each song's vector, one at a time, to rank them by closeness. The side panel scales the same idea from 14 songs to a billion and shows the wall-clock blowing past the 10 ms budget.
An app needs to answer one question over and over: 'find the items most similar to this one.' To make 'similar' something a computer can compute, we turn every item into a short list of numbers — here, each song becomes two numbers: how acoustic-vs-electronic it is, and how calm-vs-energetic. Now 'the most similar song' just means 'the song whose list of numbers is closest to mine.' Watch: 'Now Playing' lights up, and the search measures the straight-line gap to every single song, one at a time, before it can rank them. For 14 songs that's instant. Hold onto one fact: to be SURE of the closest, it had to touch every point.
Implementation
FlatIndex.search
the brute-force scan: rank every stored vector, keep the best k
1def search(query, k):2 heap = [] # k smallest distances so far3 for vec in all_vectors: # touches all N4 d = distance(query, vec)5 heap.push((d, vec.id))6 if len(heap) > k:7 heap.pop_largest()8 return argsort(heap)[:k] # the true top-k
distance
one gap = a sum over EVERY dimension of the vector
1def distance(query, vec):2 total = 0.03 # every dimension counts — closeness lives in all D4 for i in range(D): # D = 15365 diff = query[i] - vec[i]6 total += diff * diff # one multiply-add7 return total
FlatIndex.search_parallel
split the vectors across cores, then merge the partial top-k
1def search_parallel(query, k, cores):2 shards = split(all_vectors, cores)3 partials = parallel_map(4 lambda shard: scan(shard, query, k),5 shards,6 )7 # each core still scans its whole shard:8 # sum of shard sizes == N, work is unchanged9 return merge_top_k(partials, k)
Not sure what to ask? Tap a question — the staff engineer answers in the chat panel.