Build a vector database (Pinecone / Weaviate / pgvector style) (15 scenes)
Scene 08 · Product Quantization — 6 KB to 96 bytes
Split each vector into subvectors and store the 1-byte id of each one's nearest prototype: 6 KB → ~96 bytes (~64×). The memory axis finally becomes a knob — at the cost of lossy distances.
Previously
HNSW's RAM bill is the old 6 TB wall in disguise. PQ chops each vector into chunks and stores only which of 256 prototypes each chunk is closest to — 6 KB becomes ~96 bytes. The memory axis is finally a knob we can turn. But the codes are lossy, so recall drops — which raises the question of how to get HNSW-or-better speed AND PQ's tiny footprint at the same time.
Scene 08
Product Quantization — 6 KB to 96 bytes
Diagram
One song's 1536-dim vector is the long bar on the left, sliced into m chunks. Each chunk fires an arrow into the codebook grid of 256 prototypes; the nearest prototype's id (a single byte, 0–255) replaces the chunk. The live byte counter is the whole point: 6,144 bytes of float32 collapse to m bytes. A recall meter sits beside it — it sags when the codes alone are used (they're lossy estimates) and snaps back up when re-rank re-scores the top candidates with the exact vectors. On the persistent Pareto chart, the memory axis (bubble size) is finally live: PQ's dot is the smallest bubble on the board.
Here's the move that finally shrinks memory. Take one song's vector — 1,536 numbers, stored as 4-byte floats, so 6,144 bytes (~6 KB) each. Slice it into m equal chunks. For each chunk, we trained a little catalogue of 256 typical chunk-shapes once, up front (the same k-means idea that drew IVF's cells, run separately inside each slice of the vector). Now we just store, per chunk, *which* of those 256 shapes it's nearest to — a number from 0 to 255, which fits in a single byte. Watch the chunks get rounded one by one and the byte counter fall: 6,144 bytes collapses to 96. That's a ~64× shrink. Distances later get estimated from those stored numbers using a small precomputed lookup table.
Implementation
PQ.encode
store each chunk as the id of its nearest prototype
1def encode(vector, codebooks): # m codebooks, 256 each2 chunks = split(vector, m)3 codes = bytearray(m)4 for j, chunk in enumerate(chunks):5 best = argmin(6 dist(chunk, proto)7 for proto in codebooks[j] # 256 protos8 )9 codes[j] = best # 0..255, one byte10 return codes # m bytes total
PQ.searchADC
score the codes against a full-precision query
1def search(query, codes_table):2 chunks = split(query, m) # query stays full-precision3 lut = [4 [dist(chunks[j], proto) for proto in codebooks[j]]5 for j in range(m) # 256-wide table per chunk6 ]7 scores = []8 for codes in codes_table: # every stored vector9 d = sum(lut[j][codes[j]] for j in range(m))10 scores.append(d) # summed table lookups11 return argsort(scores)
PQ.rerank
re-score the shortlist with exact vectors
1def topK(query, k=10):2 shortlist = search(query)[:N] # N >> k, cheap codes3 if not rerank_enabled:4 return shortlist[:k]5 rescored = [6 (exact_dist(query, full_vectors[i]), i)7 for i in shortlist # only N exact reads8 ]9 return [i for _, i in sorted(rescored)][:k]
Not sure what to ask? Tap a question — the staff engineer answers in the chat panel.