Product Quantization — 6 KB to 96 bytes — Build a vector database (Pinecone / Weaviate / pgvector style)

Build a vector database (Pinecone / Weaviate / pgvector style) (15 scenes)

Scene 08 · Product Quantization — 6 KB to 96 bytes

Split each vector into subvectors and store the 1-byte id of each one's nearest prototype: 6 KB → ~96 bytes (~64×). The memory axis finally becomes a knob — at the cost of lossy distances.

Previously

HNSW's RAM bill is the old 6 TB wall in disguise. PQ chops each vector into chunks and stores only which of 256 prototypes each chunk is closest to — 6 KB becomes ~96 bytes. The memory axis is finally a knob we can turn. But the codes are lossy, so recall drops — which raises the question of how to get HNSW-or-better speed AND PQ's tiny footprint at the same time.

Scene 08

Product Quantization — 6 KB to 96 bytes

Watch

Diagram

One song's 1536-dim vector is the long bar on the left, sliced into m chunks. Each chunk fires an arrow into the codebook grid of 256 prototypes; the nearest prototype's id (a single byte, 0–255) replaces the chunk. The live byte counter is the whole point: 6,144 bytes of float32 collapse to m bytes. A recall meter sits beside it — it sags when the codes alone are used (they're lossy estimates) and snaps back up when re-rank re-scores the top candidates with the exact vectors. On the persistent Pareto chart, the memory axis (bubble size) is finally live: PQ's dot is the smallest bubble on the board.

Sources

Here's the move that finally shrinks memory. Take one song's vector — 1,536 numbers, stored as 4-byte floats, so 6,144 bytes (~6 KB) each. Slice it into m equal chunks. For each chunk, we trained a little catalogue of 256 typical chunk-shapes once, up front (the same k-means idea that drew IVF's cells, run separately inside each slice of the vector). Now we just store, per chunk, *which* of those 256 shapes it's nearest to — a number from 0 to 255, which fits in a single byte. Watch the chunks get rounded one by one and the byte counter fall: 6,144 bytes collapses to 96. That's a ~64× shrink. Distances later get estimated from those stored numbers using a small precomputed lookup table.

Implementation

PQ.encode

store each chunk as the id of its nearest prototype

1def encode(vector, codebooks):  # m codebooks, 256 each
2    chunks = split(vector, m)
3    codes = bytearray(m)
4    for j, chunk in enumerate(chunks):
5        best = argmin(
6            dist(chunk, proto)
7            for proto in codebooks[j]   # 256 protos
8        )
9        codes[j] = best                 # 0..255, one byte
10    return codes                        # m bytes total

PQ.searchADC

score the codes against a full-precision query

1def search(query, codes_table):
2    chunks = split(query, m)   # query stays full-precision
3    lut = [
4        [dist(chunks[j], proto) for proto in codebooks[j]]
5        for j in range(m)      # 256-wide table per chunk
6    ]
7    scores = []
8    for codes in codes_table:  # every stored vector
9        d = sum(lut[j][codes[j]] for j in range(m))
10        scores.append(d)       # summed table lookups
11    return argsort(scores)

PQ.rerank

re-score the shortlist with exact vectors

1def topK(query, k=10):
2    shortlist = search(query)[:N]   # N >> k, cheap codes
3    if not rerank_enabled:
4        return shortlist[:k]
5    rescored = [
6        (exact_dist(query, full_vectors[i]), i)
7        for i in shortlist          # only N exact reads
8    ]
9    return [i for _, i in sorted(rescored)][:k]

Not sure what to ask? Tap a question — the staff engineer answers in the chat panel.

PreviousExpress lanes — layers and ef_search NextIVFPQ — skip most cells, compress the rest