Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 03 · Compression — the column store's superpower
Adjacent column values are same-type and often similar, so RLE and dictionary encoding deliver 5–20× shrinkage that fails completely on row pages.
Previously

Once adjacent bytes are the same column, compression that fails on row pages suddenly works — and the win is per-column, not generic gzip.

Scene 03
Compression: the column store's superpower
Diagram
One column file (country.bin) shown in three modes. RAW is one tile per row carrying the original 2-byte string. The middle mode collapses adjacent repeats into (value, run_length) tuples — the byte counter drops and a WHERE clause becomes a sum over run lengths instead of a per-row compare. The third mode replaces every string with a small integer code and parks the unique values in a side dictionary — the WHERE clause becomes an 8-bit integer equality. The bytes counter top-right is the per-mode disk cost; the bottom panel shows the predicate the executor actually runs on the encoded form.
COLUMN FILEcountry.binWHERE country = 'US'MODE: RAWBYTES0 B1 string tilesRAW VALUESOPERATES ON ENCODED FORM:memcmp(bytes, 'US') // 2-byte string compare per rowRaw: each row carries its own 2-byte country code. WHERE country='US' is a per-row string compare.
country.bin streams in — 20 rows of a low-cardinality string column (US, DE, JP). The byte counter top-right tracks the cost of the raw layout: 2 bytes per row, 40 bytes total. Notice how adjacent rows often share the same value — that's the property the next mode will exploit.
Implementation
Column.rleEncode(values)
collapse adjacent repeats into (value, run_length) tuples
1def rle_encode(values):
2 runs = []
3 for v in values:
4 last = runs[-1] if runs else None
5 if last and last.value == v:
6 last.run += 1 # extend current run
7 else:
8 runs.append(Run(value=v, run=1))
9 return runs # on-disk form for RLE columns
Executor.countWhere(column, value='US')
the same WHERE clause, three encodings, three scans
1def count_where(col, target):
2 if col.mode == RAW:
3 return sum(1 for v in col.values if v == target)
4 if col.mode == RLE:
5 # no decompression — sum run lengths of matching runs
6 return sum(r.run for r in col.runs if r.value == target)
7 if col.mode == DICT:
8 code = col.dictionary.codeFor(target) # 'US' -> 2
9 return sum(1 for c in col.codes if c == code)
Column.dictEncode(values)
intern each distinct value; column becomes integer codes
1def dict_encode(values):
2 dictionary = {} # value -> small int code
3 codes = []
4 for v in values:
5 if v not in dictionary:
6 dictionary[v] = len(dictionary) # next code
7 codes.append(dictionary[v])
8 # cardinality check — the LowCardinality(String) trap
9 # fires when len(dictionary) approaches len(values).
10 return codes, dictionary