Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 03 · Compression — the column store's superpower
Adjacent column values are same-type and often similar, so RLE and dictionary encoding deliver 5–20× shrinkage that fails completely on row pages.
Previously
Once adjacent bytes are the same column, compression that fails on row pages suddenly works — and the win is per-column, not generic gzip.
Scene 03
Compression: the column store's superpower
Diagram
One column file (country.bin) shown in three modes. RAW is one tile per row carrying the original 2-byte string. The middle mode collapses adjacent repeats into (value, run_length) tuples — the byte counter drops and a WHERE clause becomes a sum over run lengths instead of a per-row compare. The third mode replaces every string with a small integer code and parks the unique values in a side dictionary — the WHERE clause becomes an 8-bit integer equality. The bytes counter top-right is the per-mode disk cost; the bottom panel shows the predicate the executor actually runs on the encoded form.
country.bin streams in — 20 rows of a low-cardinality string column (US, DE, JP). The byte counter top-right tracks the cost of the raw layout: 2 bytes per row, 40 bytes total. Notice how adjacent rows often share the same value — that's the property the next mode will exploit.
Implementation
Column.rleEncode(values)
collapse adjacent repeats into (value, run_length) tuples
1def rle_encode(values):2 runs = []3 for v in values:4 last = runs[-1] if runs else None5 if last and last.value == v:6 last.run += 1 # extend current run7 else:8 runs.append(Run(value=v, run=1))9 return runs # on-disk form for RLE columns
Executor.countWhere(column, value='US')
the same WHERE clause, three encodings, three scans
1def count_where(col, target):2 if col.mode == RAW:3 return sum(1 for v in col.values if v == target)4 if col.mode == RLE:5 # no decompression — sum run lengths of matching runs6 return sum(r.run for r in col.runs if r.value == target)7 if col.mode == DICT:8 code = col.dictionary.codeFor(target) # 'US' -> 29 return sum(1 for c in col.codes if c == code)
Column.dictEncode(values)
intern each distinct value; column becomes integer codes
1def dict_encode(values):2 dictionary = {} # value -> small int code3 codes = []4 for v in values:5 if v not in dictionary:6 dictionary[v] = len(dictionary) # next code7 codes.append(dictionary[v])8 # cardinality check — the LowCardinality(String) trap9 # fires when len(dictionary) approaches len(values).10 return codes, dictionary