Compression — the column store's superpower — Build a columnar OLAP store (ClickHouse / Druid style)

Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)

Scene 03 · Compression — the column store's superpower

Adjacent column values are same-type and often similar, so RLE and dictionary encoding deliver 5–20× shrinkage that fails completely on row pages.

Previously

Once adjacent bytes are the same column, compression that fails on row pages suddenly works — and the win is per-column, not generic gzip.

Scene 03

Compression: the column store's superpower

Watch

Diagram

One column file (country.bin) shown in three modes. RAW is one tile per row carrying the original 2-byte string. The middle mode collapses adjacent repeats into (value, run_length) tuples — the byte counter drops and a WHERE clause becomes a sum over run lengths instead of a per-row compare. The third mode replaces every string with a small integer code and parks the unique values in a side dictionary — the WHERE clause becomes an 8-bit integer equality. The bytes counter top-right is the per-mode disk cost; the bottom panel shows the predicate the executor actually runs on the encoded form.

country.bin streams in — 20 rows of a low-cardinality string column (US, DE, JP). The byte counter top-right tracks the cost of the raw layout: 2 bytes per row, 40 bytes total. Notice how adjacent rows often share the same value — that's the property the next mode will exploit.

Implementation

Column.rleEncode(values)

collapse adjacent repeats into (value, run_length) tuples

1def rle_encode(values):
2    runs = []
3    for v in values:
4        last = runs[-1] if runs else None
5        if last and last.value == v:
6            last.run += 1   # extend current run
7        else:
8            runs.append(Run(value=v, run=1))
9    return runs            # on-disk form for RLE columns

Executor.countWhere(column, value='US')

the same WHERE clause, three encodings, three scans

1def count_where(col, target):
2    if col.mode == RAW:
3        return sum(1 for v in col.values if v == target)
4    if col.mode == RLE:
5        # no decompression — sum run lengths of matching runs
6        return sum(r.run for r in col.runs if r.value == target)
7    if col.mode == DICT:
8        code = col.dictionary.codeFor(target)  # 'US' -> 2
9        return sum(1 for c in col.codes if c == code)

Column.dictEncode(values)

intern each distinct value; column becomes integer codes

1def dict_encode(values):
2    dictionary = {}        # value -> small int code
3    codes = []
4    for v in values:
5        if v not in dictionary:
6            dictionary[v] = len(dictionary)  # next code
7        codes.append(dictionary[v])
8    # cardinality check — the LowCardinality(String) trap
9    # fires when len(dictionary) approaches len(values).
10    return codes, dictionary

PreviousSame table, two on-disk shapes NextVectorized execution — process batches, not tuples