Build your own columnar OLAP store (ClickHouse / Druid style)

OLTP picks one row by key; OLAP scans a billion rows of one column and asks for a percentile. Build the analytical engine that makes that fast: columnar layout, dictionary/RLE/delta compression, vectorized execution, late materialization, MPP shuffle. Internalize why Postgres is 1000× slower than ClickHouse on the same query and why the inverse is also true.

01
The same query: 30 minutes vs 200 ms
Same SELECT, same rows. ClickHouse runs 9000× faster than Postgres because the column store reads only the columns the query touches.
~7 min
02
Same table, two on-disk shapes
Row store interleaves a row's columns contiguously; column store stores each column in its own file. Same rows, rotated 90°.
~7 min
03
Compression — the column store's superpower
Adjacent column values are same-type and often similar, so RLE and dictionary encoding deliver 5–20× shrinkage that fails completely on row pages.
~7 min
04
Vectorized execution — process batches, not tuples
Tuple-at-a-time Volcano is interpreter overhead; processing 1024–8192 column values per call lets the CPU emit SIMD inner loops.
~7 min
05
Writes must be bulk, not per-row
Every INSERT touches every column file; per-row inserts on a 30-column table pay 30× the per-column overhead. Batches or async_insert.
~7 min
06
A part — one batch, frozen on disk
Each batched write lands as an immutable directory of column files plus an index — a part. A table is a stack of parts.
~7 min
07
Merge — and the 'too many parts' crash
Background worker fuses small parts into bigger ones; when write rate exceeds merge rate, parts_to_throw_insert fires and inserts get rejected.
~7 min
08
ORDER BY — filtering becomes range-scan
Rows inside a part are sorted by ORDER BY; WHERE on a leftmost-prefix column is a contiguous range, anything else is a full scan.
~7 min
09
Granules and the sparse primary index
Rows are grouped into 8192-row granules; one index entry per granule keeps the whole table's index in RAM, binary-searched in microseconds.
~7 min
10
Skip indexes — prove a granule has no match, then skip it
Per-granule sketches (minmax / set / bloom) prove 'this granule cannot contain a match' and skip reading it — only useful when values are clustered.
~7 min
11
Materialized views are INSERT triggers
A column-store MV is not a refresh — it's a trigger over the incoming block. Direct INSERT into the target silently bypasses it and drifts.
~7 min
12
Joins — denormalize or pay
Column-store joins are RAM-bound or shuffle-bound; the canonical OLAP answer is to denormalize at write time — storage is cheap, latency is the constraint.
~7 min
13
Design canvas — pick a workload, ship a schema
Every knob — sort key, partition, skip index, MV, denormalization — is a workload-shaped choice. The right observability schema is wrong for BI.
~7 min