All scenes
Build a columnar OLAP store (ClickHouse / Druid style)
13 scenes · ~91 min · build the primitive
Build your own columnar OLAP store (ClickHouse / Druid style)
OLTP picks one row by key; OLAP scans a billion rows of one column and asks for a percentile. Build the analytical engine that makes that fast: columnar layout, dictionary/RLE/delta compression, vectorized execution, late materialization, MPP shuffle. Internalize why Postgres is 1000× slower than ClickHouse on the same query and why the inverse is also true.
- 01The same query: 30 minutes vs 200 msSame SELECT, same rows. ClickHouse runs 9000× faster than Postgres because the column store reads only the columns the query touches.~7 min
- 02Same table, two on-disk shapesRow store interleaves a row's columns contiguously; column store stores each column in its own file. Same rows, rotated 90°.~7 min
- 03Compression — the column store's superpowerAdjacent column values are same-type and often similar, so RLE and dictionary encoding deliver 5–20× shrinkage that fails completely on row pages.~7 min
- 04Vectorized execution — process batches, not tuplesTuple-at-a-time Volcano is interpreter overhead; processing 1024–8192 column values per call lets the CPU emit SIMD inner loops.~7 min
- 05Writes must be bulk, not per-rowEvery INSERT touches every column file; per-row inserts on a 30-column table pay 30× the per-column overhead. Batches or async_insert.~7 min
- 06A part — one batch, frozen on diskEach batched write lands as an immutable directory of column files plus an index — a part. A table is a stack of parts.~7 min
- 07Merge — and the 'too many parts' crashBackground worker fuses small parts into bigger ones; when write rate exceeds merge rate, parts_to_throw_insert fires and inserts get rejected.~7 min
- 08ORDER BY — filtering becomes range-scanRows inside a part are sorted by ORDER BY; WHERE on a leftmost-prefix column is a contiguous range, anything else is a full scan.~7 min
- 09Granules and the sparse primary indexRows are grouped into 8192-row granules; one index entry per granule keeps the whole table's index in RAM, binary-searched in microseconds.~7 min
- 10Skip indexes — prove a granule has no match, then skip itPer-granule sketches (minmax / set / bloom) prove 'this granule cannot contain a match' and skip reading it — only useful when values are clustered.~7 min
- 11Materialized views are INSERT triggersA column-store MV is not a refresh — it's a trigger over the incoming block. Direct INSERT into the target silently bypasses it and drifts.~7 min
- 12Joins — denormalize or payColumn-store joins are RAM-bound or shuffle-bound; the canonical OLAP answer is to denormalize at write time — storage is cheap, latency is the constraint.~7 min
- 13Design canvas — pick a workload, ship a schemaEvery knob — sort key, partition, skip index, MV, denormalization — is a workload-shaped choice. The right observability schema is wrong for BI.~7 min