Build a columnar OLAP store (ClickHouse / Druid style)
13 scenes · ~91 min · build the primitive

Build your own columnar OLAP store (ClickHouse / Druid style)

OLTP picks one row by key; OLAP scans a billion rows of one column and asks for a percentile. Build the analytical engine that makes that fast: columnar layout, dictionary/RLE/delta compression, vectorized execution, late materialization, MPP shuffle. Internalize why Postgres is 1000× slower than ClickHouse on the same query and why the inverse is also true.

  1. 01
  2. 02
  3. 03
  4. 04
  5. 05
  6. 06
  7. 07
  8. 08
  9. 09
  10. 10
  11. 11
  12. 12
  13. 13
  1. 01
    The same query: 30 minutes vs 200 ms
    Same SELECT, same rows. ClickHouse runs 9000× faster than Postgres because the column store reads only the columns the query touches.
    ~7 min
  2. 02
    Same table, two on-disk shapes
    Row store interleaves a row's columns contiguously; column store stores each column in its own file. Same rows, rotated 90°.
    ~7 min
  3. 03
    Compression — the column store's superpower
    Adjacent column values are same-type and often similar, so RLE and dictionary encoding deliver 5–20× shrinkage that fails completely on row pages.
    ~7 min
  4. 04
    Vectorized execution — process batches, not tuples
    Tuple-at-a-time Volcano is interpreter overhead; processing 1024–8192 column values per call lets the CPU emit SIMD inner loops.
    ~7 min
  5. 05
    Writes must be bulk, not per-row
    Every INSERT touches every column file; per-row inserts on a 30-column table pay 30× the per-column overhead. Batches or async_insert.
    ~7 min
  6. 06
    A part — one batch, frozen on disk
    Each batched write lands as an immutable directory of column files plus an index — a part. A table is a stack of parts.
    ~7 min
  7. 07
    Merge — and the 'too many parts' crash
    Background worker fuses small parts into bigger ones; when write rate exceeds merge rate, parts_to_throw_insert fires and inserts get rejected.
    ~7 min
  8. 08
    ORDER BY — filtering becomes range-scan
    Rows inside a part are sorted by ORDER BY; WHERE on a leftmost-prefix column is a contiguous range, anything else is a full scan.
    ~7 min
  9. 09
    Granules and the sparse primary index
    Rows are grouped into 8192-row granules; one index entry per granule keeps the whole table's index in RAM, binary-searched in microseconds.
    ~7 min
  10. 10
    Skip indexes — prove a granule has no match, then skip it
    Per-granule sketches (minmax / set / bloom) prove 'this granule cannot contain a match' and skip reading it — only useful when values are clustered.
    ~7 min
  11. 11
    Materialized views are INSERT triggers
    A column-store MV is not a refresh — it's a trigger over the incoming block. Direct INSERT into the target silently bypasses it and drifts.
    ~7 min
  12. 12
    Joins — denormalize or pay
    Column-store joins are RAM-bound or shuffle-bound; the canonical OLAP answer is to denormalize at write time — storage is cheap, latency is the constraint.
    ~7 min
  13. 13
    Design canvas — pick a workload, ship a schema
    Every knob — sort key, partition, skip index, MV, denormalization — is a workload-shaped choice. The right observability schema is wrong for BI.
    ~7 min