Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 01 · The same query: 30 minutes vs 200 ms
Same SELECT, same rows. ClickHouse runs 9000× faster than Postgres because the column store reads only the columns the query touches.
Scene 01
The same query: 30 minutes vs 200 ms
Diagram
One query bar fires against two engines side by side. Left is a **row store** (Postgres): each row's columns are stored together on disk. Right is a **column store** (ClickHouse): each column lives in its own file. Both panels show bytes pulled off disk and wall-clock time for the same SELECT.
QUERYSELECT country, avg(latency) FROM events WHERE ts > now()-1h GROUP BY countryNARROW · 3 / 30Postgres (row store)~9000x gapreads ALL 30 columns per rowBYTES READ360 GBWALL CLOCK30:00ClickHouse (column store)reads 3 of 30 column filesBYTES READ28 GBWALL CLOCK0:00.200Narrow query, wide table — the row store hauls all 30 columns; the column store reads only 3.
row store — every column of every row sits together on disk
column store — one file per column, read only what the query asks for
this 9000x gap is what 'OLAP' is named after — Online Analytical Processing
Same query, two engines. The row store pulls 360 GB off disk and spins to 30 minutes. The column store pulls 28 GB and finishes at 200 ms. Same hardware, same row count — only the on-disk shape differs.
Implementation
RowStore.scan(query)
every page holds all N columns — pays the full row width
1def scan(query): # Postgres-shaped
2 bytes_read = 0
3 for page in heap.pages(): # 8 KB pages, row-major
4 for row in page.rows(): # all N cols interleaved
5 bytes_read += row.width # = sum(width(c) for c in N)
6 tuple = decode(row) # 30 fields materialized
7 project = [tuple[c] for c in query.columns]
8 emit(project) # K-of-N kept, N-K discarded
9 return bytes_read # ~= table_size_bytes
ColumnStore.scan(query)
open one file per queried column — skip the rest entirely
1def scan(query): # ClickHouse-shaped
2 bytes_read = 0
3 files = [open(f'{c}.bin') for c in query.columns] # K files
4 # the other (N - K) column files are never opened
5 for granule in zipGranules(files): # vector of K cols
6 bytes_read += sum(len(b) for b in granule)
7 emit(granule) # already projected
8 return bytes_read # ~= (K / N) * table_size
compare(query, table)
the K/N ratio that the slider is sweeping
1def compare(query, table):
2 N = table.totalColumns # 30 or 100
3 K = len(query.columns) # 3, 30, or 1
4 row_bytes = table.size_bytes # always full table
5 col_bytes = table.size_bytes * (K / N)
6 ratio = row_bytes / col_bytes # = N / K
7 return (row_bytes, col_bytes, ratio)