Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 02 · Same table, two on-disk shapes
A row store keeps all of a row's columns next to each other on disk; a column store keeps each column in its own file. Same rows, rotated 90°.
Previously

If both engines hold the same 1.2 billion rows, the only thing that can explain a 9000x gap is what each engine actually pulls off disk — so we need to look at the physical layout under the rows. Here's that physical layout: same rows, two arrangements.

Diagram
The same 5-row × 4-column events table laid out two ways. LEFT panel: a **page (row store)** — one on-disk file holding every column of row 1, then every column of row 2, and so on. RIGHT panel: four **column file**s (id.bin / country.bin / latency_ms.bin / ts.bin), each holding only its column's values in the same row order. The bottom labels track bytes-read on each side, and the highlight overlay shows which tiles the current query actually touches.
id  country  latency_ms  ts
1   US       42          1700000000
2   US       51          1700000001
3   DE       39          1700000002
4   US       44          1700000003
5   DE       60          1700000004

Row store on disk
one big strip: [row1][row2]…
cursor pays full row width: reads all 4 tiles × 5 rows = 160 B (uses 1)

Column store on disk
4 files, one per column
id.bin          1 2 3 4 5
country.bin     US US DE US DE
latency_ms.bin  42 51 39 44 60
ts.bin          1700000000 1700000001 1700000002 1700000003 1700000004
streams: opens 1 of 4 files = 40 B

SELECT avg(latency_ms), a narrow query: row store sweeps every cell; column store streams one strip (latency_ms.bin) end-to-end with …
page (row store): one file, every column on every row
row store reads all 4 tiles per row even though the query needs 1
column file: one .bin per column; only latency_ms.bin glows
Same five rows, two layouts. The row strip is the page on disk — every column of every row glued together. The four .bin strips are the column files — each one holds one column end-to-end. SELECT avg(latency_ms) lights latency_ms in both panels: the row store sweeps every tile anyway; the column store streams exactly one file.
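The byte counts in the diagram can be checked with a small sketch. It assumes every tile is 8 bytes wide (an int64, or a string padded to 8 bytes), which is what 160 B for 20 tiles implies; the helper names `build_row_page` and `build_column_files` are illustrative and not part of the lesson's code.

```python
import struct

# Rows copied from the diagram: (id, country, latency_ms, ts).
ROWS = [
    (1, "US", 42, 1700000000),
    (2, "US", 51, 1700000001),
    (3, "DE", 39, 1700000002),
    (4, "US", 44, 1700000003),
    (5, "DE", 60, 1700000004),
]

# Assumption: each tile is 8 bytes (int64, or a string padded to 8 bytes).
ROW_FORMAT = "<q8sqq"


def build_row_page(rows):
    """Row store: all four columns of row 1, then all four of row 2, ..."""
    return b"".join(
        struct.pack(ROW_FORMAT, rid, country.encode("ascii"), latency, ts)
        for rid, country, latency, ts in rows
    )


def build_column_files(rows):
    """Column store: one byte string per column, values in row order."""
    ids, countries, latencies, timestamps = zip(*rows)  # the 90° rotation
    n = len(rows)
    return {
        "id.bin": struct.pack(f"<{n}q", *ids),
        "country.bin": b"".join(
            struct.pack("<8s", c.encode("ascii")) for c in countries
        ),
        "latency_ms.bin": struct.pack(f"<{n}q", *latencies),
        "ts.bin": struct.pack(f"<{n}q", *timestamps),
    }


page = build_row_page(ROWS)
files = build_column_files(ROWS)

# SELECT avg(latency_ms): the row store sweeps the whole page,
# the column store streams a single file.
row_store_bytes_read = len(page)
column_store_bytes_read = len(files["latency_ms.bin"])

latencies = struct.unpack(f"<{len(ROWS)}q", files["latency_ms.bin"])
avg_latency = sum(latencies) / len(latencies)

print(row_store_bytes_read, column_store_bytes_read, avg_latency)
```

Both layouts hold the same total number of bytes; only the number of bytes the query has to touch differs.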
Implementation
RowStore.scan(query)
open the page, walk every row, skip past unwanted columns
def scan(query):
    page = openPage('events.page')
    out = []
    for row_offset in page.row_offsets:         # every row
        offset = row_offset                     # cursor inside this row
        row = []
        for col in SCHEMA:                      # all 4 columns, every time
            raw = page.read(offset, col.width)  # read even if the column is unwanted
            if col.name in query.columns:
                row.append(decode(raw, col.type))
            offset += col.width                 # advance to the next column
        out.append(row)
    return out
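To make the cost of that inner loop concrete, here is a self-contained sketch of the same walk over an in-memory page. The fixed 8-byte column widths, the `Column` tuple, and the `row_scan` function with its byte counter are assumptions made for illustration; the lesson's `openPage` and `SCHEMA` are not shown in the source.

```python
import struct
from collections import namedtuple

Column = namedtuple("Column", ["name", "fmt", "width"])

# Assumed schema: four fixed-width columns of 8 bytes each.
SCHEMA = [
    Column("id", "<q", 8),
    Column("country", "<8s", 8),
    Column("latency_ms", "<q", 8),
    Column("ts", "<q", 8),
]
ROW_WIDTH = sum(col.width for col in SCHEMA)

ROWS = [
    (1, "US", 42, 1700000000),
    (2, "US", 51, 1700000001),
    (3, "DE", 39, 1700000002),
    (4, "US", 44, 1700000003),
    (5, "DE", 60, 1700000004),
]


def decode(raw, fmt):
    """Decode one tile; padded strings lose their trailing NUL bytes."""
    (value,) = struct.unpack(fmt, raw)
    if isinstance(value, bytes):
        return value.rstrip(b"\x00").decode("ascii")
    return value


def row_scan(page, wanted):
    """Walk every row and every column; return (rows, bytes_read)."""
    out, bytes_read = [], 0
    for row_offset in range(0, len(page), ROW_WIDTH):
        offset = row_offset
        row = []
        for col in SCHEMA:
            raw = page[offset:offset + col.width]  # touched even if unwanted
            bytes_read += len(raw)
            if col.name in wanted:
                row.append(decode(raw, col.fmt))
            offset += col.width
        out.append(row)
    return out, bytes_read


page = b"".join(
    struct.pack("<q8sqq", rid, country.encode("ascii"), latency, ts)
    for rid, country, latency, ts in ROWS
)
rows, bytes_read = row_scan(page, {"latency_ms"})
print(rows, bytes_read)
```

Asking for one column or all four leaves `bytes_read` unchanged in this sketch, which is the point of the row-store panel.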
ColumnStore.scan(query)
open one .bin per column, stream end-to-end, then zip
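def scan(query):
    # openColumn is an assumed helper name, not Python's built-in open():
    # it must return a stream of decoded values that also supports readAt(k).
    # Only the requested columns are opened; the other .bin files are never touched.
    streams = [openColumn(f'{c}.bin') for c in query.columns]
    if len(streams) == 1:
        return list(streams[0])   # zero-waste single stream
    return zipByOrdinal(streams)  # tuple reconstruction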
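A runnable version of this scan might look like the sketch below. It assumes int64 values only (so `country.bin` is left out), writes the column files into a temporary directory, and records which files were opened; `write_column`, `read_column`, and `column_scan` are hypothetical names, not the lesson's API.

```python
import struct
import tempfile
from pathlib import Path

VALUE = struct.Struct("<q")  # assumption: every value is a little-endian int64

opened = []  # names of the files the scan actually touched


def write_column(path, values):
    """Write one column file: values back to back, in row order."""
    path.write_bytes(b"".join(VALUE.pack(v) for v in values))


def read_column(path):
    """Read one column file end to end and decode its values."""
    opened.append(path.name)
    return [v for (v,) in VALUE.iter_unpack(path.read_bytes())]


def column_scan(part_dir, columns):
    """Open only the requested columns; zip by position if more than one."""
    streams = [read_column(part_dir / f"{c}.bin") for c in columns]
    if len(streams) == 1:
        return streams[0]
    return list(zip(*streams))


with tempfile.TemporaryDirectory() as tmp:
    part = Path(tmp)
    write_column(part / "id.bin", [1, 2, 3, 4, 5])
    write_column(part / "latency_ms.bin", [42, 51, 39, 44, 60])
    write_column(part / "ts.bin", [1700000000 + i for i in range(5)])

    latencies = column_scan(part, ["latency_ms"])
    opened_for_avg = list(opened)
    pairs = column_scan(part, ["id", "latency_ms"])

print(latencies, opened_for_avg, pairs)
```

For the single-column query, `opened_for_avg` contains only `latency_ms.bin`; `id.bin` and `ts.bin` stay closed.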
ColumnStore.zipByOrdinal(streams)
tuple reconstruction — position k in each file is row k
def zipByOrdinal(streams):
    out = []
    for k in range(rowCountInPart()):
        # value at position k in each .bin belongs to row k.
        # no row_id stored; alignment is purely by index.
        row = tuple(stream.readAt(k) for stream in streams)
        out.append(row)
    return out
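The positional alignment can be sketched with fixed-width columns, where value k lives at byte offset k × width. The `FixedWidthColumn` class and its `read_at` method are hypothetical stand-ins for whatever stream type backs `readAt`; the length check is an extra safeguard added here, not something the source describes.

```python
import io
import struct


class FixedWidthColumn:
    """Hypothetical column stream: value k sits at byte offset k * width."""

    def __init__(self, raw, fmt):
        self._struct = struct.Struct(fmt)
        self._file = io.BytesIO(raw)
        self._count = len(raw) // self._struct.size

    def __len__(self):
        return self._count

    def read_at(self, k):
        # No row_id is stored; the position alone identifies the row.
        self._file.seek(k * self._struct.size)
        (value,) = self._struct.unpack(self._file.read(self._struct.size))
        return value


def zip_by_ordinal(streams):
    """Rebuild tuples: position k in every column belongs to row k."""
    row_count = len(streams[0])
    if any(len(s) != row_count for s in streams):
        raise ValueError("column files disagree on row count")
    return [tuple(s.read_at(k) for s in streams) for k in range(row_count)]


ids = FixedWidthColumn(struct.pack("<5q", 1, 2, 3, 4, 5), "<q")
latency = FixedWidthColumn(struct.pack("<5q", 42, 51, 39, 44, 60), "<q")

rows = zip_by_ordinal([ids, latency])
print(rows)
```

Because alignment is purely by index, every column file in a part has to hold the same number of values in the same row order; the sketch raises an error when the counts differ.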