Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 02 · Same table, two on-disk shapes
A row store writes each row's columns contiguously in one file; a column store writes each column to its own file. Same rows, rotated 90°.
Previously
If both engines hold the same 1.2 billion rows, the only thing that can explain a 9000x gap is what each engine actually pulls off disk — so we need to look at the physical layout under the rows. Here's that physical layout: same rows, two arrangements.
Diagram
The same 5-row × 4-column events table laid out two ways. LEFT panel: a **page (row store)**, one on-disk file holding every column of row 1, then every column of row 2, and so on. RIGHT panel: four **column files** (id.bin / country.bin / latency_ms.bin / ts.bin), each holding only its column's values in the same row order. The bottom labels track bytes read on each side, and the highlight overlay shows which tiles the current query actually touches.
page (row store): one file, every column on every row →
↑ row store reads all 4 tiles per row even though the query needs 1
column file: one .bin per column — only latency_ms.bin glows →
Same five rows, two layouts. The row strip is the page on disk — every column of every row glued together. The four .bin strips are the column files — each one holds one column end-to-end. SELECT avg(latency_ms) lights latency_ms in both panels: the row store sweeps every tile anyway; the column store streams exactly one file.
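To make the two shapes concrete, here is a minimal sketch that serializes the same five rows both ways and compares sizes. The fixed-width encodings (4-byte id, 2-byte country code, 4-byte latency_ms, 8-byte ts) and the sample values are assumptions for illustration only; the scene names the columns but not their types.

```python
import struct

# Assumed fixed-width encodings, little-endian with no padding.
SCHEMA = [("id", "<I"), ("country", "<2s"), ("latency_ms", "<I"), ("ts", "<Q")]

# Made-up sample rows in (id, country, latency_ms, ts) order.
ROWS = [
    (1, b"US", 120, 1700000000),
    (2, b"DE", 95, 1700000001),
    (3, b"US", 240, 1700000002),
    (4, b"JP", 60, 1700000003),
    (5, b"DE", 185, 1700000004),
]


def to_row_page(rows):
    """Row layout: every column of row 1, then every column of row 2, ..."""
    return b"".join(
        struct.pack(fmt, value)
        for row in rows
        for (_, fmt), value in zip(SCHEMA, row)
    )


def to_column_files(rows):
    """Column layout: one byte string per column, values kept in row order."""
    return {
        name: b"".join(struct.pack(fmt, row[i]) for row in rows)
        for i, (name, fmt) in enumerate(SCHEMA)
    }


page = to_row_page(ROWS)
columns = to_column_files(ROWS)

print(len(page))                   # total bytes in the single row page
print(len(columns["latency_ms"]))  # bytes in latency_ms.bin alone
```

Under these assumed widths both layouts hold the same 90 bytes in total, but a query on latency_ms alone needs only the 20 bytes of that one column file, while the row page has no way to offer fewer than all 90.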
Implementation
RowStore.scan(query)
open the page, walk every row, skip past unwanted columns
def scan(query):
    page = openPage('events.page')
    out = []
    for row_offset in page.row_offsets:          # every row
        offset = row_offset                      # cursor within this row
        row = []
        for col in SCHEMA:                       # all 4 columns, every time
            raw = page.read(offset, col.width)   # bytes are read even if unwanted
            if col.name in query.columns:
                row.append(decode(raw, col.type))
            offset += col.width                  # advance past this column, wanted or not
        out.append(row)
    return out
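The scan above leans on helpers the scene does not define (openPage, SCHEMA, decode). As a rough, runnable approximation, the sketch below swaps them for an in-memory byte string and Python's struct module, and counts the bytes the scan has to walk over. The column widths and sample rows are assumptions, not the tutorial's actual format.

```python
import struct

# Assumed fixed-width layout; the scene does not give column types.
SCHEMA = [("id", "<I"), ("country", "<2s"), ("latency_ms", "<I"), ("ts", "<Q")]
ROW_WIDTH = sum(struct.calcsize(fmt) for _, fmt in SCHEMA)


def build_page(rows):
    """Pack rows back to back, columns interleaved within each row."""
    return b"".join(
        struct.pack(fmt, value)
        for row in rows
        for (_, fmt), value in zip(SCHEMA, row)
    )


def row_scan(page, wanted):
    """Return (rows, bytes_touched) for the wanted column names."""
    out, touched = [], 0
    for row_start in range(0, len(page), ROW_WIDTH):
        offset, row = row_start, []
        for name, fmt in SCHEMA:
            width = struct.calcsize(fmt)
            touched += width  # these bytes are on the page whether we decode them or not
            if name in wanted:
                row.append(struct.unpack_from(fmt, page, offset)[0])
            offset += width   # step past this column, wanted or not
        out.append(tuple(row))
    return out, touched


page = build_page([
    (1, b"US", 120, 1700000000),
    (2, b"DE", 95, 1700000001),
    (3, b"US", 240, 1700000002),
])
rows, touched = row_scan(page, {"latency_ms"})
print(rows)                # [(120,), (95,), (240,)]
print(touched, len(page))  # 54 54
```

Only 4 of every 18 bytes are decoded, yet the scan still covers the whole page, which is the waste the left panel highlights.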
ColumnStore.scan(query)
open one .bin per column, stream end-to-end, then zip
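def scan(query):
    # open() stands in for the store's column reader here (an assumption):
    # iterating a stream is taken to yield that column's decoded values in row order.
    streams = [open(f'{c}.bin') for c in query.columns]
    if len(streams) == 1:
        return list(streams[0])      # zero-waste single stream
    return zipByOrdinal(streams)     # tuple reconstruction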
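A runnable approximation of this scan, assuming fixed-width binary columns written to a temporary directory. The FORMATS table, the helper names and the sample values are hypothetical; the point is only that the scan opens the files it was asked for and nothing else.

```python
import struct
import tempfile
from pathlib import Path

# Assumed per-column encodings (not specified in the scene).
FORMATS = {"id": "<I", "country": "<2s", "latency_ms": "<I", "ts": "<Q"}

TABLE = {
    "id": [1, 2, 3, 4, 5],
    "country": [b"US", b"DE", b"US", b"JP", b"DE"],
    "latency_ms": [120, 95, 240, 60, 185],
    "ts": [1700000000, 1700000001, 1700000002, 1700000003, 1700000004],
}


def write_columns(directory, table):
    """Write one <name>.bin per column, values in row order."""
    for name, values in table.items():
        data = b"".join(struct.pack(FORMATS[name], v) for v in values)
        (Path(directory) / f"{name}.bin").write_bytes(data)


def read_column(directory, name):
    """Read a whole column file; return (decoded values, bytes read)."""
    data = (Path(directory) / f"{name}.bin").read_bytes()
    values = [v for (v,) in struct.iter_unpack(FORMATS[name], data)]
    return values, len(data)


def column_scan(directory, wanted):
    """Open only the requested column files; zip them if there are several."""
    columns, bytes_read = [], 0
    for name in wanted:
        values, size = read_column(directory, name)
        columns.append(values)
        bytes_read += size
    if len(columns) == 1:
        return columns[0], bytes_read        # single stream, nothing to reassemble
    return list(zip(*columns)), bytes_read   # rebuild tuples by position


with tempfile.TemporaryDirectory() as d:
    write_columns(d, TABLE)
    latencies, bytes_read = column_scan(d, ["latency_ms"])
    pairs, pair_bytes = column_scan(d, ["id", "latency_ms"])

print(latencies, bytes_read)  # [120, 95, 240, 60, 185] 20
print(pairs[0], pair_bytes)   # (1, 120) 40
```

With these assumed widths the four files total 90 bytes; the single-column query reads 20 of them, and adding a second column adds only that column's bytes.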
ColumnStore.zipByOrdinal(streams)
tuple reconstruction — position k in each file is row k
def zipByOrdinal(streams):
    out = []
    for k in range(rowCountInPart()):
        # value at position k in each .bin belongs to row k.
        # no row_id stored; alignment is purely by index.
        row = tuple(stream.readAt(k) for stream in streams)
        out.append(row)
    return out
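One way readAt(k) could work, assuming every value in a column has the same width: value k sits at byte offset k × width, so no row id needs to be stored. The sketch below uses in-memory byte strings in place of .bin files; read_at and zip_by_ordinal are illustrative names, and the row-count check is an added safeguard rather than something the scene describes.

```python
import struct


def read_at(column_bytes, fmt, k):
    """Value k of a fixed-width column lives at byte offset k * width."""
    return struct.unpack_from(fmt, column_bytes, k * struct.calcsize(fmt))[0]


def zip_by_ordinal(columns):
    """columns: list of (bytes, struct format). Rebuild tuples by shared position."""
    counts = {len(data) // struct.calcsize(fmt) for data, fmt in columns}
    if len(counts) != 1:
        # Positional alignment only works if every column has the same row count.
        raise ValueError("column files disagree on row count")
    (row_count,) = counts
    return [
        tuple(read_at(data, fmt, k) for data, fmt in columns)
        for k in range(row_count)
    ]


# Sample data (made up): three rows of id and latency_ms.
ids = b"".join(struct.pack("<I", v) for v in (1, 2, 3))
latencies = b"".join(struct.pack("<I", v) for v in (120, 95, 240))

print(read_at(latencies, "<I", 2))  # 240
print(zip_by_ordinal([(ids, "<I"), (latencies, "<I")]))
# [(1, 120), (2, 95), (3, 240)]
```

The design trade-off is that alignment by index is free to store but fragile: if one column file gains or loses a value, every later row is silently mismatched, which is why the sketch refuses to zip columns of different lengths.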