Same table, two on-disk shapes

Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)

Scene 02 · Same table, two on-disk shapes

A row store lays out each row's columns contiguously; a column store keeps each column in its own file. Same rows, rotated 90°.

Previously

If both engines hold the same 1.2 billion rows, the only thing that can explain a 9000x gap is what each engine actually pulls off disk — so we need to look at the physical layout under the rows. Here's that physical layout: same rows, two arrangements.


Watch

Diagram

The same 5-row × 4-column events table laid out two ways. LEFT panel: a **page (row store)**, one on-disk file holding every column of row 1, then every column of row 2, and so on. RIGHT panel: four **column files** (id.bin / country.bin / latency_ms.bin / ts.bin), each holding only its column's values in the same row order. The bottom labels track bytes read on each side, and the highlight overlay shows which tiles the current query actually touches.
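The two panels can be sketched in a few lines of Python. This is only an illustration: the sample values are made up, and the real layouts are bytes on disk rather than Python lists. What it shows is that the column layout is the transpose of the row layout, with row order preserved.

```python
# Hypothetical sample data: 5 rows x 4 columns (id, country, latency_ms, ts).
COLUMNS = ("id", "country", "latency_ms", "ts")
rows = [
    (1, "US", 120, 1700000000),
    (2, "DE", 95, 1700000001),
    (3, "US", 240, 1700000002),
    (4, "JP", 60, 1700000003),
    (5, "DE", 85, 1700000004),
]

# Row layout: all values of row 1, then all values of row 2, ... (the "page").
row_layout = [value for row in rows for value in row]

# Column layout: one sequence per column, same row order (the ".bin" files).
# zip(*rows) transposes the table, which is the 90-degree rotation.
column_layout = {name: list(values) for name, values in zip(COLUMNS, zip(*rows))}

print(row_layout[:4])               # the first row, stored contiguously
print(column_layout["latency_ms"])  # one column, stored contiguously
```

Both structures hold the same 20 values; only the order in which they sit next to each other differs.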

Sources

page (row store): one file, every column of every row

row store: reads all 4 tiles per row even though the query needs only 1

column file: one .bin per column; only latency_ms.bin is highlighted

Same five rows, two layouts. The row strip is the page on disk — every column of every row glued together. The four .bin strips are the column files — each one holds one column end-to-end. SELECT avg(latency_ms) lights latency_ms in both panels: the row store sweeps every tile anyway; the column store streams exactly one file.
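The bytes-read labels can be reproduced with simple arithmetic. The sketch below assumes fixed column widths (8, 2, 4 and 8 bytes); those widths are illustrative assumptions, not values taken from the scene, and the model ignores page granularity and compression.

```python
# Assumed fixed column widths in bytes (hypothetical, for illustration only).
WIDTHS = {"id": 8, "country": 2, "latency_ms": 4, "ts": 8}
NUM_ROWS = 5


def row_store_bytes(num_rows, widths):
    # The row scan walks every column of every row, whatever the query asks for.
    return num_rows * sum(widths.values())


def column_store_bytes(num_rows, widths, query_columns):
    # The column scan opens only the files named by the query.
    return num_rows * sum(widths[c] for c in query_columns)


row_bytes = row_store_bytes(NUM_ROWS, WIDTHS)                     # 5 * 22
col_bytes = column_store_bytes(NUM_ROWS, WIDTHS, ["latency_ms"])  # 5 * 4
print(row_bytes, col_bytes, row_bytes / col_bytes)  # 110 20 5.5
```

Under these assumptions the ratio depends only on how much of each row the query needs, so it grows as the table gets wider while the query still touches one column.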

Implementation

RowStore.scan(query)

open the page, walk every row, skip past unwanted columns

def scan(query):
    page = openPage('events.page')
    out = []
    for row_offset in page.row_offsets:  # every row
        offset = row_offset  # cursor within this row
        row = []
        for col in SCHEMA:  # all 4 columns, every time
            raw = page.read(offset, col.width)  # read even if the column is unwanted
            if col.name in query.columns:
                row.append(decode(raw, col.type))
            offset += col.width  # advance past this column, wanted or not
        out.append(row)
    return out
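For readers who want to run the row scan, here is a minimal self-contained sketch using Python's `struct` module on an in-memory page. The schema formats, the `write_page` helper and the `bytes_read` counter are assumptions added for illustration; they are not part of the scene's code.

```python
import struct

# Assumed schema: (column name, struct format). "<" means little-endian, no padding.
SCHEMA = [("id", "<q"), ("country", "<2s"), ("latency_ms", "<i"), ("ts", "<q")]
ROW_WIDTH = sum(struct.calcsize(fmt) for _, fmt in SCHEMA)  # 8 + 2 + 4 + 8 = 22


def write_page(rows):
    """Hypothetical helper: pack rows back to back, columns interleaved."""
    page = bytearray()
    for row in rows:
        for (_, fmt), value in zip(SCHEMA, row):
            page += struct.pack(fmt, value)
    return bytes(page)


def scan(page, query_columns):
    out, bytes_read = [], 0
    for row_offset in range(0, len(page), ROW_WIDTH):  # every row
        offset, row = row_offset, []
        for name, fmt in SCHEMA:  # every column, every time
            width = struct.calcsize(fmt)
            raw = page[offset:offset + width]
            bytes_read += width  # counted whether or not the column is wanted
            if name in query_columns:
                row.append(struct.unpack(fmt, raw)[0])
            offset += width
        out.append(tuple(row))
    return out, bytes_read


rows = [(1, b"US", 120, 1700000000), (2, b"DE", 95, 1700000001), (3, b"JP", 60, 1700000002)]
page = write_page(rows)
values, bytes_read = scan(page, {"latency_ms"})
print(values, bytes_read)  # [(120,), (95,), (60,)] 66
```

The query needs 12 bytes of latency values, yet the scan touches all 66 bytes of the page because the wanted column is scattered between the others.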

ColumnStore.scan(query)

open one .bin per column, stream end-to-end, then zip

def scan(query):
    # openColumn is an assumed helper (like openPage): it yields decoded values in row order
    streams = [openColumn(f'{c}.bin') for c in query.columns]
    if len(streams) == 1:
        return list(streams[0])  # zero-waste single stream
    return zipByOrdinal(streams)  # tuple reconstruction

ColumnStore.zipByOrdinal(streams)

tuple reconstruction — position k in each file is row k

def zipByOrdinal(streams):
    out = []
    for k in range(rowCountInPart()):
        # value at position k in each .bin belongs to row k.
        # no row_id stored; alignment is purely by index.
        row = tuple(stream.readAt(k) for stream in streams)
        out.append(row)
    return out
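Tuple reconstruction by ordinal can be tried out with in-memory "files". The sketch below assumes fixed-width values, so position k sits at byte offset k × width; the `FORMATS` table and the helpers `write_columns` and `read_at` are hypothetical names introduced here, standing in for the scene's `readAt`.

```python
import struct

# Assumed fixed-width formats per column (little-endian, no padding).
FORMATS = {"id": "<q", "latency_ms": "<i", "ts": "<q"}


def write_columns(rows, names):
    """Hypothetical helper: one byte buffer per column, values in row order."""
    files = {name: bytearray() for name in names}
    for row in rows:
        for name, value in zip(names, row):
            files[name] += struct.pack(FORMATS[name], value)
    return {name: bytes(buf) for name, buf in files.items()}


def read_at(files, name, k):
    # With fixed widths, the k-th value starts at byte offset k * width.
    fmt = FORMATS[name]
    return struct.unpack_from(fmt, files[name], k * struct.calcsize(fmt))[0]


def zip_by_ordinal(files, query_columns, row_count):
    # No row id is stored anywhere; index k alone ties the values together.
    return [
        tuple(read_at(files, name, k) for name in query_columns)
        for k in range(row_count)
    ]


names = ("id", "latency_ms", "ts")
rows = [(1, 120, 1700000000), (2, 95, 1700000001), (3, 60, 1700000002)]
files = write_columns(rows, names)
result = zip_by_ordinal(files, ["id", "latency_ms"], len(rows))
print(result)  # [(1, 120), (2, 95), (3, 60)]
```

The `ts` buffer is never read for this query. The alignment only holds if every column file is written in the same row order, which is why the sketch writes all columns from a single pass over the rows.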
