Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 09 · Granules and the sparse primary index
Rows are grouped into 8192-row granules; one index entry per granule keeps the whole table's index in RAM, binary-searched in microseconds.
Previously

If rows inside a part are sorted, the engine doesn't need an index entry per row — it can keep one entry per group of 8192 rows and still find a range fast. That gives us an index small enough to keep entirely in memory.

Scene 09
Granules and the sparse primary index
Diagram
The horizontal strip is ONE part, sliced into **granules** — fixed-size blocks of 8,192 rows that are the unit the index points at. The dashed panel above is the **sparse primary index** (primary.idx) — one entry per granule, not per row; the whole thing is small enough to live in RAM (1B rows ≈ 1 MB index). A WHERE predicate triggers a binary search across the index entries; the surviving granule range lights green, and only those granules read bytes from `.bin`. The mark file at the bottom translates granule N into a byte offset.
QUERYWHERE service='api' AND ts > '2026-04-01'INDEX FOOTPRINT1.2 MB / 1B rowsprimary.idx — one entry per granuleONE PART · 120 granules · 8,192 rows each.mrk2 — granule N → (compressed block offset, uncompressed offset)rows examined: ~49,152 of 1B6 of 120 granules readPart: 1,083 granules · index_granularity = 8,192 rows per granule.
A WHERE service='api' AND ts > T query enters. The primary.idx (in RAM) is binary-searched in O(log G) ≈ 7 hops; granules 42..47 survive and the rest stay dim. Watch the four stages.
Implementation
Part.build_primary_index()
one entry per granule of N rows — that's the whole trick
1def build_primary_index(part):
2 g = cfg.index_granularity # rows per granule
3 primary_idx = []
4 for start in range(0, part.rows, g):
5 granule = part.rows_sorted[start : start + g]
6 primary_idx.append(granule[0].sort_key)
7 write_file('primary.idx', primary_idx)
8 # finer g → more entries, fatter index in RAM
9 # coarser g → fewer entries, but fatter granule reads
Planner.plan_with_index(query)
binary-search primary.idx for overlapping granules; load only those
1def plan_with_index(query):
2 lo, hi = bisect_range(
3 primary_idx, query.predicate_range,
4 )
5 surviving = list(range(lo, hi + 1))
6 for gid in surviving:
7 block = mark_file.locate(gid)
8 rows = decompress_and_read(block)
9 emit(vectorized_filter(rows, query))
10 # other granules: not a byte of .bin is read
index_size_math
why a billion-row primary.idx is ~1 MB and lives in RAM
1def primary_idx_bytes(part):
2 g = cfg.index_granularity
3 entries = part.rows // g # N_rows / granule_size
4 return entries * sizeof(sort_key_tuple)
5
6def rows_pulled_per_match(part):
7 # one surviving granule still loads g rows;
8 # vectorized filter discards the non-matches.
9 return cfg.index_granularity