Granules and the sparse primary index

Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)

Scene 09 · Granules and the sparse primary index

Rows are grouped into 8192-row granules; one index entry per granule keeps the whole table's index in RAM, binary-searched in microseconds.

Previously

If rows inside a part are sorted, the engine doesn't need an index entry per row — it can keep one entry per group of 8192 rows and still find a range fast. That gives us an index small enough to keep entirely in memory.

Scene 09

Granules and the sparse primary index

Watch

Diagram

The horizontal strip is ONE part, sliced into **granules** — fixed-size blocks of 8,192 rows that are the unit the index points at. The dashed panel above is the **sparse primary index** (primary.idx) — one entry per granule, not per row; the whole thing is small enough to live in RAM (1B rows ≈ 1 MB index). A WHERE predicate triggers a binary search across the index entries; the surviving granule range lights green, and only those granules read bytes from `.bin`. The mark file at the bottom translates granule N into a byte offset.

Sources

A WHERE service='api' AND ts > T query enters. The primary.idx (in RAM) is binary-searched in O(log G) ≈ 7 hops; granules 42..47 survive and the rest stay dim. Watch the four stages.

Implementation

Part.build_primary_index()

one entry per granule of N rows — that's the whole trick

1def build_primary_index(part):
2    g = cfg.index_granularity  # rows per granule
3    primary_idx = []
4    for start in range(0, part.rows, g):
5        granule = part.rows_sorted[start : start + g]
6        primary_idx.append(granule[0].sort_key)
7    write_file('primary.idx', primary_idx)
8    # finer g → more entries, fatter index in RAM
9    # coarser g → fewer entries, but fatter granule reads

Planner.plan_with_index(query)

binary-search primary.idx for overlapping granules; load only those

1def plan_with_index(query):
2    lo, hi = bisect_range(
3        primary_idx, query.predicate_range,
4    )
5    surviving = list(range(lo, hi + 1))
6    for gid in surviving:
7        block = mark_file.locate(gid)
8        rows  = decompress_and_read(block)
9        emit(vectorized_filter(rows, query))
10    # other granules: not a byte of .bin is read

index_size_math

why a billion-row primary.idx is ~1 MB and lives in RAM

1def primary_idx_bytes(part):
2    g = cfg.index_granularity
3    entries = part.rows // g            # N_rows / granule_size
4    return entries * sizeof(sort_key_tuple)
5 
6def rows_pulled_per_match(part):
7    # one surviving granule still loads g rows;
8    # vectorized filter discards the non-matches.
9    return cfg.index_granularity

PreviousORDER BY — filtering becomes range-scan NextSkip indexes — prove a granule has no match, then skip it