Skip indexes — prove a granule has no match, then skip it — Build a columnar OLAP store (ClickHouse / Druid style)

Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)

Scene 10 · Skip indexes — prove a granule has no match, then skip it

Per-granule sketches (minmax / set / bloom) prove 'this granule cannot contain a match' and skip reading it — only useful when values are clustered.

Previously

The sparse index narrows queries by the sort-key prefix. For predicates on OTHER columns, a second kind of index proves 'this granule definitely doesn't contain X' so the engine can skip it without reading the data.

Scene 10

Skip indexes — prove a granule lacks X

Watch

Diagram

Each cell along the strip is a granule (the 8192-row block from scene 9). The badge row above the strip is the per-granule **skip index (data-skipping index)** — a tiny secondary sketch the engine probes BEFORE touching column data; if the badge proves 'definitely no match', the granule is greyed out and never read. The header shows the predicate, the chosen index type, and the data distribution. The selected index type here is **bloom filter (skip variant)** — a space-efficient probabilistic sketch that says 'definitely not in this granule' or 'maybe in this granule', never 'yes'. The GRANULES READ counter at the bottom is the load-bearing number: it falls only when granules can be proved absent, which depends on both the index type AND the data's clustering.

Sources

skip index badge → proves 'definitely not here'

bloom filter — 'definitely not' or 'maybe', never 'yes'

100 granules, each topped with a bloom_filter skip index on user_id. The query WHERE user_id = 12345 probes every badge: ~94 say 'definitely no' and grey out, 5 stay 'maybe' (one true hit + a few false positives), 1 is the actual row. The GRANULES READ counter shows 6 instead of 100.

Implementation

Skip-index DDL

negative, secondary, per-column

1ALTER TABLE events
2  ADD INDEX user_id_bf user_id
3  TYPE bloom_filter(0.01)
4  GRANULARITY 4;
5-- 0.01 = target false-positive rate
6-- GRANULARITY 4 = one entry per 4 * 8192 = 32768 rows

Read-path pruning

probe the badge, skip the granule

1def scan(predicate, candidate_granules, skip_index):
2    for g in candidate_granules:
3        if skip_index is None or \
4           skip_index.maybe_contains(g, predicate):
5            read_columns(g, predicate)   # 'maybe' → load + filter
6        else:
7            skip(g)                      # proven absent, no IO

BloomFilter.maybe_contains(value)

k hashes, one absent bit proves 'definitely not'

1def maybe_contains(value):
2    for h in hash_fns:                  # k independent hashes
3        if bits[h(value) % size] == 0:
4            return False                # definitely-not — skip
5    return True                         # maybe — must read

PreviousGranules and the sparse primary index NextMaterialized views are INSERT triggers