Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 10 · Skip indexes — prove a granule has no match, then skip it
Per-granule sketches (minmax / set / bloom) prove 'this granule cannot contain a match' and skip reading it — only useful when values are clustered.
Previously

The sparse index narrows queries by the sort-key prefix. For predicates on OTHER columns, a second kind of index proves 'this granule definitely doesn't contain X' so the engine can skip it without reading the data.

Scene 10
Skip indexes — prove a granule lacks X
Diagram
Each cell along the strip is a granule (the 8192-row block from scene 9). The badge row above the strip is the per-granule **skip index (data-skipping index)** — a tiny secondary sketch the engine probes BEFORE touching column data; if the badge proves 'definitely no match', the granule is greyed out and never read. The header shows the predicate, the chosen index type, and the data distribution. The selected index type here is **bloom filter (skip variant)** — a space-efficient probabilistic sketch that says 'definitely not in this granule' or 'maybe in this granule', never 'yes'. The GRANULES READ counter at the bottom is the load-bearing number: it falls only when granules can be proved absent, which depends on both the index type AND the data's clustering.
PREDICATEWHERE user_id = 12345INDEX TYPEbloomVALUE DISTclustered100 granules · grey = skipped · yellow = maybe · green = hitGRANULES READ5 / 100no: 95 · maybe: 4 · hit: 194% pruned — index earns its keepBloom + clustered values: ~94 granules grey out, 5 stay 'maybe', 1 has the row. 6/100.
skip index badge → proves 'definitely not here'
bloom filter — 'definitely not' or 'maybe', never 'yes'
100 granules, each topped with a bloom_filter skip index on user_id. The query WHERE user_id = 12345 probes every badge: ~94 say 'definitely no' and grey out, 5 stay 'maybe' (one true hit + a few false positives), 1 is the actual row. The GRANULES READ counter shows 6 instead of 100.
Implementation
Skip-index DDL
negative, secondary, per-column
1ALTER TABLE events
2 ADD INDEX user_id_bf user_id
3 TYPE bloom_filter(0.01)
4 GRANULARITY 4;
5-- 0.01 = target false-positive rate
6-- GRANULARITY 4 = one entry per 4 * 8192 = 32768 rows
Read-path pruning
probe the badge, skip the granule
1def scan(predicate, candidate_granules, skip_index):
2 for g in candidate_granules:
3 if skip_index is None or \
4 skip_index.maybe_contains(g, predicate):
5 read_columns(g, predicate) # 'maybe' → load + filter
6 else:
7 skip(g) # proven absent, no IO
BloomFilter.maybe_contains(value)
k hashes, one absent bit proves 'definitely not'
1def maybe_contains(value):
2 for h in hash_fns: # k independent hashes
3 if bits[h(value) % size] == 0:
4 return False # definitely-not — skip
5 return True # maybe — must read