Joins — denormalize or pay — Build a columnar OLAP store (ClickHouse / Druid style)

Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)

Scene 12 · Joins — denormalize or pay

Column-store joins are RAM-bound or shuffle-bound; the canonical OLAP answer is to denormalize at write time — storage is cheap, latency is the constraint.

Previously

Materialized views collapse one big scan into a tiny one. But the reader's other OLTP instinct — JOIN across normalized tables — hits a different kind of wall on a column store, and the answer feels wrong.

Scene 12

Joins: denormalize or pay

Watch

Diagram

Three panels for the same logical query `SELECT country, tier, count() FROM events`. **Star schema · hash join** — build a hash table on the users dim, probe per event row; memory tracks the right side. **Denormalized wide table** — collapse the JOIN at write time by baking dim columns (user_tier, country_code, signup_cohort, device) into every event row. **Dictionary · dictGet()** — load the small side into RAM as a key→value map and replace JOIN with a pointwise lookup. Each panel shows a query-latency meter plus the side cost it pays (RAM, disk multiplier, or refresh).

Sources

↑ hash table built on the RIGHT side (users) — memory tracks dim size

→ probe per event row — ~4 s wall-clock

The star-schema panel runs the OLTP-instinct query: hash table builds from the 1M-row users dim (200 MB in RAM), then events probes it row by row. Wall-clock pins near 4 s — the hash join is RAM-bound by the right side, not by how cleverly you wrote the SQL.

Implementation

hash_join_star_schema()

build hash table on RIGHT (users), probe per LEFT (event) row

1def hash_join_star_schema():
2    # algorithm = 'hash' (ClickHouse default)
3    ht = {}                          # hash table in RAM
4    for u in scan(users):            # build on RIGHT
5        ht[u.id] = u.tier            # ~1M rows -> ~200 MB
6    out = Counter()
7    for e in scan(events):           # probe per LEFT row
8        tier = ht.get(e.user_id)     # one lookup per event
9        out[(e.country, tier)] += 1
10    return out                       # ~4 s; RAM-bound on |users|

denormalized_query()

no hash table; only column reads off a wide fact table

1def denormalized_query():
2    # tier was baked into the row at write time
3    out = Counter()
4    for e in scan(events_wide,       # only the columns we need
5                  cols=['country', 'tier']):
6        out[(e.country, e.tier)] += 1
7    return out                       # ~0.2 s; one wide scan
8 
9# update cost: a user upgrading tier => rewrite every
10# event row that user ever produced (parts are immutable).

dictionary_lookup_query()

in-RAM dict pre-loaded once; dictGet per row replaces the JOIN

1def dictionary_lookup_query():
2    # users_dict pre-loaded once into RAM (~30 MB)
3    # algorithm = 'direct' (no hash-build at query time)
4    out = Counter()
5    for e in scan(events, cols=['country', 'user_id']):
6        tier = dictGet('users_dict', 'tier', e.user_id)
7        out[(e.country, tier)] += 1
8    return out                       # ~0.4 s
9 
10# refresh: LIFETIME(MIN 300 MAX 600) reloads from source
11# every ~5-10 min — no event-row rewrites on tier change.

PreviousMaterialized views are INSERT triggers NextDesign canvas — pick a workload, ship a schema