Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 12 · Joins — denormalize or pay
Column-store joins are RAM-bound or shuffle-bound; the canonical OLAP answer is to denormalize at write time — storage is cheap, latency is the constraint.
Previously
Materialized views collapse one big scan into a tiny one. But the reader's other OLTP instinct — JOIN across normalized tables — hits a different kind of wall on a column store, and the answer feels wrong.
Scene 12
Joins: denormalize or pay
Diagram
Three panels for the same logical query `SELECT country, tier, count() FROM events`. **Star schema · hash join** — build a hash table on the users dim, probe per event row; memory tracks the right side. **Denormalized wide table** — collapse the JOIN at write time by baking dim columns (user_tier, country_code, signup_cohort, device) into every event row. **Dictionary · dictGet()** — load the small side into RAM as a key→value map and replace JOIN with a pointwise lookup. Each panel shows a query-latency meter plus the side cost it pays (RAM, disk multiplier, or refresh).
↑ hash table built on the RIGHT side (users) — memory tracks dim size
→ probe per event row — ~4 s wall-clock
The star-schema panel runs the OLTP-instinct query: hash table builds from the 1M-row users dim (200 MB in RAM), then events probes it row by row. Wall-clock pins near 4 s — the hash join is RAM-bound by the right side, not by how cleverly you wrote the SQL.
Implementation
hash_join_star_schema()
build hash table on RIGHT (users), probe per LEFT (event) row
1def hash_join_star_schema():2 # algorithm = 'hash' (ClickHouse default)3 ht = {} # hash table in RAM4 for u in scan(users): # build on RIGHT5 ht[u.id] = u.tier # ~1M rows -> ~200 MB6 out = Counter()7 for e in scan(events): # probe per LEFT row8 tier = ht.get(e.user_id) # one lookup per event9 out[(e.country, tier)] += 110 return out # ~4 s; RAM-bound on |users|
denormalized_query()
no hash table; only column reads off a wide fact table
1def denormalized_query():2 # tier was baked into the row at write time3 out = Counter()4 for e in scan(events_wide, # only the columns we need5 cols=['country', 'tier']):6 out[(e.country, e.tier)] += 17 return out # ~0.2 s; one wide scan89# update cost: a user upgrading tier => rewrite every10# event row that user ever produced (parts are immutable).
dictionary_lookup_query()
in-RAM dict pre-loaded once; dictGet per row replaces the JOIN
1def dictionary_lookup_query():2 # users_dict pre-loaded once into RAM (~30 MB)3 # algorithm = 'direct' (no hash-build at query time)4 out = Counter()5 for e in scan(events, cols=['country', 'user_id']):6 tier = dictGet('users_dict', 'tier', e.user_id)7 out[(e.country, tier)] += 18 return out # ~0.4 s910# refresh: LIFETIME(MIN 300 MAX 600) reloads from source11# every ~5-10 min — no event-row rewrites on tier change.