Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 04 · Vectorized execution — process batches, not tuples
Tuple-at-a-time Volcano is interpreter overhead; processing 1024–8192 column values per call lets the CPU emit SIMD inner loops.
Previously
The bytes on disk are now tiny. But pulling tiny bytes through the CPU one row at a time still leaves a 50x performance win on the table — it depends on how the executor walks those bytes.
Scene 04
Vectorized execution: process batches, not tuples
Diagram
Two stacked execution pipelines run the same query (Scan → Filter → Aggregate) on the same 1M-row column. The TOP panel walks rows one tuple at a time — the classic Volcano iterator model: every operator hop is a virtual `next()` call, and the CPU's SIMD lanes mostly sit dark. The BOTTOM panel processes a whole **batch / Block** of 8192 column values per call — dispatch cost is paid once per batch, and the inner loop autovectorizes to SIMD. The shared race bar at the bottom shows how many rows each side has finished against the 1M-row target. Above the bottom panel, a warning badge appears when a per-row UDF poisons the vector kernel.
Same query on both panels: scan a column, filter, aggregate over 1M rows. The top panel walks one tuple at a time; the bottom panel walks 8192-row batches. Watch the race bar — and the virtual-call counter on each side.
Implementation
Volcano.executeQuery()
tuple-at-a-time: one virtual next() per row per operator
1def execute(plan):2 op = plan.root # Aggregate -> Filter -> Scan3 while True:4 row = op.next() # virtual dispatch, every row5 if row is EOF:6 break7 # 3M virtual calls for 1M rows (3 operators deep)8 # SIMD lanes idle — no contiguous array to vectorize9 accumulate(row)
Vectorized.executeQuery()
batch-at-a-time: one dispatch per 8192 values
1def execute(plan):2 op = plan.root # operators consume + produce Blocks3 while True:4 block = op.nextBlock() # 8192 column values per call5 if block is EOF:6 break7 # one virtual call per BATCH, not per row8 # block.col is a contiguous array -> SIMD kernel9 accumulateBlock(block)
Filter.apply(block)
the SIMD kernel — or the per-row escape hatch
1def apply(block):2 if predicate.isNativeKernel():3 # tight loop -> AVX2 (8 int32) / AVX-512 (16)4 for i in range(block.n):5 out[i] = block.col[i] > threshold6 return block.select(out)7 # opaque UDF: planner can't prove what it does8 for i in range(block.n):9 out[i] = python_udf(block.col[i]) # FFI per row10 return block.select(out)