Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 04 · Vectorized execution — process batches, not tuples
Tuple-at-a-time Volcano is interpreter overhead; processing 1024–8192 column values per call lets the CPU emit SIMD inner loops.
Previously

The bytes on disk are now tiny. But pulling tiny bytes through the CPU one row at a time still leaves a 50x performance win on the table — it depends on how the executor walks those bytes.

Scene 04
Vectorized execution: process batches, not tuples
Diagram
Two stacked execution pipelines run the same query (Scan → Filter → Aggregate) on the same 1M-row column. The TOP panel walks rows one tuple at a time — the classic Volcano iterator model: every operator hop is a virtual `next()` call, and the CPU's SIMD lanes mostly sit dark. The BOTTOM panel processes a whole **batch / Block** of 8192 column values per call — dispatch cost is paid once per batch, and the inner loop autovectorizes to SIMD. The shared race bar at the bottom shows how many rows each side has finished against the 1M-row target. Above the bottom panel, a warning badge appears when a per-row UDF poisons the vector kernel.
Volcano (tuple-at-a-time)ScanoperatorFilteroperatorAggregateoperatornext()next()CPU SIMD1/8 litvirtual calls: 00.1M rows/secVectorized (batch = 8192)ScanoperatorFilteroperatorAggregateoperator8192-vec8192-vecCPU SIMD8/8 litvirtual calls: 31.0M rows/secROWS PROCESSED · target 1,000,000Volcano0%Vectorized0%8192 column values per call — one dispatch, one SIMD inner loop, ~10x throughput.
Same query on both panels: scan a column, filter, aggregate over 1M rows. The top panel walks one tuple at a time; the bottom panel walks 8192-row batches. Watch the race bar — and the virtual-call counter on each side.
Implementation
Volcano.executeQuery()
tuple-at-a-time: one virtual next() per row per operator
1def execute(plan):
2 op = plan.root # Aggregate -> Filter -> Scan
3 while True:
4 row = op.next() # virtual dispatch, every row
5 if row is EOF:
6 break
7 # 3M virtual calls for 1M rows (3 operators deep)
8 # SIMD lanes idle — no contiguous array to vectorize
9 accumulate(row)
Vectorized.executeQuery()
batch-at-a-time: one dispatch per 8192 values
1def execute(plan):
2 op = plan.root # operators consume + produce Blocks
3 while True:
4 block = op.nextBlock() # 8192 column values per call
5 if block is EOF:
6 break
7 # one virtual call per BATCH, not per row
8 # block.col is a contiguous array -> SIMD kernel
9 accumulateBlock(block)
Filter.apply(block)
the SIMD kernel — or the per-row escape hatch
1def apply(block):
2 if predicate.isNativeKernel():
3 # tight loop -> AVX2 (8 int32) / AVX-512 (16)
4 for i in range(block.n):
5 out[i] = block.col[i] > threshold
6 return block.select(out)
7 # opaque UDF: planner can't prove what it does
8 for i in range(block.n):
9 out[i] = python_udf(block.col[i]) # FFI per row
10 return block.select(out)