Vectorized execution — process batches, not tuples — Build a columnar OLAP store (ClickHouse / Druid style)

Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)

Scene 04 · Vectorized execution — process batches, not tuples

Tuple-at-a-time Volcano is interpreter overhead; processing 1024–8192 column values per call lets the CPU emit SIMD inner loops.

Previously

The bytes on disk are now tiny. But pulling tiny bytes through the CPU one row at a time still leaves a 50x performance win on the table — it depends on how the executor walks those bytes.

Scene 04

Vectorized execution: process batches, not tuples

Watch

Diagram

Two stacked execution pipelines run the same query (Scan → Filter → Aggregate) on the same 1M-row column. The TOP panel walks rows one tuple at a time — the classic Volcano iterator model: every operator hop is a virtual `next()` call, and the CPU's SIMD lanes mostly sit dark. The BOTTOM panel processes a whole **batch / Block** of 8192 column values per call — dispatch cost is paid once per batch, and the inner loop autovectorizes to SIMD. The shared race bar at the bottom shows how many rows each side has finished against the 1M-row target. Above the bottom panel, a warning badge appears when a per-row UDF poisons the vector kernel.

Sources

Same query on both panels: scan a column, filter, aggregate over 1M rows. The top panel walks one tuple at a time; the bottom panel walks 8192-row batches. Watch the race bar — and the virtual-call counter on each side.

Implementation

Volcano.executeQuery()

tuple-at-a-time: one virtual next() per row per operator

1def execute(plan):
2    op = plan.root          # Aggregate -> Filter -> Scan
3    while True:
4        row = op.next()     # virtual dispatch, every row
5        if row is EOF:
6            break
7        # 3M virtual calls for 1M rows (3 operators deep)
8        # SIMD lanes idle — no contiguous array to vectorize
9        accumulate(row)

Vectorized.executeQuery()

batch-at-a-time: one dispatch per 8192 values

1def execute(plan):
2    op = plan.root          # operators consume + produce Blocks
3    while True:
4        block = op.nextBlock()   # 8192 column values per call
5        if block is EOF:
6            break
7        # one virtual call per BATCH, not per row
8        # block.col is a contiguous array -> SIMD kernel
9        accumulateBlock(block)

Filter.apply(block)

the SIMD kernel — or the per-row escape hatch

1def apply(block):
2    if predicate.isNativeKernel():
3        # tight loop -> AVX2 (8 int32) / AVX-512 (16)
4        for i in range(block.n):
5            out[i] = block.col[i] > threshold
6        return block.select(out)
7    # opaque UDF: planner can't prove what it does
8    for i in range(block.n):
9        out[i] = python_udf(block.col[i])  # FFI per row
10    return block.select(out)

PreviousCompression — the column store's superpower NextWrites must be bulk, not per-row