ORDER BY — filtering becomes range-scan — Build a columnar OLAP store (ClickHouse / Druid style)

Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)

Scene 08 · ORDER BY — filtering becomes range-scan

Rows inside a part are sorted by ORDER BY; WHERE on a leftmost-prefix column is a contiguous range, anything else is a full scan.

Previously

Merge keeps the part count bounded, but what a query does inside each part — scan everything, or jump to a range — depends on how the rows were sorted when the part was written.

Scene 08

ORDER BY: filtering becomes range-scan

Watch

Diagram

One part shown as a horizontal strip of 60 row tiles in physical on-disk order, colored by the leading sort-key column. A WHERE clause at the top of the strip either draws a contiguous green range bracket (the engine reads one slice) or scattered red arrows (the engine has to scan every tile). Below the strip, a badge reports the cardinality of the leading sort-key column — the lever that decides whether ORDER BY can prune at all.

Sources

Watch the part fill up with 60 row tiles in physical on-disk order — all the api rows first, then web, then cron, because the part was written with ORDER BY (service, ts). Once the strip lands, WHERE service='api' lights up the first 20 tiles as a contiguous range — the engine reads that slice and nothing else.

Implementation

Planner.plan_query(part, where)

walk ORDER BY left-to-right, match WHERE, pick range vs scan

1def plan_query(part, where):
2    sort_cols = part.order_by  # e.g. ('service', 'ts')
3    # Find the longest prefix of sort_cols that WHERE
4    # constrains with an equality / range predicate.
5    prefix = []
6    for col in sort_cols:
7        if where.constrains(col):
8            prefix.append(col)
9        else:
10            break  # gap kills everything to the right
11    if not prefix:
12        return FullPartScan()      # non-prefix WHERE
13    return RangeScanBySortKey(prefix, where)

RangeScanBySortKey.execute(part, where)

on (service, ts): seek to service='api', read until it flips

1def range_scan_by_sort_key(part, where):
2    # Binary-search the in-memory primary.idx to slice
3    # the granule range where the prefix could live.
4    lo = bisect_left(part.sort_columns, where.prefix_lo)
5    hi = bisect_right(part.sort_columns, where.prefix_hi)
6    # Seek to first row where service = 'api',
7    # then sequential read until service != 'api'.
8    for row in part.rows[lo:hi]:
9        if where.matches(row):
10            yield row

# Leftmost-prefix rule

the schema-design law the planner is enforcing

1# ORDER BY (A, B):
2#   WHERE A = ...           -> range scan (fast)
3#   WHERE A = .. AND B = .. -> range scan (fastest)
4#   WHERE B = ...           -> full scan (B's range
5#                              restarts inside every A)
6#
7# ORDER BY (user_id, ts) with user_id unique per row:
8#   primary.idx has one entry per granule, each a
9#   different user_id -> no granule can be skipped,
10#   every WHERE degenerates to a full part scan.
11#   This is the canonical ClickHouse anti-pattern.

PreviousMerge — and the 'too many parts' crash NextGranules and the sparse primary index