A part — one batch, frozen on disk — Build a columnar OLAP store (ClickHouse / Druid style)

Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)

Scene 06 · A part — one batch, frozen on disk

Each batched write lands as an immutable directory of column files plus an index — a part. A table is a stack of parts.

Previously

If every batch becomes its own self-contained unit on disk, that unit needs a name and a shape, because everything operational from here on is about how many of these units a query has to touch.

Diagram

Left panel: a TABLE container labelled 'events' filled with a vertical stack of part tiles, each carrying its name, timestamp range, row count, level, and an 'immutable' badge. Right panel: the PART DIRECTORY of the currently expanded part — one .bin file per column (ts, country, latency, user_id), the matching .mrk2 marks, primary.idx (sparse index), and metadata (columns.txt, checksums.txt, count.txt). A query badge above the stack appears when there are many parts: 'each query asks every part'. An UPDATE shakes the targeted part and drops a new fresh part on top; the original is untouched.

Write one batch. It lands as a single self-contained part directory — one .bin per column, a primary.idx, and metadata. Write a second batch and a second part appears. The table is now exactly two parts; neither will ever be mutated in place.
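The part tiles in the diagram carry a name and a level. In ClickHouse, a name such as all_11_11_0 reads as partition ID, minimum block number, maximum block number, and merge level. The sketch below splits a name into those four fields. `PartName` and `parse_part_name` are hypothetical helpers introduced for illustration, and the optional mutation-version suffix that real part names can carry is deliberately ignored.

```python
from typing import NamedTuple


class PartName(NamedTuple):
    partition_id: str
    min_block: int
    max_block: int
    level: int  # 0 means "freshly written, never merged"


def parse_part_name(name: str) -> PartName:
    # Split from the right so the partition ID is whatever precedes
    # the last three underscore-separated numeric fields.
    partition_id, min_block, max_block, level = name.rsplit("_", 3)
    return PartName(partition_id, int(min_block), int(max_block), int(level))


fresh = parse_part_name("all_11_11_0")
print(fresh)
```

A freshly written part covers a single block (min equals max) at level 0, which is why every new batch shows up as its own tile.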

Implementation

MergeTree.write_batch_to_part

one batch in → one self-contained part directory out

def write_batch_to_part(rows, columns, sort_key, part_id):
    # table_path: root directory of the table, assumed to be in scope
    rows = sort(rows, by=sort_key)
    part_dir = mkdir(f'{table_path}/{part_id}')
    for col in columns:                  # one .bin per column
        values = project(rows, col)
        write_compressed(f'{part_dir}/{col}.bin', values)
        write_marks(f'{part_dir}/{col}.mrk2', values)
    # sparse index: one entry per granule, not one per row
    write_primary_idx(f'{part_dir}/primary.idx',
                      first_of_each_granule(rows, sort_key))
    write_text(f'{part_dir}/columns.txt', columns)
    write_text(f'{part_dir}/count.txt', len(rows))
    # checksums go last so they cover every other file in the part
    write_text(f'{part_dir}/checksums.txt', checksums(part_dir))
    fsync_dir(part_dir)                  # part is now immutable
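The pseudocode above can be made concrete with the standard library alone. The following is a minimal sketch, not ClickHouse's on-disk format: JSON plus zlib stands in for the real column encoding, marks are stored as plain byte offsets, `GRANULE_ROWS` is an arbitrary tiny value chosen for the demo, and `write_part` is a hypothetical name. The directory fsync from the pseudocode is omitted for portability.

```python
import hashlib
import json
import struct
import tempfile
import zlib
from pathlib import Path

GRANULE_ROWS = 2  # assumption: tiny granules so the demo is easy to inspect


def write_part(table_path, part_id, rows, columns, sort_key):
    rows = sorted(rows, key=lambda r: r[sort_key])
    part_dir = Path(table_path) / part_id
    part_dir.mkdir(parents=True)
    for col in columns:
        marks = []  # byte offset of each compressed granule in the .bin
        with open(part_dir / f"{col}.bin", "wb") as f:
            for start in range(0, len(rows), GRANULE_ROWS):
                marks.append(f.tell())
                granule = [r[col] for r in rows[start:start + GRANULE_ROWS]]
                f.write(zlib.compress(json.dumps(granule).encode()))
        packed = struct.pack(f"<{len(marks)}Q", *marks)
        (part_dir / f"{col}.mrk2").write_bytes(packed)
    # Sparse index: the sort-key value of the first row in each granule.
    first_keys = [rows[i][sort_key] for i in range(0, len(rows), GRANULE_ROWS)]
    (part_dir / "primary.idx").write_text(json.dumps(first_keys))
    (part_dir / "columns.txt").write_text("\n".join(columns))
    (part_dir / "count.txt").write_text(str(len(rows)))
    # Checksums are written last so they cover every other file.
    sums = {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(part_dir.iterdir())}
    (part_dir / "checksums.txt").write_text(json.dumps(sums, indent=2))
    return part_dir


rows = [
    {"ts": 30, "country": "DE", "latency": 12},
    {"ts": 10, "country": "US", "latency": 40},
    {"ts": 20, "country": "FR", "latency": 25},
]
table = Path(tempfile.mkdtemp()) / "events"
part = write_part(table, "all_1_1_0", rows, ["ts", "country", "latency"], "ts")
print(sorted(p.name for p in part.iterdir()))
```

Calling `write_part` a second time with a different `part_id` adds a sibling directory and leaves the first one byte-for-byte unchanged, which is the property the rest of the design relies on.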

MergeTree.query_table

a table is a bag of parts — fan out, then merge results

def query_table(table, predicate, projection):
    parts = list_parts(table)            # every part on disk
    streams = []
    for part in parts:                   # ask every part
        if part.minmax_excludes(predicate):
            continue                     # pruned by metadata
        granules = scan_primary_idx(part, predicate)
        for col in projection:
            streams.append(read_bin(part, col, granules))
    return merge_streams(streams)
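To see the two levels of skipping in the fan-out above, here is a small in-memory sketch. It assumes a range predicate on a sort key named `ts`; the `Part` class, `candidate_granules`, and `query` are hypothetical names, and parts are held in memory rather than read from .bin files. Min/max pruning drops whole parts, and a binary search over the sparse index narrows the surviving parts to the granules that could contain matching rows.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str
    primary_idx: tuple  # first ts value of each granule
    granules: tuple     # granules[i][column] -> tuple of values, sorted by ts

    @property
    def min_ts(self):
        return self.granules[0]["ts"][0]

    @property
    def max_ts(self):
        return self.granules[-1]["ts"][-1]


def candidate_granules(part, lo, hi):
    # The granule just before the first index entry >= lo may still hold lo.
    first = max(bisect_left(part.primary_idx, lo) - 1, 0)
    # Granules whose first key is > hi cannot match.
    last = bisect_right(part.primary_idx, hi)
    return range(first, last)


def query(parts, lo, hi, projection):
    rows, scanned = [], []
    for part in parts:                      # ask every part
        if part.max_ts < lo or part.min_ts > hi:
            continue                        # pruned by min/max metadata
        scanned.append(part.name)
        for g in candidate_granules(part, lo, hi):
            granule = part.granules[g]
            for i, ts in enumerate(granule["ts"]):
                if lo <= ts <= hi:
                    rows.append(tuple(granule[c][i] for c in projection))
    return rows, scanned


p1 = Part("all_1_1_0", (10, 30),
          ({"ts": (10, 20), "latency": (40, 25)},
           {"ts": (30,), "latency": (12,)}))
p2 = Part("all_2_2_0", (100,),
          ({"ts": (100, 110), "latency": (7, 9)},))

rows, scanned = query([p1, p2], 15, 35, ["ts", "latency"])
print(rows, scanned)
```

The loop over parts is the cost that grows with the part count: even a pruned part has to have its metadata checked, so a table with thousands of parts pays that overhead on every query.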

MergeTree.update_row

UPDATE is a part rewrite, not a row edit

def update_row(table, predicate, assignments):
    # parts are immutable: we never reopen an existing .bin
    matched = []
    for part in list_parts(table):
        rows = scan(part, predicate)
        for r in rows:
            matched.append(apply(r, assignments))
    # write a NEW part containing the rewritten rows
    new_id = next_part_id(table)         # e.g. all_11_11_0
    # table.columns / table.sort_key: assumed to hold the table's schema
    write_batch_to_part(matched, table.columns, table.sort_key, new_id)
    # original part stays on disk, untouched, until merge
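The rewrite-instead-of-edit behaviour can be demonstrated in memory. This is a sketch under stated assumptions: `Part` and `update_rows` are hypothetical names, rows are plain dicts, and the sort key is assumed to be `ts`. It only shows that an update adds a part and leaves existing ones alone; how readers reconcile the old and new versions of a row is not covered in this scene, so the sketch leaves that out.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Part:
    name: str
    rows: tuple  # rows are never edited after the part is created


def update_rows(parts, predicate, assignments, new_name, sort_key="ts"):
    # Build rewritten copies; {**row, **assignments} never mutates `row`.
    rewritten = [
        {**row, **assignments}
        for part in parts
        for row in part.rows
        if predicate(row)
    ]
    rewritten.sort(key=lambda r: r[sort_key])
    # The table after the update is the old parts plus one new part.
    return parts + [Part(new_name, tuple(rewritten))]


before = [Part("all_1_1_0", ({"ts": 10, "country": "US"},
                             {"ts": 20, "country": "FR"}))]
after = update_rows(before, lambda r: r["country"] == "FR",
                    {"country": "DE"}, "all_2_2_0")

print([p.name for p in after])
print(after[0] is before[0])  # the original part object is reused as-is

try:
    before[0].rows = ()       # reassigning a field on a frozen part fails
except FrozenInstanceError:
    print("parts are immutable")
```

Because every update adds a part rather than shrinking one, updates push the part count up just as inserts do.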
