Build a columnar OLAP store (ClickHouse / Druid style) (13 scenes)
Scene 06 · A part — one batch, frozen on disk
Each batched write lands as an immutable directory of column files plus an index — a part. A table is a stack of parts.
Previously

If every batch becomes its own self-contained file on disk, we need a name and a shape for that unit — because everything operational from here on out is about how many of these units a query has to touch.

Diagram
Left panel: a TABLE container labelled 'events' filled with a vertical stack of part tiles, each carrying its name, timestamp range, row count, level, and an 'immutable' badge. Right panel: the PART DIRECTORY of the currently expanded part — one .bin file per column (ts, country, latency, user_id), the matching .mrk2 marks, primary.idx (sparse index), and metadata (columns.txt, checksums.txt, count.txt). A query badge above the stack appears when there are many parts: 'each query asks every part'. An UPDATE shakes the targeted part and drops a new fresh part on top; the original is untouched.
Initial state: the table 'events' holds 0 parts, no part is selected in the part directory panel, and the first batch is about to be written.
Write one batch. It lands as a single self-contained part directory — one .bin per column, a primary.idx, and metadata. Write a second batch and a second part appears. The table is now exactly two parts; neither will ever be mutated in place.
Implementation
MergeTree.write_batch_to_part
one batch in → one self-contained part directory out
def write_batch_to_part(rows, columns, sort_key, part_id):
    # table_path: the table's data directory (assumed to be in scope)
    rows = sort(rows, by=sort_key)              # a part is stored in sort-key order
    part_dir = mkdir(f'{table_path}/{part_id}')
    for col in columns:                         # one .bin per column
        values = project(rows, col)
        write_compressed(f'{part_dir}/{col}.bin', values)
        write_marks(f'{part_dir}/{col}.mrk2', values)   # the matching marks file
    write_primary_idx(f'{part_dir}/primary.idx',
                      first_of_each_granule(rows, sort_key))  # sparse index
    write_text(f'{part_dir}/columns.txt', columns)
    write_text(f'{part_dir}/count.txt', len(rows))
    write_text(f'{part_dir}/checksums.txt', checksums(part_dir))
    fsync_dir(part_dir)                         # part is now immutable
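The pseudocode above leaves its helpers undefined. A minimal runnable sketch of the same idea might look like the following. It is a toy format, not ClickHouse's on-disk layout: the JSON-plus-zlib encoding, the `write_part` name, and the two-row `GRANULE_ROWS` are assumptions made for illustration, and the marks and checksums files are left out to keep it short.

```python
import json
import os
import tempfile
import zlib

GRANULE_ROWS = 2  # hypothetical granule size; real systems use thousands of rows


def write_part(table_path, part_id, rows, columns, sort_key):
    """Write one batch as a self-contained part directory (toy format)."""
    rows = sorted(rows, key=lambda r: r[sort_key])
    part_dir = os.path.join(table_path, part_id)
    os.mkdir(part_dir)
    for col in columns:
        values = [r[col] for r in rows]
        # Toy encoding (an assumption): JSON, then zlib. One file per column.
        with open(os.path.join(part_dir, f"{col}.bin"), "wb") as f:
            f.write(zlib.compress(json.dumps(values).encode()))
    # Sparse index: the sort-key value of the first row of each granule.
    index = [rows[i][sort_key] for i in range(0, len(rows), GRANULE_ROWS)]
    with open(os.path.join(part_dir, "primary.idx"), "w") as f:
        json.dump(index, f)
    with open(os.path.join(part_dir, "count.txt"), "w") as f:
        f.write(str(len(rows)))
    return part_dir


table_path = tempfile.mkdtemp()
batch = [
    {"ts": 30, "country": "DE"},
    {"ts": 10, "country": "US"},
    {"ts": 20, "country": "FR"},
]
part_dir = write_part(table_path, "all_1_1_0", batch, ["ts", "country"], "ts")

files = sorted(os.listdir(part_dir))
with open(os.path.join(part_dir, "primary.idx")) as f:
    sparse_index = json.load(f)

print(files)         # ['count.txt', 'country.bin', 'primary.idx', 'ts.bin']
print(sparse_index)  # [10, 30]
```

The index holds one entry per granule rather than one per row, which is why it stays small enough to keep in memory even for large parts.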
MergeTree.query_table
a table is a bag of parts — fan out, then merge results
def query_table(table, predicate, projection):
    parts = list_parts(table)                   # every part on disk
    streams = []
    for part in parts:                          # ask every part
        if part.minmax_excludes(predicate):
            continue                            # pruned by metadata
        granules = scan_primary_idx(part, predicate)
        for col in projection:
            streams.append(read_bin(part, col, granules))
    return merge_streams(streams)
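To make the two pruning steps concrete, here is a small in-memory sketch of the fan-out for a range predicate on the sort key. The part layout (a dict with `ts`, `index`, `min`, `max`), the helper names, and the two-row granule size are all assumptions for illustration. It counts how many parts and granules actually get read.

```python
from bisect import bisect_left, bisect_right

GRANULE_ROWS = 2  # hypothetical granule size


def make_part(ts_values):
    """Build a toy in-memory part: a sorted column plus its sparse index."""
    ts = sorted(ts_values)
    index = ts[::GRANULE_ROWS]  # first key of each granule
    return {"ts": ts, "index": index, "min": ts[0], "max": ts[-1]}


def query_range(parts, lo, hi):
    """Return all ts in [lo, hi], plus how many parts and granules were read."""
    out, parts_read, granules_read = [], 0, 0
    for part in parts:
        if part["max"] < lo or part["min"] > hi:
            continue  # pruned by min/max metadata, no column data touched
        parts_read += 1
        index = part["index"]
        # The granule just before the first index entry >= lo may still
        # contain lo, so start one granule earlier (clamped at 0).
        first = max(bisect_left(index, lo) - 1, 0)
        # Granules whose first key is > hi cannot contain a match.
        end = bisect_right(index, hi)
        for g in range(first, end):
            granules_read += 1
            granule = part["ts"][g * GRANULE_ROWS:(g + 1) * GRANULE_ROWS]
            out.extend(v for v in granule if lo <= v <= hi)
    return sorted(out), parts_read, granules_read


parts = [
    make_part([1, 2, 3, 4]),
    make_part([5, 6, 7, 8]),
    make_part([9, 10, 11, 12]),
]
result, parts_read, granules_read = query_range(parts, 6, 9)
print(result, parts_read, granules_read)  # [6, 7, 8, 9] 2 3
```

The first part is skipped on metadata alone. The other two are opened, but the sparse index limits the third to a single granule. Every part that survives pruning still costs a separate index lookup, which is why part count matters to query latency.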
MergeTree.update_row
UPDATE is a part rewrite, not a row edit
def update_row(table, predicate, assignments):
    # parts are immutable: we never reopen an existing .bin
    matched = []
    for part in list_parts(table):
        rows = scan(part, predicate)
        for r in rows:
            matched.append(apply(r, assignments))
    if not matched:
        return                                  # nothing matched, so no new part
    # write a NEW part containing the rewritten rows
    # (columns and sort_key are assumed to live on the table object)
    new_id = next_part_id(table)                # e.g. all_11_11_0
    write_batch_to_part(matched, table.columns, table.sort_key, new_id)
    # original part stays on disk, untouched, until merge
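The same behaviour can be sketched in memory. In this toy model a part is a tuple of read-only row mappings, and an update returns a parts list that is one part longer. The `freeze` and `update` helpers are hypothetical names. How the old and new row versions are reconciled at read or merge time is deliberately left out.

```python
from types import MappingProxyType


def freeze(rows):
    """Make a part read-only: a tuple of read-only row mappings."""
    return tuple(MappingProxyType(dict(r)) for r in rows)


def update(parts, predicate, assignments):
    """Return a new parts list with one extra part holding the rewritten rows."""
    matched = [
        {**row, **assignments}  # copy the row, then apply the assignments
        for part in parts
        for row in part
        if predicate(row)
    ]
    if not matched:
        return parts  # nothing to rewrite, so no new part
    return parts + [freeze(matched)]


parts = [
    freeze([{"user_id": 1, "country": "DE"}, {"user_id": 2, "country": "US"}]),
    freeze([{"user_id": 3, "country": "US"}]),
]
new_parts = update(parts, lambda r: r["country"] == "US", {"country": "CA"})

print(len(new_parts))                     # 3
print([dict(r) for r in new_parts[-1]])   # the two rewritten rows
print(new_parts[0] is parts[0])           # True: the original part is untouched
```

Trying to assign into a frozen row, for example `parts[0][0]["country"] = "FR"`, raises a `TypeError`, which mirrors the rule that an existing part is never edited in place.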