Build a distributed logging stack (ELK / Loki) (12 scenes)
Scene 05 · Inverted index vs labels-only index
ELK tokenises every value into a per-term posting list; Loki hashes the label-set into a stream id and appends the line verbatim — heavy index + small body vs tiny index + verbatim body.
Previously
Same line, two vocabularies — **fields** in ELK, **labels** in Loki. Now watch what each system actually writes to disk when that one line lands.
Diagram
Top: one source log line as a pill, the verbatim string an api pod in prod just emitted. The pill forks via a Y-arrow into two parallel disk panels.
LEFT (ELK): a Lucene segment tile (max 5 GB) holding an **inverted-index** sub-panel: a vertical list of TERMS (api, ERROR, checkout_failed, 42, …), each with its sorted POSTING LIST of doc-ids. A mapping badge above tracks the field budget against the 1000-field limit.
RIGHT (Loki): a **chunk** tile (compressed, target 1.5 MB) for the {app=api,env=prod} stream, with the line appended VERBATIM to the recent-lines list, neither parsed nor tokenised. Below it, a tiny INDEX panel shows ONE entry: ({app=api,env=prod}, time-window) → chunk_id. Crucially, nothing about user_id=42 or event=checkout_failed appears in the index.
Bottom: bytes-on-disk meters land ELK at ~1.3× raw and Loki at ~0.2× raw.
One source line, two disks. We want to find every line containing `ERROR` later — so we pre-build a lookup from word → list of lines containing it. That structure is called an **inverted index**, and it's what the LEFT (ELK) panel is filling, one token at a time. The RIGHT (Loki) panel does something different: it indexes ONLY the labels — `{app=api,env=prod}` — and appends the line VERBATIM to a per-stream blob called a **chunk**. Watch both sides land the same line.
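To make the contrast concrete, here is a minimal in-memory sketch of both write paths and the later `ERROR` lookup. The sample lines, the regex tokenizer, and the `chunk-0` id are made up for illustration; real Lucene analyzers and Loki chunks are far more involved.

```python
import re
from collections import defaultdict

def tokenize(line):
    # Hypothetical analyzer: split on anything that is not a letter, digit, or underscore.
    return [t for t in re.split(r"[^A-Za-z0-9_]+", line) if t]

# Made-up sample lines, all from the same {app=api,env=prod} stream.
lines = [
    "level=ERROR event=checkout_failed user_id=42",
    "level=INFO event=checkout_ok user_id=7",
    "level=ERROR event=payment_timeout user_id=42",
]

# ELK-style: one posting list per term, holding the ids of the lines that contain it.
postings = defaultdict(list)
for doc_id, line in enumerate(lines):
    for term in set(tokenize(line)):
        postings[term].append(doc_id)  # doc_ids only grow, so each list stays sorted

# Loki-style: the index knows the label set only; the lines sit verbatim in the chunk.
label_index = {("app=api", "env=prod"): "chunk-0"}
chunks = {"chunk-0": list(lines)}

# "Find every line containing ERROR" on each side.
elk_hits = postings.get("ERROR", [])             # direct lookup, no scanning
chunk_id = label_index[("app=api", "env=prod")]  # labels narrow the search to one chunk
loki_hits = [i for i, line in enumerate(chunks[chunk_id]) if "ERROR" in line]  # then scan it

print(elk_hits, loki_hits)
```

Both sides return the same lines. The difference is where the work happens: the inverted index pays at write time for every token, while the labels-only index pays at read time by scanning the selected chunk.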
Implementation
Lucene.index_doc
ELK side: tokenise every field, grow a posting list per term
def index_doc(doc, segment):
    for field, value in doc.items():
        if mapping.fields_used >= 1000:  # total_fields.limit
            raise MappingError('mapping explosion')
        mapping.ensure_field(field)
        for term in analyzer.tokenize(value):
            postings[term].append_sorted(doc.id)
    segment.store_source(doc)  # original kept for _source
    if segment.size_bytes >= 5 * GB:  # max_merged_segment
        segment.seal()
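The field budget is easiest to see in isolation. The sketch below is a hypothetical, self-contained `Mapping` class (not Elasticsearch's API) that checks the limit only when a field is new, then shows how per-user field names burn through a 1000-field budget.

```python
class MappingError(Exception):
    """Raised when a new field would exceed the mapping's field budget."""

class Mapping:
    # Hypothetical stand-in for an index mapping; only tracks field names.
    def __init__(self, limit=1000):
        self.limit = limit
        self.fields = set()

    def ensure_field(self, field):
        if field in self.fields:
            return  # known fields never count against the budget again
        if len(self.fields) >= self.limit:
            raise MappingError(f"limit of {self.limit} fields reached")
        self.fields.add(field)

mapping = Mapping(limit=1000)

# A well-behaved schema: a handful of stable field names.
for name in ("app", "env", "level", "event", "user_id"):
    mapping.ensure_field(name)
fields_after_normal_docs = len(mapping.fields)

# Anti-pattern: the user id is baked into the field NAME, so every user adds a field.
rejected_at = None
for i in range(2000):
    try:
        mapping.ensure_field(f"user_{i}_latency")
    except MappingError:
        rejected_at = i
        break

print(fields_after_normal_docs, rejected_at)
```

The five stable fields leave room for 995 more, so the loop is rejected on the 996th dynamic name. Keeping the user id as a value of one `user_id` field avoids the explosion entirely.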
Loki.write_line
Loki side: hash label-set → stream_id; append line verbatim
def write_line(labels, line, ts):
    stream_id = hash(canonicalize(labels))
    if stream_id not in streams:
        streams[stream_id] = Chunk(labels=labels)
    chunk = streams[stream_id]
    chunk.append(ts, line)  # raw bytes, NOT parsed
    if chunk.compressed_size >= 1_500_000:  # chunk_target_size
        flush_chunk(stream_id, chunk)
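The `hash(canonicalize(labels))` step can be sketched as follows. This is an illustration under assumptions, not Loki's actual fingerprinting: `canonicalize` simply sorts label names, and SHA-256 stands in for the real hash because Python's built-in `hash()` on strings changes between processes.

```python
import hashlib

def canonicalize(labels):
    # Sort by label name so that insertion order cannot change the stream identity.
    return ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))

def stream_id(labels):
    # Stable across processes; truncated to 16 hex chars for readability.
    return hashlib.sha256(canonicalize(labels).encode("utf-8")).hexdigest()[:16]

a = stream_id({"app": "api", "env": "prod"})
b = stream_id({"env": "prod", "app": "api"})                   # same labels, other order
c = stream_id({"app": "api", "env": "prod", "user_id": "42"})  # one extra label

# Promoting a high-cardinality value to a label creates one stream per value.
streams = {
    stream_id({"app": "api", "env": "prod", "user_id": str(u)})
    for u in range(1000)
}

print(a == b, a == c, len(streams))
```

Label order does not matter, but any extra label yields a different stream and therefore a separate chunk. That is why values such as `user_id=42` stay in the line body rather than in the label set.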
Loki.flush_chunk
close on size or idle, upload, write ONE index entry
def flush_chunk(stream_id, chunk):
    # closes on size (1.5 MB compressed), age (2h), idle (30m)
    if chunk.idle_for() > 30 * MINUTES:
        chunk.seal()
    blob = chunk.encode()  # compressed block-by-block
    chunk_id = object_store.put(blob)
    # ONE index row: labels only, body never parsed
    index.append(stream_id, chunk.time_range, chunk_id)
    del streams[stream_id]
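The three close conditions named in the comment (size, age, idle) can be written as one decision function. This is a rough sketch using the thresholds quoted above; `ChunkState` and `flush_reason` are hypothetical names, and the order in which the conditions are checked is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

CHUNK_TARGET_SIZE = 1_500_000    # bytes, compressed
MAX_CHUNK_AGE = 2 * 60 * 60      # seconds (2h)
CHUNK_IDLE_PERIOD = 30 * 60      # seconds (30m)

@dataclass
class ChunkState:
    compressed_size: int   # bytes written so far, after compression
    created_at: float      # timestamp of the first append, in seconds
    last_append_at: float  # timestamp of the most recent append, in seconds

def flush_reason(chunk: ChunkState, now: float) -> Optional[str]:
    """Return why the chunk should be closed, or None to keep appending."""
    if chunk.compressed_size >= CHUNK_TARGET_SIZE:
        return "size"
    if now - chunk.created_at > MAX_CHUNK_AGE:
        return "age"
    if now - chunk.last_append_at > CHUNK_IDLE_PERIOD:
        return "idle"
    return None

full = flush_reason(ChunkState(1_600_000, created_at=0, last_append_at=0), now=10)
old = flush_reason(ChunkState(1_000, created_at=0, last_append_at=7_000), now=7_300)
quiet = flush_reason(ChunkState(1_000, created_at=0, last_append_at=100), now=2_000)
live = flush_reason(ChunkState(1_000, created_at=0, last_append_at=100), now=600)

print(full, old, quiet, live)
```

Whichever condition fires, the outcome is the same: the chunk is sealed, uploaded, and recorded as a single index row.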