Hot, warm, cold, frozen

Build a distributed logging stack (ELK / Loki) (12 scenes)

Scene 07 · Hot, warm, cold, frozen

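Roughly two orders of magnitude in cost separate NVMe from Deep Archive, and that gap forces tiering. ELK's ILM walks an index through four storage phases (hot, warm, cold, frozen) before its final delete phase; Loki writes the body to S3 from day one and leaves it there.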

Previously

High-cardinality data has to live in the body, and the body is BIG and most of it is OLD. NVMe is too expensive to hold months of body — economics force tiering.
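To make the economics concrete, here is a back-of-the-envelope sketch. The $0.10/GB-month NVMe price is the figure this scene quotes for the HOT rung; the archive price is simply that figure divided by 100 (the "two orders of magnitude" claim), and the 10 TB body is a made-up workload, not a measurement.

```python
# Back-of-the-envelope storage cost; every input here is an illustrative assumption.
HOT_PRICE = 0.10                 # $/GB-month, NVMe (the scene's HOT figure)
ARCHIVE_PRICE = HOT_PRICE / 100  # $/GB-month, assumed: two orders of magnitude cheaper


def monthly_cost(gb: float, price_per_gb_month: float) -> float:
    """Cost of keeping `gb` gigabytes resident for one month."""
    return gb * price_per_gb_month


BODY_GB = 10_000  # hypothetical: 10 TB of retained log body

hot_cost = monthly_cost(BODY_GB, HOT_PRICE)
archive_cost = monthly_cost(BODY_GB, ARCHIVE_PRICE)

print(f"all on NVMe:    ${hot_cost:,.2f}/month")
print(f"all on archive: ${archive_cost:,.2f}/month")
```

The gap scales linearly with volume and retention, which is why it dominates once the body spans months rather than days.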

Watch

Diagram

A vertical four-rung ladder labelled HOT / WARM / COLD / FROZEN on the LEFT lane (ELK Index Lifecycle Management). Each rung shows hardware, latency, replica count, and per-GB-month cost. An ILM clock above the ladder ticks days; an index tile is born on HOT and slides DOWN rung-by-rung as it ages. A 'searchable snapshot' callout pins to COLD/FROZEN — the primitive that lets those rungs exist without re-indexing. To the right, a SECOND lane labelled LOKI shows the entire body collapsed onto a single S3-Standard rung from day 0. Cost-per-month meters at the bottom compare ELK total vs Loki total for the same volume × retention. INGEST-STALL badge fires on HOT when the policy is misconfigured (HOT-only with no demotion targets).
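One plausible reading of the INGEST-STALL badge is that a HOT-only policy lets the hot tier fill until it can no longer accept writes. The sketch below models only that reading; `days_until_stall` and its parameters are hypothetical names, and the capacity and ingest figures in the usage lines are invented.

```python
from typing import Optional


def days_until_stall(
    hot_capacity_gb: float,
    ingest_gb_per_day: float,
    hot_retention_days: Optional[int] = None,
) -> Optional[float]:
    """Days until HOT is full, or None if demotion keeps it from filling.

    hot_retention_days=None models a HOT-only policy with no demotion target,
    so data accumulates on HOT indefinitely.
    """
    if hot_retention_days is not None:
        # With demotion, HOT holds at most `hot_retention_days` of ingest.
        steady_state_gb = ingest_gb_per_day * hot_retention_days
        if steady_state_gb <= hot_capacity_gb:
            return None
    return hot_capacity_gb / ingest_gb_per_day


print(days_until_stall(1000, 100))                        # HOT-only policy
print(days_until_stall(1000, 100, hot_retention_days=7))  # demote after 7 days
```

With demotion in place the hot tier reaches a steady state; without it, the only question is how many days remain.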

Day 0 — HOT (NVMe, ~$0.10/GB-mo)

Watch one index tile age. It's born today on HOT (NVMe, replicated, ms latency). The ILM clock advances 7 days — it slides to WARM (force-merged, fewer replicas). 30 days — it slides to COLD (a searchable snapshot in S3, fully mounted, ~50% disk savings, no replicas). 90 days — it slides to FROZEN (small NVMe cache, partially mounted from S3, up to 20× warm capacity). To the right, the Loki lane shows the same body parked on S3-Standard from day 0 — no movement, ever.
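The walkthrough above reduces to a lookup from age to phase. A minimal sketch using the scene's 7 / 30 / 90 day thresholds follows; `phase_for_age` is a hypothetical helper, not an Elasticsearch API.

```python
# Thresholds taken from the walkthrough: 7, 30 and 90 days.
# Ordered oldest-first so the first match is the deepest phase reached.
PHASE_MIN_AGE_DAYS = [("frozen", 90), ("cold", 30), ("warm", 7), ("hot", 0)]


def phase_for_age(age_days: int) -> str:
    """Return the phase an index of the given age should occupy."""
    for phase, min_age in PHASE_MIN_AGE_DAYS:
        if age_days >= min_age:
            return phase
    raise ValueError("age_days must be non-negative")


for age in (0, 7, 30, 90):
    print(age, phase_for_age(age))
```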

Implementation

ILM.tick

the daily clock — for each index, evaluate phase actions

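# Runs once per simulated day on the master node
# (real ILM polls every 10 minutes by default).
def ilm_tick():
    for index in cluster.indices:
        age = now() - index.creation_date
        policy = index.ilm_policy
        if index.phase == 'hot' and age >= policy.hot.max_age:
            rollover(index)              # start a fresh write index
        # Test the oldest phase first so each index lands in exactly one phase.
        if age >= policy.frozen.min_age:
            target = 'frozen'
        elif age >= policy.cold.min_age:
            target = 'cold'
        elif age >= policy.warm.min_age:
            target = 'warm'
        else:
            target = 'hot'
        if target != index.phase:        # index.phase: the phase it occupies now
            demote(index, phase=target)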
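In a real cluster these thresholds live in an ILM policy document rather than in code. The sketch below builds such a policy body as a Python dict, using the scene's 7 / 30 / 90 day ages. The repository name and the warm replica count are assumptions. Elasticsearch measures `min_age` from rollover time once a rollover action is present, so the day counts would only approximately match the animation.

```python
import json

SNAPSHOT_REPO = "logs-s3-repo"  # hypothetical snapshot repository name

# Body for PUT _ilm/policy/<policy-name>; ages follow the scene's walkthrough.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_age": "7d"}},
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "forcemerge": {"max_num_segments": 1},
                    "allocate": {"number_of_replicas": 1},  # assumed replica count
                },
            },
            "cold": {
                "min_age": "30d",
                "actions": {
                    # in the cold phase this yields a fully mounted index
                    "searchable_snapshot": {"snapshot_repository": SNAPSHOT_REPO},
                },
            },
            "frozen": {
                "min_age": "90d",
                "actions": {
                    # in the frozen phase this yields a partially mounted index
                    "searchable_snapshot": {"snapshot_repository": SNAPSHOT_REPO},
                },
            },
        }
    }
}

print(json.dumps(ilm_policy, indent=2))
```

A policy with only the `hot` entry is the HOT-only misconfiguration the diagram warns about.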

ILM.demote

the phase transition — force_merge, snapshot, partial mount

def demote(index, phase):
    if phase == 'warm':
        force_merge(index, max_num_segments=1)
        set_replicas(index, 1)                 # fewer replicas than HOT; zero waits for COLD
        allocation.require(data_warm)
    elif phase == 'cold':
        snap = searchable_snapshot(index, repo=s3)
        mount(snap, storage='full_copy')       # fully mounted, ~50% disk savings
        delete_local_index(index)
    elif phase == 'frozen':
        snap = searchable_snapshot(index, repo=s3)
        mount(snap, storage='shared_cache')    # partially mounted: NVMe cache + S3
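In Elasticsearch's mount API, the difference between a fully and a partially mounted index is the `storage` query parameter: `full_copy` or `shared_cache`. The sketch below only builds the request and sends nothing; `mount_request` is a hypothetical helper, and the repository, snapshot and index names in the usage lines are made up.

```python
from typing import Dict, Tuple
from urllib.parse import quote


def mount_request(
    repo: str, snapshot: str, index: str, partial: bool
) -> Tuple[str, str, Dict[str, str]]:
    """Build (method, path, body) for mounting a snapshot as a searchable index.

    partial=False -> storage=full_copy     (cold: full local copy of the data)
    partial=True  -> storage=shared_cache  (frozen: small local cache backed by S3)
    """
    storage = "shared_cache" if partial else "full_copy"
    path = (
        f"/_snapshot/{quote(repo, safe='')}/{quote(snapshot, safe='')}"
        f"/_mount?storage={storage}"
    )
    return "POST", path, {"index": index}


print(mount_request("logs-s3-repo", "snap-2024.01.01", "logs-2024.01.01", partial=False))
print(mount_request("logs-s3-repo", "snap-2024.01.01", "logs-2024.01.01", partial=True))
```

The same snapshot can back both mounts, which is why moving from COLD to FROZEN needs no re-indexing.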

Loki.compact_index

Loki's 'one tier' reality — only the index gets compacted

# Chunks were written to S3 the moment they flushed;
# the body never moves. Only the index is compacted.
def compact_index():
    shards = list_index_shards(object_store)
    for day in shards.by_day():
        merged = merge_boltdb_shards(day)  # many small index files become one per day
        upload(merged, object_store)
        delete(day.original_shards)
    # chunks: untouched, still on S3-Standard from day 0
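The compaction idea can be shown with a toy in-memory model: many small per-day index files are unioned into one file per day, and duplicate entries collapse. This is an illustration only and not Loki's compactor code. The `(day, entries)` input shape and the entry strings are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def compact_index(index_files: Iterable[Tuple[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Merge per-day index files into one deduplicated entry set per day.

    index_files: (day, entries) pairs, one pair per small uploaded index file.
    """
    merged: Dict[str, Set[str]] = defaultdict(set)
    for day, entries in index_files:
        merged[day] |= entries  # set union drops duplicate entries
    return dict(merged)


# Three small files for day 1 and one for day 2 (hypothetical entries).
files = [
    ("day-1", {"app=api -> chunk-a", "app=db -> chunk-b"}),
    ("day-1", {"app=api -> chunk-a"}),  # duplicate of an entry above
    ("day-1", {"app=web -> chunk-c"}),
    ("day-2", {"app=api -> chunk-d"}),
]
compacted = compact_index(files)
print({day: len(entries) for day, entries in compacted.items()})
```

The chunk names appear only as references inside index entries; the chunks themselves are never read or rewritten.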
