Build a CDC pipeline (Debezium + outbox) (12 scenes)
Scene 06 · Snapshot, then stream
Debezium runs a consistent snapshot (op=r events), records the LSN at snapshot time, then switches to streaming from that LSN — so history and live updates stitch at one seam.
Previously

The slot tracks where the connector left off in the WAL — but the WAL only goes back so far, so a connector attaching to a database with a billion existing rows cannot replay history from the log alone. So how does Debezium hand a fresh consumer the entire table without the log to lean on?

Diagram
**Snapshot**: the initial bulk read of every existing row, emitted as op=r ('read') events; once the read is exhausted, Debezium switches to streaming the WAL.
**op=r/c/u/d**: operation codes Debezium tags onto every change event: r=read (snapshot), c=create, u=update, d=delete.
**LSN**: a monotonically increasing byte offset into the WAL that identifies each record in the log, commits included; the snapshot LSN is the seam where the connector switches from history to live.
**change event**: the structured envelope Debezium emits per row change, carrying op, before, after, and source.
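To make the envelope concrete, here is a rough sketch of a snapshot event and a streamed update as they might look once deserialized into Python dicts. The top-level names op, before, after and source come from the definitions above; the `accounts` rows, the LSN values, the `table`, `lsn` and `snapshot` keys inside `source`, and the helpers `describe` and `is_snapshot_event` are illustrative assumptions, and a real payload carries more fields than shown.

```python
# Illustrative change events as plain dicts (simplified, not the full payload).
snapshot_event = {
    "op": "r",                       # read: produced by the initial snapshot
    "before": None,                  # a snapshot row has no prior image
    "after": {"id": 42, "balance": 100},
    "source": {"table": "accounts", "lsn": 1000, "snapshot": "true"},
}

update_event = {
    "op": "u",                       # update: produced by the WAL stream
    "before": {"id": 42, "balance": 100},
    "after": {"id": 42, "balance": 75},
    "source": {"table": "accounts", "lsn": 1017, "snapshot": "false"},
}

OP_NAMES = {"r": "read", "c": "create", "u": "update", "d": "delete"}


def describe(evt):
    """Return a one-line, human-readable summary of a change event."""
    src = evt["source"]
    return f"{OP_NAMES[evt['op']]} on {src['table']} at lsn={src['lsn']}"


def is_snapshot_event(evt):
    """True for events emitted by the bulk read rather than the WAL."""
    return evt["op"] == "r"
```

A consumer that only cares about live traffic could filter on `is_snapshot_event`, while one that is rebuilding a table would treat op=r like an insert.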
[Diagram: Snapshot → Stream, seam at snapshot LSN. Lanes: snapshot at t0 (op=r, mode = initial, 0 of 100,000,000 rows snapshotted), snapshot LSN marker (lsn=1000), concurrent writers during the snapshot, WAL (op=c/u/d, post-seam, head at lsn 1000), Kafka topic (waiting).]
The example is a 100M-row accounts table. As the snapshot bar drains, it emits op=r events tagged with the snapshot LSN, while concurrent writes accumulate in the thin lane above the WAL strip. Once the bar empties, the seam is crossed and live events from the WAL (op=u in this run) begin to flow.
Implementation
Connector.initialSnapshot
consistent bulk read, then handoff at snapshotLsn
def initialSnapshot():
    if snapshot.mode == 'never':
        return streamFromLsn(currentLsn())  # skip history
    # One transaction gives every table the same point-in-time view.
    tx = db.begin('REPEATABLE READ')
    snapshotLsn = tx.currentLsn()
    if snapshot.mode != 'no_data':
        for table in captured.tables:
            for row in tx.select_all(table):
                emit(op='r', after=row, lsn=snapshotLsn)
    tx.commit()
    streamFromLsn(snapshotLsn)  # seam
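The `snapshot.mode` branch above is driven by connector configuration. As a hedged sketch, the payload below shows how such a connector might be registered through the Kafka Connect REST API, built as a Python dict and serialized to JSON. The property names follow the Debezium PostgreSQL connector as I understand it and should be checked against the version in use; the hostname, credentials, slot name, and table list are placeholders, and the helper `with_snapshot_mode` is hypothetical.

```python
import json

# Hypothetical registration payload; every value here is a placeholder.
connector = {
    "name": "accounts-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "db.example.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "change-me",
        "database.dbname": "bank",
        "topic.prefix": "bank",
        "slot.name": "accounts_slot",
        "table.include.list": "public.accounts",
        "snapshot.mode": "initial",  # full snapshot, then stream
    },
}

# Only the modes discussed in this scene; real connectors accept more.
MODES_IN_THIS_SCENE = {"initial", "never", "no_data"}


def with_snapshot_mode(base, mode):
    """Return a copy of the payload with a different snapshot.mode."""
    if mode not in MODES_IN_THIS_SCENE:
        raise ValueError(f"mode not covered in this scene: {mode}")
    return {"name": base["name"], "config": {**base["config"], "snapshot.mode": mode}}


payload = json.dumps(connector, indent=2)
```

The resulting `payload` string is what would be sent in the body of a POST to the Connect worker's `/connectors` endpoint.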
Connector.streamFromLsn
tail the WAL forward from the recorded seam
def streamFromLsn(lsn):
    cursor = replicationSlot.startReplication(
        startLsn=lsn,
    )
    for change in cursor:
        # change.lsn >= snapshotLsn, so writes made concurrently
        # with the snapshot are replayed here too
        emit(
            op=change.op,  # c | u | d
            before=change.before,
            after=change.after,
            lsn=change.lsn,
        )
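The handoff between the two functions can be checked with a toy model. The sketch below is not Debezium code: it assumes the WAL is a plain list of `(lsn, op, pk, row)` tuples, that entries with an LSN below the seam are visible to the snapshot, and that everything at or past the seam is streamed. The names `apply_change` and `snapshot_then_stream` are made up for illustration.

```python
def apply_change(state, op, pk, row):
    """Apply one change to a dict keyed by primary key."""
    if op == "d":
        state.pop(pk, None)
    else:  # 'r', 'c' and 'u' all install the new row image
        state[pk] = row


def snapshot_then_stream(wal, snapshot_lsn):
    """Return the events a connector would emit in this toy model.

    wal is a list of (lsn, op, pk, row) tuples in LSN order. Entries with
    lsn < snapshot_lsn are visible to the snapshot; the rest are streamed.
    """
    # State the snapshot transaction sees: everything before the seam.
    visible = {}
    for lsn, op, pk, row in wal:
        if lsn < snapshot_lsn:
            apply_change(visible, op, pk, row)

    # Phase 1: every visible row becomes an op='r' event at the seam LSN.
    events = [
        {"op": "r", "pk": pk, "after": row, "lsn": snapshot_lsn}
        for pk, row in sorted(visible.items())
    ]
    # Phase 2: replay the log from the seam onward, including writes that
    # happened while the snapshot was still draining.
    events += [
        {"op": op, "pk": pk, "after": row, "lsn": lsn}
        for lsn, op, pk, row in wal
        if lsn >= snapshot_lsn
    ]
    return events


wal = [
    (900, "c", 1, {"balance": 100}),
    (950, "c", 2, {"balance": 50}),
    (1010, "u", 1, {"balance": 75}),  # concurrent with the snapshot
    (1020, "d", 2, None),
    (1030, "c", 3, {"balance": 10}),
]
events = snapshot_then_stream(wal, snapshot_lsn=1000)

# Replaying the emitted events should rebuild the same state as the full log.
rebuilt, truth = {}, {}
for e in events:
    apply_change(rebuilt, e["op"], e["pk"], e["after"])
for lsn, op, pk, row in wal:
    apply_change(truth, op, pk, row)
```

In this run `rebuilt` and `truth` end up identical, which is the property the single seam is meant to guarantee: no write is lost and none is applied out of order.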
Consumer.expectDuplicatesAtSeam
the dedupe contract pushed onto downstream consumers
# A row updated WHILE the snapshot was running surfaces twice:
#   1. op=r tagged at snapshotLsn (from the bulk read)
#   2. op=u tagged at change.lsn > snapshotLsn (from the WAL)
# The connector cannot suppress (2) without risking a dropped
# write, so dedupe is the consumer's job.
seen = {}  # (table, pk) -> highest LSN already applied

def onEvent(evt):
    key = (evt.table, evt.pk)
    if evt.lsn <= seen.get(key, -1):
        return  # already applied: drop
    apply(evt)
    seen[key] = evt.lsn
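The same contract can be exercised end to end with a small in-memory consumer. This is a minimal sketch under stated assumptions: events are plain dicts with `table`, `pk`, `op`, `after` and `lsn` keys, the class `DedupingConsumer` is hypothetical, and the redelivered events stand in for whatever replay a restart might cause.

```python
class DedupingConsumer:
    """Applies each (table, pk) change at most once, ordered by LSN."""

    def __init__(self):
        self.seen = {}   # (table, pk) -> highest LSN applied
        self.state = {}  # (table, pk) -> latest row image

    def on_event(self, evt):
        """Apply evt if it is new for its key; return True when applied."""
        key = (evt["table"], evt["pk"])
        if evt["lsn"] <= self.seen.get(key, -1):
            return False  # stale or duplicate: drop
        if evt["op"] == "d":
            self.state.pop(key, None)
        else:
            self.state[key] = evt["after"]
        self.seen[key] = evt["lsn"]
        return True


consumer = DedupingConsumer()
stream = [
    {"table": "accounts", "pk": 42, "op": "r", "after": {"balance": 100}, "lsn": 1000},
    {"table": "accounts", "pk": 42, "op": "u", "after": {"balance": 75}, "lsn": 1017},
    # Redelivery of the same update, for example after a restart.
    {"table": "accounts", "pk": 42, "op": "u", "after": {"balance": 75}, "lsn": 1017},
    # A stale snapshot read arriving late must not overwrite the update.
    {"table": "accounts", "pk": 42, "op": "r", "after": {"balance": 100}, "lsn": 1000},
]
applied = [consumer.on_event(e) for e in stream]
```

Because `seen` lives in memory here, it is lost when the process exits; a production consumer would likely need to persist the last applied LSN alongside the data it writes.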