Build a CDC pipeline (Debezium + outbox) (12 scenes)
Scene 06 · Snapshot, then stream
Debezium runs a consistent snapshot (op=r events), records the LSN at snapshot time, then switches to streaming from that LSN — so history and live updates stitch at one seam.
Previously
The slot tracks where the connector left off in the WAL — but the WAL only goes back so far, so a connector attaching to a database with a billion existing rows cannot replay history from the log alone. So how does Debezium hand a fresh consumer the entire table without the log to lean on?
Scene 06
Snapshot, then stream
Diagram
**Snapshot** — the initial bulk read of every existing row, emitted as op=r ('read') events; once exhausted, Debezium switches to streaming the WAL. **op=r/c/u/d** — operation codes Debezium tags onto every change event: r=read (snapshot), c=create, u=update, d=delete. **LSN** — a monotonic byte offset into the WAL that names every commit; the snapshot LSN is the seam where the connector switches from history to live. **change event** — the structured envelope Debezium emits per row change, carrying op, before, after, and source.
100M-row accounts table. Watch the snapshot bar drain emitting op=r events tagged with the snapshot LSN. Concurrent writes accumulate in the thin lane above the WAL strip. Once the bar empties, the seam is crossed and live op=u events from the WAL begin to flow.
Implementation
Connector.initialSnapshot
consistent bulk read, then handoff at snapshotLsn
1def initialSnapshot():2 if snapshot.mode == 'never':3 return streamFromLsn(currentLsn()) # skip history4 tx = db.begin('REPEATABLE READ')5 snapshotLsn = tx.currentLsn()6 if snapshot.mode != 'no_data':7 for table in captured.tables:8 for row in tx.select_all(table):9 emit(op='r', after=row, lsn=snapshotLsn)10 tx.commit()11 streamFromLsn(snapshotLsn) # seam
Connector.streamFromLsn
tail the WAL forward from the recorded seam
1def streamFromLsn(lsn):2 cursor = replicationSlot.startReplication(3 startLsn = lsn,4 )5 for change in cursor:6 # change.lsn >= snapshotLsn — concurrent7 # writes during snapshot replay here too8 emit(9 op = change.op, # c | u | d10 before = change.before,11 after = change.after,12 lsn = change.lsn,13 )
Consumer.expectDuplicatesAtSeam
the dedupe contract pushed onto downstream consumers
1# A row updated WHILE the snapshot was running surfaces twice:2# 1. op=r tagged at snapshotLsn (from the bulk read)3# 2. op=u tagged at change.lsn > snapshotLsn (from the WAL)4# The connector cannot suppress (2) without risking a dropped5# write, so dedupe is the consumer's job.6def onEvent(evt):7 key = (evt.table, evt.pk)8 if evt.lsn <= seen.get(key, -1):9 return # already applied — drop10 apply(evt)11 seen[key] = evt.lsn