Build a CDC pipeline (Debezium + outbox) (12 scenes)
Scene 09 · Outbox cleanup — pick your poison
Hard-delete after emit, tombstone + log compaction, partition-by-date drop — three trade-offs across WAL noise, race risk, and operational complexity. INSERT+DELETE same-tx collapses cleanup into the write.
Previously
The outbox closes the dual-write gap — but every business write now also writes a row whose entire purpose is to be deleted, so the table will grow forever unless we pick a cleanup strategy. Three strategies are on the table; each loses something different.
Scene 09
Outbox cleanup — pick your poison
Diagram
Three vertical lanes show the same write workload aged from t=0 to t=30d under three cleanup strategies: (1) hard-delete after the connector emits, (2) **tombstone** + **log compaction**, (3) date-partitioned drop. Each lane shows outbox-table rows, op=d events per hour in the WAL, and Kafka topic size at 1h / 24h / 30d. Term refresh — **outbox**: a dedicated table written inside the same transaction as the business write, born to die. **Single Message Transform (SMT)**: a per-message Kafka Connect plugin that reshapes events between connector and topic (Outbox Event Router maps aggregate_id→key, payload→value, and can filter op=d). New terms — **Tombstone**: a Kafka message with a null value that signals log compaction to drop the prior value for that key (this is the Kafka log-compaction tombstone — null payload — not a soft-delete row). **Log compaction**: Kafka's per-key 'keep latest' retention policy; when enabled on a topic, older values for the same key are eventually removed.
Three lanes, one workload. Watch each lane age across 30 days — the outbox row count, the op=d noise in the WAL, and the Kafka topic size diverge. Lane 1 stays small but pumps op=d events; lane 2 stays small via tombstones + compaction; lane 3 climbs daily and drops at month-end.
Implementation
Connector.hardDeleteAfterEmit
naive lane 1: connector emits, then deletes the row
1def hardDeleteAfterEmit(row):2 kafka.produce(3 topic = route(row),4 key = row.aggregate_id,5 value = row.payload,6 )7 # race: if the consumer lags, op=d may arrive8 # before the consumer has read op=c9 db.execute(10 'DELETE FROM outbox WHERE id = %s',11 row.id,12 )13 # WAL now records op=d; Outbox SMT must filter
Connector.tombstoneAndCompact
lane 2: emit a null-valued tombstone; Kafka compacts per key
1# topic configured cleanup.policy=compact2def tombstoneAndCompact(key):3 kafka.produce(4 topic = outbox_topic,5 key = key,6 value = None, # tombstone7 )8 # log compaction will eventually drop prior9 # values for this key — per-key 'keep latest'10 # race: lagging consumer may miss the value11 # before compaction reclaims it
Service.insertDeleteSameTx
lane 1 collapsed: cleanup folded into the write path
1def insertDeleteSameTx(event):2 db.execute('BEGIN')3 db.execute(4 'INSERT INTO outbox(...) VALUES (...)',5 event,6 )7 db.execute(8 'DELETE FROM outbox WHERE id = %s',9 event.id,10 )11 db.execute('COMMIT')12 # WAL has both records; Debezium emits op=c13 # then op=d atomically — sink ignores op=d