Engineering

Log-based vs query-based CDC: which method to use

All change data capture answers the same question — "what changed, and when?" — but the three methods answer it very differently. Log-based CDC reads the database's write-ahead log after commit; query-based and trigger-based CDC don't. That one difference drives almost everything that matters.

7 min read · Blunox

If you're replicating a source database into a warehouse or stream, you'll choose one of three capture methods. Log-based CDC tails the transaction log. Query-based CDC polls the table on a schedule. Trigger-based CDC installs database triggers that write to a shadow table. They differ on the things you'll actually be paged about: load on your primary, whether deletes get captured, how fresh the data is, and how much of your weekend the setup consumes.

The three CDC methods in one minute

Every method is a different answer to "how do I find out a row changed?"

For a broader primer on what CDC is and why it beats nightly batch reloads, see change data capture, explained. Here we're comparing the three ways to do it.

Query-based CDC

Query-based CDC is the one everyone reaches for first, because it needs nothing special: a timestamp or version column, and a scheduled job that pulls rows newer than the last high-water mark. It's genuinely fine for a lot of cases — a nightly sync of a reference table, a dashboard that can tolerate an hour of lag, a source you can't get replication access to.

The problems show up at the edges. It misses deletes. A deleted row simply stops appearing in your WHERE clause — there's no timestamp on a row that no longer exists, so your copy quietly keeps the stale record forever. It misses intermediate updates: if a row changes three times between polls, you see only the final state, which breaks any downstream logic that cares about the sequence. And it adds load to the source — every poll is a real query with a real scan, so tightening the interval to reduce lag directly increases pressure on the primary you're trying to protect. You end up trading freshness against source impact, and you can't win both.

Trigger-based CDC

Trigger-based CDC fixes the correctness gap: because a trigger fires inside the transaction, it sees deletes, and it sees every individual change in order. You install AFTER triggers on each tracked table that append the old and new row into a changelog table, then a reader drains that table.

The cost is that you've now put your capture logic on the write path. Every insert, update, and delete on the source does extra work — a second write into the audit table — inside the same transaction, so you've added latency and I/O to production writes, sometimes materially under load. Triggers are also brittle operationally: they're schema-coupled, so a column change can silently break capture; they multiply with every table you track; and they're application-visible surface area that has to be maintained, tested, and reasoned about during every migration. Powerful, but it makes your source database heavier and more fragile.

Query-based CDC asks the database what changed. Trigger-based CDC makes the database do extra work per write. Log-based CDC just reads what the database already wrote down.

Log-based CDC

Log-based CDC takes a different route entirely. Every durable database already writes a write-ahead log — the ordered record of committed changes it uses for crash recovery and replication. Log-based CDC reads that log after commit, decoding each change from a stream the database produces whether you're watching or not.

That single design choice fixes the trade-offs the other two methods force on you:

The catch is setup: log-based CDC needs replication-level access (a replication slot, binlog enabled, supplemental logging), and decoding each engine's log format is non-trivial. That's the work you're paying for — and increasingly the work you hand to a tool rather than build yourself.

Choosing a method

Line the three up against the dimensions you'll be judged on:

For real-time, complete, low-impact replication, log-based CDC wins — it's why it's the default under most serious streaming pipelines. But be honest about when simpler is enough: if a source only ever appends, never deletes, and an hour of lag is fine, query-based CDC is less machinery and perfectly correct. Match the method to the requirement, not to the hype.

How Blunox does it

Blunox Pulse uses log-based CDC. It reads the write-ahead log after commit — no triggers on your tables, no polling queries, no load on your primary — and delivers changes with exactly-once semantics, so every insert, update, and delete lands downstream once and in order. You get the completeness and freshness of log-based capture without owning the log-decoding and slot-management plumbing yourself.