Guide

How to replicate Postgres to BigQuery in real time

Moving Postgres to BigQuery sounds like a plumbing problem, but the details — deletes, schema drift, type mapping, and lag — are where most pipelines quietly break. Here's how the approaches actually compare and what a real-time setup looks like.

8 min read · Blunox

You have an operational Postgres database serving your app, and you want that data in BigQuery for analytics. On paper, replicating Postgres to BigQuery is straightforward: read the rows, write them to a table. In practice, the interesting decisions are about how fresh the data needs to be, how you handle changes over time, and how much load you're willing to put on your production database to get it there.

Why replicate Postgres to BigQuery

Postgres is a transactional database. It's tuned for many small, concurrent reads and writes that keep your product running. BigQuery is an analytical warehouse, tuned for scanning billions of rows to answer one big question. Trying to run dashboards, ad-hoc analysis, and ML feature pipelines directly against production Postgres is a recipe for lock contention, slow queries, and the occasional 2 a.m. page.

Replication solves this by giving analysts and models their own copy in BigQuery — one they can hammer with expensive queries without touching the database that keeps your app alive. The goal is a warehouse copy that's correct, current, and cheap to keep in sync. The three approaches below hit those goals to very different degrees.

The three approaches, and their trade-offs

Periodic dump and load. The simplest option: run a full extract (`pg_dump`, `COPY`, or a `SELECT *`), drop it in Cloud Storage, and load it into BigQuery on a schedule. It's easy to reason about and needs no special Postgres configuration. But it re-reads the entire table every run, so it's slow and expensive at scale, it hammers your production database during the extract, and "real time" means "as fresh as your last nightly job." Fine for small, slowly-changing tables; painful for anything large or busy.

Query-based incremental. Instead of the whole table, you pull only rows where an updated_at or monotonic id is greater than the last high-water mark. This is far lighter than a full dump and works with plain SQL. The catches are real, though: it depends on every table having a reliable, always-updated timestamp column; it cannot see hard deletes (a deleted row simply stops appearing, so BigQuery keeps a ghost); and rows updated between your read and your checkpoint can be missed or double-counted. It's a solid middle ground when deletes are rare and freshness of a few minutes is enough.

Log-based CDC. The gold standard for real-time. Postgres already writes every change — insert, update, delete — to its write-ahead log (WAL). Change data capture reads that log through logical replication and streams each change out in order. You touch the production database only to read a log it's already writing, you capture deletes and updates natively, and lag drops to sub-second. The cost is operational complexity: replication slots, publications, and careful checkpointing. If you want the mechanics in depth, see our explainer on change data capture.

If you need deletes, ordering, and freshness measured in seconds, log-based CDC is the only approach that gives you all three.

Handling the hard parts

Whichever approach you pick, the same problems show up once real data starts flowing.

A reference real-time setup

Here's the shape of a log-based Postgres to BigQuery pipeline, end to end:

That's the whole loop: snapshot once, stream forever, checkpoint after every commit, and let a view give consumers the clean picture.

Common pitfalls

How Blunox fits in

Everything above is doable by hand — and plenty of teams do build it. It's also exactly the kind of undifferentiated plumbing that's easy to get subtly wrong and tedious to operate. Blunox Pulse does log-based Postgres to BigQuery replication as a managed service: sub-second lag, exactly-once delivery, automatic initial snapshot and backfill, and schema-drift awareness so DDL changes flow through instead of breaking the pipeline. Slots, checkpointing, TOAST handling, and type mapping are handled for you, so your team gets a correct, current BigQuery copy without owning the replication machinery.