Guide

How to replicate Postgres to BigQuery in real time

Moving Postgres to BigQuery sounds like a plumbing problem, but the details — deletes, schema drift, type mapping, and lag — are where most pipelines quietly break. Here's how the approaches actually compare and what a real-time setup looks like.

8 min read · Blunox

You have an operational Postgres database serving your app, and you want that data in BigQuery for analytics. On paper, replicating Postgres to BigQuery is straightforward: read the rows, write them to a table. In practice, the interesting decisions are about how fresh the data needs to be, how you handle changes over time, and how much load you're willing to put on your production database to get it there.

Why replicate Postgres to BigQuery

Postgres is a transactional database. It's tuned for many small, concurrent reads and writes that keep your product running. BigQuery is an analytical warehouse, tuned for scanning billions of rows to answer one big question. Trying to run dashboards, ad-hoc analysis, and ML feature pipelines directly against production Postgres is a recipe for lock contention, slow queries, and the occasional 2 a.m. page.

Replication solves this by giving analysts and models their own copy in BigQuery — one they can hammer with expensive queries without touching the database that keeps your app alive. The goal is a warehouse copy that's correct, current, and cheap to keep in sync. The three approaches below hit those goals to very different degrees.

The three approaches, and their trade-offs

Periodic dump and load. The simplest option: run a full extract (`pg_dump`, `COPY`, or a `SELECT *`), drop it in Cloud Storage, and load it into BigQuery on a schedule. It's easy to reason about and needs no special Postgres configuration. But it re-reads the entire table every run, so it's slow and expensive at scale, it hammers your production database during the extract, and "real time" means "as fresh as your last nightly job." Fine for small, slowly-changing tables; painful for anything large or busy.

Query-based incremental. Instead of the whole table, you pull only rows where an updated_at or monotonic id is greater than the last high-water mark. This is far lighter than a full dump and works with plain SQL. The catches are real, though: it depends on every table having a reliable, always-updated timestamp column; it cannot see hard deletes (a deleted row simply stops appearing, so BigQuery keeps a ghost); and rows updated between your read and your checkpoint can be missed or double-counted. It's a solid middle ground when deletes are rare and freshness of a few minutes is enough.

Log-based CDC. The gold standard for real-time. Postgres already writes every change — insert, update, delete — to its write-ahead log (WAL). Change data capture reads that log through logical replication and streams each change out in order. You touch the production database only to read a log it's already writing, you capture deletes and updates natively, and lag drops to sub-second. The cost is operational complexity: replication slots, publications, and careful checkpointing. If you want the mechanics in depth, see our explainer on change data capture.

If you need deletes, ordering, and freshness measured in seconds, log-based CDC is the only approach that gives you all three.

Handling the hard parts

Whichever approach you pick, the same problems show up once real data starts flowing.

Type mapping. Postgres and BigQuery don't share a type system. numeric maps to NUMERIC/BIGNUMERIC (watch precision), jsonb usually lands as JSON or a stringified column, timestamptz becomes TIMESTAMP (UTC) while naive timestamp becomes DATETIME, arrays become REPEATED fields, and uuid collapses to STRING. Getting these wrong silently corrupts data, so pin the mapping explicitly.
Deletes. BigQuery is append-optimized and dislikes row-level mutation. The common pattern is soft deletes: keep every version of a row with an _op (insert/update/delete) and an _ingested_at column, then resolve the current state with a view that takes the latest row per primary key. You get full history and a correct "live" view.
Schema drift and DDL. Someone will add a column, widen a type, or rename a field. A rigid pipeline breaks; a good one detects the DDL and evolves the BigQuery schema — adding nullable columns, relaxing modes — without a manual migration.
Initial snapshot and backfill. CDC only sees changes from the moment it starts. You need a consistent snapshot of existing rows, then a clean hand-off to the streaming change feed with no gap and no double-load. Doing this correctly — snapshot at a known log position, then replay changes from exactly that point — is the subtle heart of a good pipeline.
Exactly-once. Networks retry, processes crash. Without care you get duplicates. The fix is idempotent writes keyed on primary key plus log sequence number, so replaying the same change is a no-op rather than a second row.

A reference real-time setup

Here's the shape of a log-based Postgres to BigQuery pipeline, end to end:

Configure Postgres. Set wal_level = logical, create a PUBLICATION for the tables you want, and create a logical replication slot. The slot is what guarantees Postgres retains WAL until your consumer has processed it.
Take a consistent snapshot. Export current rows at the log position the slot was created at, so the snapshot and the change stream line up exactly with no overlap or gap.
Stream changes. A consumer reads the slot (via a logical decoding plugin like pgoutput), turns each WAL record into a typed change event, and writes it to BigQuery through the Storage Write API for low-latency, exactly-once ingestion.
Checkpoint. After each batch commits to BigQuery, advance the slot's confirmed position (flush_lsn). This lets Postgres reclaim WAL and, on restart, tells you exactly where to resume.
Resolve current state. Land changes in an append-only staging table, then expose a view (or periodic merge) that returns the latest row per primary key with deletes filtered out.

That's the whole loop: snapshot once, stream forever, checkpoint after every commit, and let a view give consumers the clean picture.

Common pitfalls

Replication slot bloat. If your consumer stalls or falls behind, Postgres cannot discard WAL — the slot pins it. Disk fills, and eventually the database stops accepting writes. Monitor slot lag and alert early; an unmonitored slot is a production outage waiting to happen.
Large TOAST values. Postgres stores oversized values (big text, JSON, bytea) out-of-line in TOAST. In logical decoding, an unchanged TOASTed column can arrive as a placeholder rather than its value, so a naive consumer writes nulls. You have to detect and handle unchanged-toast columns explicitly.
DDL changes. Logical replication streams data changes, not schema changes. An ALTER TABLE upstream won't flow through automatically — your pipeline has to notice the shape changed and evolve the BigQuery schema to match, or new columns silently vanish.
Lag creep. Even a healthy pipeline drifts if throughput exceeds your write path. Watch end-to-end lag (commit time in Postgres to visibility in BigQuery), not just slot position, and size your ingestion for peak write volume, not average.

How Blunox fits in

Everything above is doable by hand — and plenty of teams do build it. It's also exactly the kind of undifferentiated plumbing that's easy to get subtly wrong and tedious to operate. Blunox Pulse does log-based Postgres to BigQuery replication as a managed service: sub-second lag, exactly-once delivery, automatic initial snapshot and backfill, and schema-drift awareness so DDL changes flow through instead of breaking the pipeline. Slots, checkpointing, TOAST handling, and type mapping are handled for you, so your team gets a correct, current BigQuery copy without owning the replication machinery.

Book a demo Explore Blunox Pulse