Durability & zero-loss guide

How close pg_replica (+ the hyperiondb-client pool) can get to “no lost writes or reads,” what configuration that requires, and the failure modes that no configuration removes.

“Zero loss” is really three separate guarantees, with three different mechanisms:

| Guarantee | Means | Achievable? |
| --- | --- | --- |
| Durability | every acked (committed) write survives a failover | Yes, within the cluster’s fault budget, with synchronous quorum commit |
| Availability | no failed requests during a failover | Not automatically: there is a ~5 s window; needs app-level retry + idempotency |
| Read correctness | no stale / split-brain reads | Yes from the primary; standby reads are eventually-consistent |

A. Durability — zero acked-write loss

A failover can only be lossless if every acked transaction is already on a node that can be promoted. That requires synchronous quorum commit.

Configuration

Set one GUC; pg_replica does the rest:

# postgresql.conf  (pg_replica.synchronous is a Postmaster GUC — needs a restart)
pg_replica.synchronous = on

When on, the primary continuously maintains

synchronous_standby_names = 'ANY <majority-1> (peer1, peer2, ...)'

so a COMMIT is not acked until the WAL is flushed on a majority of nodes. The sync quorum (the primary plus ANY majority-1 standbys) is a majority of the same member set Raft elects a leader from, so any majority that can win an election includes at least one node holding every acked write. Because the highest-LSN survivor is the one promoted (test-m3-lsn), every acked write is on whichever node wins the next election. test-quorum-consistency exists to prove these two quorums never drift apart.
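The overlap argument can be checked by brute force for small clusters. The following is a minimal sketch, not pg_replica code: it assumes node 0 is the failed primary, and the helper names `combinations` and `overlapHolds` are made up for illustration. It only shows that every possible electorate contains a node that confirmed the commit; picking that node as the winner is the job of highest-LSN promotion.

```typescript
// Brute-force check of the quorum-overlap argument.
// Assumption: node 0 is the primary and has failed; names are illustrative only.

// All subsets of `items` with exactly `k` elements.
function combinations<T>(items: T[], k: number): T[][] {
  if (k === 0) return [[]];
  if (items.length < k) return [];
  const head = items[0] as T;
  const rest = items.slice(1);
  const withHead = combinations(rest, k - 1).map((c) => [head, ...c]);
  return [...withHead, ...combinations(rest, k)];
}

// True if every set of `syncStandbys` confirming standbys shares at least one
// node with every surviving majority that could elect a new leader.
function overlapHolds(clusterSize: number, syncStandbys: number): boolean {
  const majority = Math.floor(clusterSize / 2) + 1;
  const standbys = Array.from({ length: clusterSize - 1 }, (_, i) => i + 1);
  const confirmedSets = combinations(standbys, syncStandbys);
  const electorates = combinations(standbys, majority);
  return confirmedSets.every((confirmed) =>
    electorates.every((voters) => voters.some((v) => confirmed.includes(v))),
  );
}
```

With these definitions, `overlapHolds(3, 1)` and `overlapHolds(5, 2)` hold, while `overlapHolds(5, 1)` does not: in a 5-node cluster, three survivors can form a majority without the single standby that confirmed the commit, which is why the quorum size is majority-1 rather than a fixed 1.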

| Cluster | synchronous_standby_names set by pg_replica | Tolerates (zero-loss) |
| --- | --- | --- |
| 3 nodes | ANY 1 (n2, n3) | 1 node/leader failure |
| 5 nodes | ANY 2 (n2, n3, n4, n5) | 2 simultaneous failures |
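The arithmetic behind the table is small enough to write out. This is a sketch of the calculation only; pg_replica maintains the setting itself, and the function names here (`majority`, `syncStandbyNames`, `toleratedFailures`) are hypothetical.

```typescript
// Quorum arithmetic behind the table above.
// Assumption: helper names are illustrative; pg_replica computes the real value.

// Smallest number of nodes that forms a majority of `clusterSize`.
function majority(clusterSize: number): number {
  return Math.floor(clusterSize / 2) + 1;
}

// Value of synchronous_standby_names for a primary with the given standby peers.
function syncStandbyNames(peers: string[]): string {
  if (peers.length === 0) {
    throw new Error("synchronous quorum commit needs at least one standby");
  }
  const clusterSize = peers.length + 1; // the peers plus the primary itself
  const k = majority(clusterSize) - 1; // standbys that must confirm each commit
  return `ANY ${k} (${peers.join(", ")})`;
}

// Number of simultaneous node failures that still leaves a majority standing.
function toleratedFailures(clusterSize: number): number {
  return clusterSize - majority(clusterSize);
}
```

For example, `syncStandbyNames(["n2", "n3"])` yields `ANY 1 (n2, n3)` and `toleratedFailures(5)` yields 2, matching the two rows of the table.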

Also required (do not weaken these):

  • synchronous_commit = on (the PostgreSQL default). Never local, off, or remote_write (remote_write acks before the standby fsyncs, so a standby crash can still lose it). Use remote_apply if you additionally want a committed row to be visible on the confirming standbys before the ack (helps read-your-writes; see section C).
  • data_checksums = on, chosen at initdb time, so silent disk corruption is detected, not replicated as truth. initdb enables checksums by default only from PostgreSQL 18; on older versions pass --data-checksums explicitly.

What synchronous mode costs

  • Latency: a commit waits for a standby flush (one extra network round-trip).
  • Availability: if fewer than majority-1 standbys are caught up, writes block until one returns. That is the correct trade — blocking beats acking a write that could be lost. (With pg_replica.synchronous = off you get async replication: lower latency, but a primary can ack a commit and crash before replicating it → that write is lost on failover.)

The residual durability risk

Even with fully synchronous replication, a 3-node ANY 1 cluster can still lose an acked write, but only if a fault impairs the sync-confirming standby at the same time as the primary, mid-failover (the “sync-ANY-k edge” noted in test-chaos). Clean single failovers are zero-loss; to survive two overlapping faults, run 5 nodes / ANY 2.


B. Availability — failed requests during a failover

A failover takes ~5 s to produce a writable new primary. During that window no node accepts writes, and the client cannot hide that for you:

  • hyperiondb-client retries the connection checkout with backoff up to acquireTimeoutMs (default 5000), then throws “no writable primary available after <ms>ms”.
  • It does not replay your SQL, only the connection acquisition. A transaction in flight when the primary dies fails, and your application must retry it.
  • Ambiguous COMMIT: if the connection drops after COMMIT was sent but before the ack arrives, the transaction may or may not have committed (and replicated). The client correctly reports an error, but a blind retry of a non-idempotent write could apply it twice. This is the in-doubt-transaction problem and cannot be solved at the pool layer.

To make failovers invisible to end users, the application needs all three:

  1. Retry on the typed no writable primary… error and on connection-drop errors.
  2. acquireTimeoutMs ≥ your failover time (raise it above ~5 s so a single failover surfaces as a delay, not an error).
  3. Idempotent writes, so retrying an ambiguous commit is safe — e.g. a client-generated UUID with INSERT … ON CONFLICT DO NOTHING, or an outbox/dedup key. Reads are inherently safe to retry.
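The three requirements above can be combined in a small wrapper. What follows is a sketch under stated assumptions, not the hyperiondb-client API: `Queryable`, `isRetryable`, `withRetry`, and `insertOrder` are hypothetical names, the `orders` table is invented, and the error matching (message text, SQLSTATE class 08, ECONNRESET) is a guess at how the typed and connection-drop errors might surface in your driver.

```typescript
// App-level retry around a write that is safe to repeat.
// Assumptions: all names here are hypothetical, and the error checks should be
// replaced by whatever typed errors your client version actually exposes.

interface Queryable {
  query(sql: string, params: unknown[]): Promise<unknown>;
}

// Decide whether an error is worth retrying (failover in progress or dropped link).
function isRetryable(err: unknown): boolean {
  const e = (err ?? {}) as { message?: string; code?: string };
  const noPrimary = (e.message ?? "").includes("no writable primary available");
  const code = e.code ?? "";
  // SQLSTATE class 08 is "connection exception"; ECONNRESET is a socket-level reset.
  return noPrimary || code.startsWith("08") || code === "ECONNRESET";
}

// Run `op`, retrying retryable failures with exponential backoff.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 200,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Idempotent insert: the caller generates `id` once, outside the retry loop,
// so a replay after an ambiguous COMMIT hits ON CONFLICT and changes nothing.
function insertOrder(db: Queryable, id: string, amount: number): Promise<unknown> {
  return withRetry(() =>
    db.query(
      "INSERT INTO orders (id, amount) VALUES ($1, $2) ON CONFLICT (id) DO NOTHING",
      [id, amount],
    ),
  );
}
```

The key design point is that the idempotency key is created before the first attempt. Generating a fresh UUID inside the retried closure would defeat ON CONFLICT and reintroduce the double-apply risk.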

C. Reads

Reads never get “lost” (they don’t mutate), but they can be stale or, without fencing, wrong:

  • A mode: 'read-only' / prefer-standby pool reads from standbys, which lag the primary. For read-your-writes consistency, read from the read-write pool (primary), or use synchronous_commit = remote_apply (makes a commit visible on the confirming standbys before it acks — still not on all standbys).
  • Split-brain reads are prevented: a demoted or minority-partitioned primary fences itself read-only (default_transaction_read_only = on), and the client’s checkout validation (SHOW transaction_read_only) evicts those connections — so you never read from a stale ex-primary as if it were current (test-m4-fence, test-m4-partition).
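The checkout validation described above can be pictured as follows. This is an illustrative sketch, not the client's actual implementation: the `Conn` interface and `checkoutWritable` are hypothetical, and only the `SHOW transaction_read_only` statement comes from this guide.

```typescript
// Checkout validation against a fenced ex-primary.
// Assumptions: `Conn` and `checkoutWritable` are hypothetical; the real pool's
// internals are not documented here.

interface Conn {
  query(sql: string): Promise<{ rows: Array<Record<string, string>> }>;
  close(): Promise<void>;
}

// Returns the first idle connection whose session is writable, closing
// (evicting) every connection that reports transaction_read_only = on.
async function checkoutWritable(idle: Conn[]): Promise<Conn | undefined> {
  while (idle.length > 0) {
    const conn = idle.shift() as Conn;
    const res = await conn.query("SHOW transaction_read_only");
    if (res.rows[0]?.transaction_read_only === "off") return conn;
    await conn.close(); // fenced primary or standby: drop it from the pool
  }
  return undefined; // nothing writable right now; the caller backs off and retries
}
```

A connection to a self-fenced node answers `on`, so it is closed rather than handed to the application, and the pool has to find the new primary before a write can proceed.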

Failure modes at a glance

| Layer | Failure | Effect | Mitigation |
| --- | --- | --- | --- |
| Postgres repl | async / weak synchronous_commit | acked write lost on failover | pg_replica.synchronous = on, keep synchronous_commit = on |
| pg_replica | quorum lost (≥2 of 3 down) | no leader → writes block | correct (safety > availability); 5 nodes to tolerate 2 |
| pg_replica | minority-partitioned primary | that side can’t write | self-fences → no split-brain |
| pg_replica | primary + sync standby fault together, mid-failover | rare acked-write loss | 5-node ANY 2; backups |
| client | failover window (~5 s) | queries fail | retry + raise acquireTimeoutMs |
| client | ambiguous COMMIT | unknown outcome | idempotent writes |
| client | read-only pool | stale reads | read the primary for strong reads |
| infra | supervisor doesn’t restart PG | capacity shrinks → quorum risk | reliable systemd / Docker restart policy |
| infra | single DC / disk corruption / bad SQL (DELETE w/o WHERE) | total or logical loss | multi-AZ, data_checksums, backups + PITR |

What this does not replace

Raft + synchronous replication protect against node and leader failure inside one cluster. They do nothing for:

  • Logical errors / bad migrations → you still need backups + PITR (pgBackRest, wal-g).
  • Correlated loss (one rack / AZ / DC / region) → spread nodes across AZs, or add a remote (optionally synchronous) standby, at a latency cost.
  • Bugs / operator error → backups, again.

Backups and PITR are an explicit non-goal of pg_replica (see the README) — they remain mandatory for real durability.


Verdict

  • Zero loss of acknowledged writes: achievable and tested for the failures pg_replica is built for (single node/leader failure, clean failover) iff pg_replica.synchronous = on with synchronous_commit = on. Not absolute: overlapping faults beyond the budget, correlated/DC loss, corruption, and logical errors still need 5-node clusters, multi-AZ, checksums, and backups.
  • Zero failed requests: only with app-level retry + idempotency; the ~5 s failover window is real and the pool will not replay statements for you.
  • Zero stale reads: read from the primary (or remote_apply); standby reads are eventually-consistent by design.

“Zero loss” end-to-end is a property of cluster config + application retry/idempotency + backups together — not of pg_replica or the client alone.

Validated by

| Property | Test (scripts/) |
| --- | --- |
| Quorum-sync = zero committed-transaction loss on failover | test-m7-sync |
| Sync quorum and Raft quorum name the same nodes | test-quorum-consistency |
| Continuous writer, faults injected, 0 split-brain + zero-loss for clean failovers | test-chaos |
| Highest-LSN survivor is promoted (no data loss) | test-m3-lsn |
| Minority primary self-fences read-only | test-m4-fence, test-m4-partition |
| Client follows the failover with only a reconnect | test-m6-routing |

See also ARCHITECTURE.md, DECISIONS.md, and the client guide CLIENT.md.