Contents
Architecture
1. Split of responsibilities
┌─────────────────────────── node A (LEADER / primary) ──────────────────────────┐
│ Postgres │
│ ├─ pg_replica extension (.so) │
│ │ ├─ background worker: Raft node ◄────raft RPC────┐ │
│ │ │ • election, membership, cluster state │ │
│ │ │ • health checks, failover state machine │ │
│ │ │ • executes promote / rewind / reconfigure │ │
│ │ └─ SQL API: pg_replica.status() / .members() / .failover() / .add_node() │
│ └─ WAL ──────────────physical streaming replication──────────────┐ │
└──────────────────────────────────────────────────────────────────── │ ──────────┘
▲ raft RPC │ WAL
│ ▼
┌──────────── node B (follower / standby) ──┐ ┌──────── node C (follower / standby) ──┐
│ Postgres + pg_replica bgworker (Raft) │ │ Postgres + pg_replica bgworker (Raft) │
│ recovery mode, primary_conninfo → A │ │ recovery mode, primary_conninfo → A │
└───────────────────────────────────────────┘ └───────────────────────────────────────┘
supervisor (systemd / docker restart=always) keeps each Postgres process alive; pg_replica decides its ROLE.
clients reach the primary via libpq multi-host (target_session_attrs=read-write) or an HAProxy that reads pg_replica state.
Two planes, deliberately separate:
Control plane (Raft, tiny): elects the leader, tracks membership and the authoritative “current primary + replication topology,” and drives failover. The Raft log only ever carries small control entries (membership changes, leader term, failover decisions, fencing tokens). Megabytes, not gigabytes.
Data plane (Postgres WAL, large): the actual rows/indexes/roles/DDL move over Postgres’s own physical streaming replication. We configure it; we do not reimplement it. This is what makes “replicate everything including roles and DDL” true for free.
See DECISIONS.md D1 for why Raft must not carry the data itself.
2. Components
2.1 pg_replica extension (Rust, via pgrx)
Built with pgrx so it stays in
the same toolchain as ParadeDB-style extensions. Two parts:
Background worker
pg_replica supervisor— registered viaRegisterBackgroundWorkeratshared_preload_libraries. Hosts:- the Raft node (
openraft, an async Raft) on a small embedded tokio runtime, with a peer transport (TCP, length-prefixed JSON RPC) that also multiplexes LSN gossip; - a durable Raft log + snapshot store in
$PGDATA/pg_replica/(separate from Postgres WAL); - the health monitor (heartbeats + libpq
SELECT 1probes of peers); - the failover state machine (detect → elect → choose → fence → promote → reconfigure → rejoin);
- executors that call
pg_promote(), editpostgresql.auto.conf(primary_conninfo), manage replication slots, and shell out topg_basebackup/pg_rewind.
- the Raft node (
SQL surface (functions + views):
pg_replica.status()→ this node’s role, term, leader, LSNs, lag.pg_replica.members()→ cluster membership + health.pg_replica.failover([target])→ operator-initiated switchover.pg_replica.add_node(dsn)/pg_replica.remove_node(id)→ membership.- GUCs:
pg_replica.peers,pg_replica.raft_port,pg_replica.node_id,pg_replica.synchronous,pg_replica.failover_timeout_ms,pg_replica.bootstrap.
2.2 The supervisor you already have
pg_replica does not start/stop the Postgres process — a chicken/egg an
in-Postgres worker can’t solve. That job stays with systemd or the Docker
restart policy (restart: always). pg_replica decides each running node’s
role; the OS supervisor guarantees the process keeps trying to run. This is the
honest reason it can be “just an extension” and still do failover (see
DECISIONS.md D3).
2.3 Client routing (not built, recommended)
pg_replica only publishes who the primary is. Pick one:
- libpq multi-host: host=A,B,C ... target_session_attrs=read-write — driver finds the writable node. Zero infra.
- HAProxy with an httpchk against pg_replica’s tiny health endpoint (/primary → 200 only on the leader).
- Floating/virtual IP moved by the leader on promotion.
3. Failover flow (primary dies)
- Detect. Followers' health monitors miss
pg_replica.failover_timeout_msof heartbeats from the leader. Raft leadership lease expires. - Elect (control plane). Surviving nodes hold a Raft election; only the majority partition can elect — a minority (incl. an isolated old primary) cannot, which is what prevents split-brain.
- Self-fence the loser. A primary that loses quorum demotes itself (drops to read-only / shuts down) via a loss-of-quorum watchdog, before a new one is promoted. A monotonic fencing token (Raft term) guards against a paused old primary resuming.
- Choose the most-advanced replica. The new Raft leader queries each
survivor’s
pg_last_wal_replay_lsn()/pg_last_wal_receive_lsn()and picks the highest, minimizing data loss. Decision is committed to the Raft log. - Promote.
SELECT pg_promote()on the chosen standby. - Reconfigure the rest. Remaining standbys get
primary_conninforepointed at the new primary; slots recreated; reload. - Rejoin the loser. When the old primary returns,
pg_rewindit against the new primary to discard diverged WAL, then start it as a standby. - Publish. New primary recorded in Raft state; routing layer (HAProxy/VIP) follows.
4. Data-loss posture (sync vs async)
Configurable via pg_replica.synchronous:
- async (default): lowest latency; failover may lose the last unreplicated
commits (bounded by replica lag).
- quorum-sync: sets synchronous_standby_names = 'ANY 1 (nodeB,nodeC)' so a
commit is acked by ≥1 standby → failover loses nothing. Cost: write latency,
and writes stall if too few sync standbys are up. pg_replica keeps
synchronous_standby_names in step with live membership so a single failure
never wedges writes.
5. The hard problems (and our stance)
| Problem | Stance |
|---|---|
| Split-brain | Raft majority quorum + self-fence on quorum loss + fencing token (term). Never two writable primaries. |
| Promoting a stale replica | Always pick highest LSN among survivors; pg_rewind the others. |
| Old primary resurrecting | Loss-of-quorum watchdog demotes it; term-based fencing token rejects its stale writes/role. |
| Quorum math | Odd cluster sizes (3 → tolerate 1, 5 → tolerate 2). 2-node HA is impossible safely; document it. |
| Raft node dies with Postgres | Fine: a down node doesn’t need to vote; survivors keep quorum. OS supervisor restarts Postgres → bgworker rejoins. |
| Hung (not dead) Postgres | Watchdog timer; if the bgworker can’t make progress, it stops accepting the leader role. |
| Bootstrap / new replica | pg_basebackup from the leader, register in Raft, stream. |
| Network partition flapping | Election timeouts + leader leases + openraft’s pre-vote to avoid term churn. |
6. On-disk / network footprint
- Raft state:
$PGDATA/pg_replica/{raft-log, snapshot, meta}— small. - Peer transport: one TCP port per node (
pg_replica.raft_port), mTLS optional. - No external store, no JVM, no Go control plane. Memory target: a few MB of resident bgworker overhead beyond Postgres itself.