Architecture

1. Split of responsibilities

            ┌─────────────────────────── node A (LEADER / primary) ──────────────────────────┐
            │  Postgres                                                                        │
            │   ├─ pg_replica extension (.so)                                                  │
            │   │    ├─ background worker: Raft node  ◄────raft RPC────┐                       │
            │   │    │     • election, membership, cluster state        │                      │
            │   │    │     • health checks, failover state machine      │                      │
            │   │    │     • executes promote / rewind / reconfigure     │                     │
            │   │    └─ SQL API:  pg_replica.status() / .members() / .failover() / .add_node() │
            │   └─ WAL  ──────────────physical streaming replication──────────────┐           │
            └──────────────────────────────────────────────────────────────────── │ ──────────┘
                         ▲ raft RPC                                                 │ WAL
                         │                                                          ▼
            ┌──────────── node B (follower / standby) ──┐   ┌──────── node C (follower / standby) ──┐
            │ Postgres + pg_replica bgworker (Raft)     │   │ Postgres + pg_replica bgworker (Raft) │
            │ recovery mode, primary_conninfo → A       │   │ recovery mode, primary_conninfo → A   │
            └───────────────────────────────────────────┘   └───────────────────────────────────────┘

  supervisor (systemd / docker restart=always) keeps each Postgres process alive; pg_replica decides its ROLE.
  clients reach the primary via libpq multi-host (target_session_attrs=read-write) or an HAProxy that reads pg_replica state.

Two planes, deliberately separate:

  • Control plane (Raft, tiny): elects the leader, tracks membership and the authoritative “current primary + replication topology,” and drives failover. The Raft log only ever carries small control entries (membership changes, leader term, failover decisions, fencing tokens). Megabytes, not gigabytes.

  • Data plane (Postgres WAL, large): the actual rows/indexes/roles/DDL move over Postgres’s own physical streaming replication. We configure it; we do not reimplement it. This is what makes “replicate everything including roles and DDL” true for free.

See DECISIONS.md D1 for why Raft must not carry the data itself.
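
A minimal sketch of what "small control entries" could look like, to make the size difference between the two planes concrete. The type and variant names (ControlEntry, ClusterState, PrimaryChanged) are hypothetical and not taken from the pg_replica source; the term guard is an assumed defensive check.

```rust
use std::collections::BTreeMap;

/// Hypothetical control-plane log entry. Each variant is a few dozen bytes;
/// no table data ever travels through Raft.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ControlEntry {
    AddNode { node_id: u64, dsn: String },
    RemoveNode { node_id: u64 },
    /// Failover decision; `term` doubles as the fencing token.
    PrimaryChanged { term: u64, new_primary: u64 },
}

/// Hypothetical replicated state, built by applying entries in log order.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct ClusterState {
    pub members: BTreeMap<u64, String>,
    pub primary: Option<u64>,
    pub term: u64,
}

impl ClusterState {
    pub fn apply(&mut self, entry: &ControlEntry) {
        match entry {
            ControlEntry::AddNode { node_id, dsn } => {
                self.members.insert(*node_id, dsn.clone());
            }
            ControlEntry::RemoveNode { node_id } => {
                self.members.remove(node_id);
                // A removed node can no longer be the recorded primary.
                if self.primary == Some(*node_id) {
                    self.primary = None;
                }
            }
            ControlEntry::PrimaryChanged { term, new_primary } => {
                // Defensive guard (assumption): ignore a decision stamped
                // with a term older than the one already applied.
                if *term >= self.term {
                    self.term = *term;
                    self.primary = Some(*new_primary);
                }
            }
        }
    }
}
```

Everything the control plane needs to agree on fits in a map and two integers, which is why the Raft log stays small regardless of database size.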

2. Components

2.1 pg_replica extension (Rust, via pgrx)

Built with pgrx so it stays in the same toolchain as ParadeDB-style extensions. Two parts:

  1. Background worker (the pg_replica supervisor): registered via RegisterBackgroundWorker when the library is loaded through shared_preload_libraries. Hosts:

    • the Raft node (openraft, an async Raft) on a small embedded tokio runtime, with a peer transport (TCP, length-prefixed JSON RPC) that also multiplexes LSN gossip;
    • a durable Raft log + snapshot store in $PGDATA/pg_replica/ (separate from Postgres WAL);
    • the health monitor (heartbeats + libpq SELECT 1 probes of peers);
    • the failover state machine (detect → elect → choose → fence → promote → reconfigure → rejoin);
    • executors that call pg_promote(), edit postgresql.auto.conf (primary_conninfo), manage replication slots, and shell out to pg_basebackup / pg_rewind.
  2. SQL surface (functions + views):

    • pg_replica.status() → this node’s role, term, leader, LSNs, lag.
    • pg_replica.members() → cluster membership + health.
    • pg_replica.failover([target]) → operator-initiated switchover.
    • pg_replica.add_node(dsn) / pg_replica.remove_node(id) → membership.
    • GUCs: pg_replica.peers, pg_replica.raft_port, pg_replica.node_id, pg_replica.synchronous, pg_replica.failover_timeout_ms, pg_replica.bootstrap.
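
The peer transport above is described only as "TCP, length-prefixed JSON RPC". A minimal framing sketch follows, assuming a 4-byte big-endian length prefix and a 1 MiB frame cap. Both are illustrative choices, and write_frame / read_frame are hypothetical names. The payload is treated as opaque bytes; JSON encoding is left to whatever serializer the worker uses.

```rust
use std::io::{self, Read, Write};

/// Assumed upper bound on a single control message (1 MiB).
const MAX_FRAME: u32 = 1 << 20;

/// Write one frame: 4-byte big-endian length, then the payload bytes.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    let len = payload.len() as u32;
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)?;
    Ok(())
}

/// Read one frame; fails on a short read or an oversized length prefix.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    r.read_exact(&mut header)?;
    let len = u32::from_be_bytes(header);
    if len > MAX_FRAME {
        // Reject before allocating, so a corrupt peer cannot force a huge buffer.
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut buf = vec![0u8; len as usize];
    r.read_exact(&mut buf)?;
    Ok(buf)
}
```

Because both functions are generic over Read / Write, the same code works against a TcpStream in the worker and an in-memory buffer in tests.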

2.2 The supervisor you already have

pg_replica does not start or stop the Postgres process: a worker running inside Postgres cannot restart the process that hosts it. That job stays with systemd or the Docker restart policy (restart: always). pg_replica decides each running node’s role; the OS supervisor guarantees the process keeps trying to run. This division is why it can be “just an extension” and still do failover (see DECISIONS.md D3).

2.3 Client routing (not built, recommended)

pg_replica only publishes who the primary is. Pick one:

  • libpq multi-host: host=A,B,C ... target_session_attrs=read-write. The driver finds the writable node. Zero infra.
  • HAProxy with an httpchk against pg_replica’s tiny health endpoint (/primary → 200 only on the leader).
  • Floating/virtual IP moved by the leader on promotion.
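
For the libpq multi-host option, a small sketch of building the connection string from a node list. The helper name multi_host_conninfo is hypothetical, and it assumes host names and the database name need no quoting. host, port, dbname and target_session_attrs are standard libpq keywords.

```rust
/// Build a libpq multi-host conninfo string that only accepts a writable node.
/// Assumes every value is a plain token (no spaces or quotes to escape).
pub fn multi_host_conninfo(nodes: &[(&str, u16)], dbname: &str) -> String {
    let hosts: Vec<String> = nodes.iter().map(|(h, _)| h.to_string()).collect();
    let ports: Vec<String> = nodes.iter().map(|(_, p)| p.to_string()).collect();
    format!(
        "host={} port={} dbname={} target_session_attrs=read-write",
        hosts.join(","),
        ports.join(","),
        dbname
    )
}
```

libpq tries the listed hosts in order and keeps the first connection that accepts read-write sessions, so after a failover clients reconnect to the new primary without any change to the string.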

3. Failover flow (primary dies)

  1. Detect. Followers’ health monitors see no heartbeat from the leader for pg_replica.failover_timeout_ms, and the leader’s Raft lease expires.
  2. Elect (control plane). Surviving nodes hold a Raft election; only the majority partition can elect — a minority (incl. an isolated old primary) cannot, which is what prevents split-brain.
  3. Self-fence the loser. A primary that loses quorum demotes itself (drops to read-only / shuts down) via a loss-of-quorum watchdog, before a new one is promoted. A monotonic fencing token (Raft term) guards against a paused old primary resuming.
  4. Choose the most-advanced replica. The new Raft leader queries each survivor’s pg_last_wal_replay_lsn() / pg_last_wal_receive_lsn() and picks the highest, minimizing data loss. Decision is committed to the Raft log.
  5. Promote. SELECT pg_promote() on the chosen standby.
  6. Reconfigure the rest. Remaining standbys get primary_conninfo repointed at the new primary; slots recreated; reload.
  7. Rejoin the loser. When the old primary returns, pg_rewind it against the new primary to discard diverged WAL, then start it as a standby.
  8. Publish. New primary recorded in Raft state; routing layer (HAProxy/VIP) follows.
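
Step 4 can be sketched as follows. Postgres reports LSNs as text in hi/lo hexadecimal form (for example 0/3000060), so they must be parsed before comparison. The Survivor struct and both function names are hypothetical. Two choices here are assumptions rather than documented behaviour: each node is ranked by the larger of its receive and replay LSN (received WAL is replayed during promotion), and ties go to the lowest node id.

```rust
/// Parse a Postgres LSN ("hi/lo", both hexadecimal) into a 64-bit WAL position.
pub fn parse_lsn(text: &str) -> Option<u64> {
    let (hi, lo) = text.trim().split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some(((hi as u64) << 32) | lo as u64)
}

/// What the new Raft leader collected from one surviving standby.
/// Either LSN may be missing (the SQL functions can return NULL).
pub struct Survivor {
    pub node_id: u64,
    pub receive_lsn: Option<String>, // pg_last_wal_receive_lsn()
    pub replay_lsn: Option<String>,  // pg_last_wal_replay_lsn()
}

/// Pick the most advanced survivor; None if no node reported a usable LSN.
pub fn choose_promotion_target(survivors: &[Survivor]) -> Option<u64> {
    survivors
        .iter()
        .filter_map(|s| {
            let recv = s.receive_lsn.as_deref().and_then(parse_lsn);
            let replay = s.replay_lsn.as_deref().and_then(parse_lsn);
            // Option orders None below Some, so this keeps whichever exists
            // and the larger one when both do.
            recv.max(replay).map(|lsn| (lsn, s.node_id))
        })
        // Highest LSN wins; on a tie, the lowest node id (assumed tie-break).
        .max_by_key(|&(lsn, id)| (lsn, std::cmp::Reverse(id)))
        .map(|(_, id)| id)
}
```

A deterministic tie-break matters because the decision is committed to the Raft log: every node replaying that entry must agree on the same target.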

4. Data-loss posture (sync vs async)

Configurable via pg_replica.synchronous:

  • async (default): lowest latency; failover may lose the last unreplicated commits (bounded by replica lag).
  • quorum-sync: sets synchronous_standby_names = 'ANY 1 (nodeB,nodeC)' so a commit is acked by ≥1 standby → failover loses no acknowledged commits. Cost: write latency, and writes stall if too few sync standbys are up. pg_replica keeps synchronous_standby_names in step with live membership so a single failure never wedges writes.
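
A sketch of how the setting could be regenerated from live membership. The function name and the fallback policy are assumptions: when no standby is live it returns an empty string, which Postgres treats as "no synchronous standbys", so writes continue at the price of the durability guarantee. Standby names are assumed to be simple identifiers that need no quoting.

```rust
/// Render synchronous_standby_names for quorum-sync mode.
/// `live_standbys` are the application_names of healthy standbys;
/// `required_acks` is how many of them must confirm each commit.
pub fn synchronous_standby_names(live_standbys: &[&str], required_acks: usize) -> String {
    if live_standbys.is_empty() || required_acks == 0 {
        // Assumed policy: fall back to asynchronous rather than stall writes.
        return String::new();
    }
    // Never demand more acks than there are live standbys.
    let acks = required_acks.min(live_standbys.len());
    format!("ANY {} ({})", acks, live_standbys.join(","))
}
```

With both standbys up this yields the value shown above; after nodeB fails it shrinks to ANY 1 (nodeC), and commits keep flowing.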

5. The hard problems (and our stance)

Problem                       Stance
----------------------------  ------------------------------------------------------------
Split-brain                   Raft majority quorum + self-fence on quorum loss + fencing token (term). Never two writable primaries.
Promoting a stale replica     Always pick highest LSN among survivors; pg_rewind the others.
Old primary resurrecting      Loss-of-quorum watchdog demotes it; term-based fencing token rejects its stale writes/role.
Quorum math                   Odd cluster sizes (3 → tolerate 1, 5 → tolerate 2). 2-node HA is impossible safely; document it.
Raft node dies with Postgres  Fine: a down node doesn’t need to vote; survivors keep quorum. OS supervisor restarts Postgres → bgworker rejoins.
Hung (not dead) Postgres      Watchdog timer; if the bgworker can’t make progress, it stops accepting the leader role.
Bootstrap / new replica       pg_basebackup from the leader, register in Raft, stream.
Network partition flapping    Election timeouts + leader leases + openraft’s pre-vote to avoid term churn.
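
The quorum arithmetic behind the table, as a short sketch. The function names are illustrative; the formulas are the standard Raft majority rule.

```rust
/// Votes needed to win an election or commit an entry in an n-node cluster.
pub fn majority(n: usize) -> usize {
    n / 2 + 1
}

/// How many nodes can fail while the rest still form a majority.
pub fn tolerated_failures(n: usize) -> usize {
    n.saturating_sub(majority(n))
}

/// True if a node that can reach `reachable` members (itself included)
/// is on the majority side of a partition.
pub fn has_quorum(cluster_size: usize, reachable: usize) -> bool {
    reachable >= majority(cluster_size)
}
```

For 2 nodes the majority is 2 and the tolerated failure count is 0, which is why a 2-node cluster cannot fail over safely. Going from 3 to 4 nodes raises the majority to 3 without tolerating any more failures, hence the preference for odd sizes.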

6. On-disk / network footprint

  • Raft state: $PGDATA/pg_replica/{raft-log, snapshot, meta} — small.
  • Peer transport: one TCP port per node (pg_replica.raft_port), mTLS optional.
  • No external store, no JVM, no Go control plane. Memory target: a few MB of resident bgworker overhead beyond Postgres itself.