Architecture

Architecture

1. Split of responsibilities

            ┌─────────────────────────── node A (LEADER / primary) ──────────────────────────┐
            │  Postgres                                                                        │
            │   ├─ pg_replica extension (.so)                                                  │
            │   │    ├─ background worker: Raft node  ◄────raft RPC────┐                       │
            │   │    │     • election, membership, cluster state        │                      │
            │   │    │     • health checks, failover state machine      │                      │
            │   │    │     • executes promote / rewind / reconfigure     │                     │
            │   │    └─ SQL API:  pg_replica.status() / .members() / .failover() / .add_node() │
            │   └─ WAL  ──────────────physical streaming replication──────────────┐           │
            └──────────────────────────────────────────────────────────────────── │ ──────────┘
                         ▲ raft RPC                                                 │ WAL
                         │                                                          ▼
            ┌──────────── node B (follower / standby) ──┐   ┌──────── node C (follower / standby) ──┐
            │ Postgres + pg_replica bgworker (Raft)     │   │ Postgres + pg_replica bgworker (Raft) │
            │ recovery mode, primary_conninfo → A       │   │ recovery mode, primary_conninfo → A   │
            └───────────────────────────────────────────┘   └───────────────────────────────────────┘

  supervisor (systemd / docker restart=always) keeps each Postgres process alive; pg_replica decides its ROLE.
  clients reach the primary via libpq multi-host (target_session_attrs=read-write) or an HAProxy that reads pg_replica state.

Two planes, deliberately separate:

Control plane (Raft, tiny): elects the leader, tracks membership and the authoritative “current primary + replication topology,” and drives failover. The Raft log only ever carries small control entries (membership changes, leader term, failover decisions, fencing tokens). Megabytes, not gigabytes.
Data plane (Postgres WAL, large): the actual rows/indexes/roles/DDL move over Postgres’s own physical streaming replication. We configure it; we do not reimplement it. This is what makes “replicate everything including roles and DDL” true for free.

See DECISIONS.md D1 for why Raft must not carry the data itself.

2. Components

2.1 `pg_replica` extension (Rust, via `pgrx`)

Built with pgrx so it stays in the same toolchain as ParadeDB-style extensions. Two parts:

Background worker pg_replica supervisor — registered via RegisterBackgroundWorker at shared_preload_libraries. Hosts:
- the Raft node (openraft, an async Raft) on a small embedded tokio runtime, with a peer transport (TCP, length-prefixed JSON RPC) that also multiplexes LSN gossip;
- a durable Raft log + snapshot store in $PGDATA/pg_replica/ (separate from Postgres WAL);
- the health monitor (heartbeats + libpq SELECT 1 probes of peers);
- the failover state machine (detect → elect → choose → fence → promote → reconfigure → rejoin);
- executors that call pg_promote(), edit postgresql.auto.conf (primary_conninfo), manage replication slots, and shell out to pg_basebackup / pg_rewind.
SQL surface (functions + views):
- pg_replica.status() → this node’s role, term, leader, LSNs, lag.
- pg_replica.members() → cluster membership + health.
- pg_replica.failover([target]) → operator-initiated switchover.
- pg_replica.add_node(dsn) / pg_replica.remove_node(id) → membership.
- GUCs: pg_replica.peers, pg_replica.raft_port, pg_replica.node_id, pg_replica.synchronous, pg_replica.failover_timeout_ms, pg_replica.bootstrap.

2.2 The supervisor you already have

pg_replica does not start/stop the Postgres process — a chicken/egg an in-Postgres worker can’t solve. That job stays with systemd or the Docker restart policy (restart: always). pg_replica decides each running node’s role; the OS supervisor guarantees the process keeps trying to run. This is the honest reason it can be “just an extension” and still do failover (see DECISIONS.md D3).

2.3 Client routing (not built, recommended)

pg_replica only publishes who the primary is. Pick one: - libpq multi-host: host=A,B,C ... target_session_attrs=read-write — driver finds the writable node. Zero infra. - HAProxy with an httpchk against pg_replica’s tiny health endpoint (/primary → 200 only on the leader). - Floating/virtual IP moved by the leader on promotion.

3. Failover flow (primary dies)

Detect. Followers' health monitors miss pg_replica.failover_timeout_ms of heartbeats from the leader. Raft leadership lease expires.
Elect (control plane). Surviving nodes hold a Raft election; only the majority partition can elect — a minority (incl. an isolated old primary) cannot, which is what prevents split-brain.
Self-fence the loser. A primary that loses quorum demotes itself (drops to read-only / shuts down) via a loss-of-quorum watchdog, before a new one is promoted. A monotonic fencing token (Raft term) guards against a paused old primary resuming.
Choose the most-advanced replica. The new Raft leader queries each survivor’s pg_last_wal_replay_lsn() / pg_last_wal_receive_lsn() and picks the highest, minimizing data loss. Decision is committed to the Raft log.
Promote. SELECT pg_promote() on the chosen standby.
Reconfigure the rest. Remaining standbys get primary_conninfo repointed at the new primary; slots recreated; reload.
Rejoin the loser. When the old primary returns, pg_rewind it against the new primary to discard diverged WAL, then start it as a standby.
Publish. New primary recorded in Raft state; routing layer (HAProxy/VIP) follows.

4. Data-loss posture (sync vs async)

Configurable via pg_replica.synchronous: - async (default): lowest latency; failover may lose the last unreplicated commits (bounded by replica lag). - quorum-sync: sets synchronous_standby_names = 'ANY 1 (nodeB,nodeC)' so a commit is acked by ≥1 standby → failover loses nothing. Cost: write latency, and writes stall if too few sync standbys are up. pg_replica keeps synchronous_standby_names in step with live membership so a single failure never wedges writes.

5. The hard problems (and our stance)

Problem	Stance
Split-brain	Raft majority quorum + self-fence on quorum loss + fencing token (term). Never two writable primaries.
Promoting a stale replica	Always pick highest LSN among survivors; `pg_rewind` the others.
Old primary resurrecting	Loss-of-quorum watchdog demotes it; term-based fencing token rejects its stale writes/role.
Quorum math	Odd cluster sizes (3 → tolerate 1, 5 → tolerate 2). 2-node HA is impossible safely; document it.
Raft node dies with Postgres	Fine: a down node doesn’t need to vote; survivors keep quorum. OS supervisor restarts Postgres → bgworker rejoins.
Hung (not dead) Postgres	Watchdog timer; if the bgworker can’t make progress, it stops accepting the leader role.
Bootstrap / new replica	`pg_basebackup` from the leader, register in Raft, stream.
Network partition flapping	Election timeouts + leader leases + openraft’s pre-vote to avoid term churn.

6. On-disk / network footprint

Raft state: $PGDATA/pg_replica/{raft-log, snapshot, meta} — small.
Peer transport: one TCP port per node (pg_replica.raft_port), mTLS optional.
No external store, no JVM, no Go control plane. Memory target: a few MB of resident bgworker overhead beyond Postgres itself.

PGXN

PostgreSQL Extension Network

Contents

Architecture

1. Split of responsibilities

2. Components

2.1 `pg_replica` extension (Rust, via `pgrx`)

2.2 The supervisor you already have

2.3 Client routing (not built, recommended)

3. Failover flow (primary dies)

4. Data-loss posture (sync vs async)

5. The hard problems (and our stance)

6. On-disk / network footprint

PGXN

PostgreSQL Extension Network

Contents

Architecture

1. Split of responsibilities

2. Components

2.1 pg_replica extension (Rust, via pgrx)

2.2 The supervisor you already have

2.3 Client routing (not built, recommended)

3. Failover flow (primary dies)

4. Data-loss posture (sync vs async)

5. The hard problems (and our stance)

6. On-disk / network footprint

2.1 `pg_replica` extension (Rust, via `pgrx`)