PLAN_UPGRADE_MIGRATIONS.md — Extension Upgrade Migrations

Status: Core infrastructure implemented; v0.9.0 F40 follow-through is in progress
Related: GAP_SQL_PHASE_7.md F40 (G8.3), PLAN_VERSIONING.md
Last updated: 2026-03-18

Current Status Snapshot

The upgrade-migration framework is already in place: hand-authored upgrade SQL, archived full-install SQL baselines, CI completeness validation, and true upgrade E2E tests all exist. For v0.9.0, F40 is now in the implementation follow-through phase rather than the design phase.

Completed for v0.9.0 so far:

  • Added the 0.8.0 -> 0.9.0 upgrade script.
  • Archived the generated full-install SQL baseline for 0.9.0.
  • Verified that upgrade E2E infrastructure can target 0.9.0.
  • Extended the fast completeness gate to validate new tables and indexes in addition to functions, views, and event triggers.

Still remaining before F40 is fully closed for v0.9.0:

  • Release finalization only: run cargo pgrx package, verify the output against sql/archive/pg_trickle--0.9.0.sql using the completeness script, then tag the release. This is an operational step, not a code change.

All code-level items are now done:

  • Column-drift detection added (CHECK 6 in check_upgrade_completeness.sh).
  • test_upgrade_v090_catalog_additions (L15) asserts the pgt_refresh_groups table structure and uniqueness constraint on fresh install.

1. Overview

pg_trickle installs SQL objects (functions, catalog tables, views, event triggers) and a shared library (.so / .dylib) into the user’s PostgreSQL cluster. When users upgrade, ALTER EXTENSION pg_trickle UPDATE must transition the SQL catalog from the old version to the new without data loss.

This plan defines the complete strategy for authoring, validating, and testing upgrade migrations. It addresses every known gap discovered during development, including a real production incident (2026-03-04) where three functions (list_sources, change_buffer_sizes, _signal_launcher_rescan) were missing from the upgrade path because the migration script was a no-op placeholder while the full install script included them automatically via pgrx codegen.


2. How PostgreSQL Extension Upgrades Work

-- User runs:
ALTER EXTENSION pg_trickle UPDATE;

-- PostgreSQL resolves the chain:
--   default_version in pg_trickle.control → say '0.3.0'
--   installed version                     → say '0.1.3'
-- PostgreSQL executes in order:
--   pg_trickle--0.1.3--0.2.0.sql
--   pg_trickle--0.2.0--0.3.0.sql

Key constraints:

  • Each upgrade script MUST be idempotent (safe to run on an already- upgraded database). Use CREATE OR REPLACE FUNCTION, DO $$ IF NOT EXISTS.
  • Scripts MUST NOT DROP anything a user might depend on without a deprecation cycle (see PLAN_VERSIONING.md §5).
  • The full install script (pg_trickle--X.Y.Z.sql) is auto-generated by cargo pgrx package. Upgrade scripts (pg_trickle--X--Y.sql) are hand-authored and live in sql/.
  • There is an inherent divergence risk: pgrx adds/changes functions in the full script automatically when #[pg_extern] annotations change, but those changes are invisible to the hand-authored upgrade script.

3. Architecture of the Upgrade System

3.1 File layout

sql/
├── pg_trickle--0.1.3--0.2.0.sql       # Hand-authored upgrade script
├── pg_trickle--0.2.0--0.3.0.sql       # (future)
└── ...

target/release/pg_trickle-pg18/usr/share/postgresql/18/extension/
├── pg_trickle--0.2.0.sql              # Auto-generated by cargo pgrx package
├── pg_trickle--0.1.3--0.2.0.sql       # Copied from sql/ at package time
└── pg_trickle.control

3.2 What the upgrade script must cover

Every upgrade script pg_trickle--X--Y.sql must handle all of the following categories, or explicitly document why they are not needed:

Category Detection method SQL pattern
New SQL functions Diff CREATE FUNCTION in full scripts CREATE OR REPLACE FUNCTION
Changed function signatures Diff parameter lists DROP FUNCTION ... ; CREATE OR REPLACE FUNCTION ...
Removed functions Deliberate deprecation DROP FUNCTION IF EXISTS ...
New catalog columns Diff CREATE TABLE / column lists DO $$ IF NOT EXISTS (... column) ALTER TABLE ADD COLUMN
Changed column types Manual review ALTER TABLE ... ALTER COLUMN ... TYPE ...
New indexes Diff CREATE INDEX CREATE INDEX IF NOT EXISTS
New views Diff CREATE VIEW CREATE OR REPLACE VIEW
Changed views Diff view definitions CREATE OR REPLACE VIEW
New event triggers Diff CREATE EVENT TRIGGER CREATE EVENT TRIGGER ... IF NOT EXISTS
GUC changes Manual (Rust code only) N/A (GUCs are registered at .so load time)

3.3 What the upgrade script must NOT touch

  • pgtrickle_changes.changes_<oid> tables — ephemeral, recreated by triggers
  • Internal Rust shared-memory layout — handled by .so reload
  • Background worker registration — handled by .so reload

4. Automated Upgrade Validation (CI)

4.1 SQL Object Completeness Check (P0 — must have)

A CI script that compares the pgrx-generated full install SQL against the cumulative upgrade path and reports any objects present in the full script but missing from the upgrade chain.

Implementation: scripts/check_upgrade_completeness.sh

#!/usr/bin/env bash
# Validates that all SQL objects in the full install script are also
# covered by the upgrade chain from the previous release version.
#
# Usage: scripts/check_upgrade_completeness.sh <from_version> <to_version>
# Example: scripts/check_upgrade_completeness.sh 0.1.3 0.2.0
#
# Exit code 0 = all objects covered, 1 = missing objects found.

set -euo pipefail

FROM_VERSION="${1:?Usage: $0 <from_version> <to_version>}"
TO_VERSION="${2:?Usage: $0 <from_version> <to_version>}"

FULL_SQL="target/release/pg_trickle-pg18/usr/share/postgresql/18/extension/pg_trickle--${TO_VERSION}.sql"
UPGRADE_SQL="sql/pg_trickle--${FROM_VERSION}--${TO_VERSION}.sql"

if [[ ! -f "$FULL_SQL" ]]; then
    echo "ERROR: Full install script not found: $FULL_SQL"
    echo "       Run 'cargo pgrx package' first."
    exit 1
fi

if [[ ! -f "$UPGRADE_SQL" ]]; then
    echo "ERROR: Upgrade script not found: $UPGRADE_SQL"
    exit 1
fi

# --- Extract function names from full install script ---
# Matches: CREATE FUNCTION pgtrickle."name"(  (with optional extra spaces)
grep -oP 'CREATE\s+FUNCTION\s+pgtrickle\.\"\K[^"]+' "$FULL_SQL" | sort -u > /tmp/full_functions.txt

# --- Extract function names from upgrade script ---
grep -oP 'CREATE\s+(OR\s+REPLACE\s+)?FUNCTION\s+pgtrickle\.\"\K[^"]+' "$UPGRADE_SQL" | sort -u > /tmp/upgrade_functions.txt

# --- Extract functions from the PREVIOUS full install script (baseline) ---
# These already exist in the DB and don't need to be in the upgrade script.
PREV_FULL_SQL="target/release/pg_trickle-pg18/usr/share/postgresql/18/extension/pg_trickle--${FROM_VERSION}.sql"
if [[ -f "$PREV_FULL_SQL" ]]; then
    grep -oP 'CREATE\s+FUNCTION\s+pgtrickle\.\"\K[^"]+' "$PREV_FULL_SQL" | sort -u > /tmp/prev_functions.txt
else
    # No previous full script available — use the old version's function list
    # from the Docker image (see §5.2 for how we archive these).
    echo "WARNING: Previous full install script not found: $PREV_FULL_SQL"
    echo "         Will report all functions as potentially missing."
    : > /tmp/prev_functions.txt
fi

# --- Functions that are NEW (in full but not in previous) ---
comm -23 /tmp/full_functions.txt /tmp/prev_functions.txt > /tmp/new_functions.txt

# --- New functions that are MISSING from the upgrade script ---
comm -23 /tmp/new_functions.txt /tmp/upgrade_functions.txt > /tmp/missing_functions.txt

# --- Also check for views, indexes, event triggers ---
# (Extend similarly for other object types)

MISSING_COUNT=$(wc -l < /tmp/missing_functions.txt)
if [[ "$MISSING_COUNT" -gt 0 ]]; then
    echo "ERROR: ${MISSING_COUNT} function(s) in full install script missing from upgrade path:"
    echo ""
    cat /tmp/missing_functions.txt | sed 's/^/  - pgtrickle./'
    echo ""
    echo "Add CREATE OR REPLACE FUNCTION statements to: $UPGRADE_SQL"
    exit 1
fi

echo "OK: All ${#$(cat /tmp/new_functions.txt)} new functions are covered by the upgrade script."

CI integration: Run in the lint/check job after cargo pgrx package:

# .github/workflows/ci.yml (upgrade-check job)
upgrade-check:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Build extension package
      run: cargo pgrx package --pg-config $(which pg_config)
    - name: Validate upgrade completeness
      run: scripts/check_upgrade_completeness.sh 0.1.3 0.2.0

4.2 Two-Version Upgrade E2E Tests (P0 — must have)

True upgrade tests that install the old version, populate data, run ALTER EXTENSION UPDATE, and verify everything works.

Implementation approach: multi-stage Dockerfile

# tests/Dockerfile.e2e-upgrade
#
# Stage 1: Build the OLD version from a tagged release
FROM postgres:18.1 AS builder-old
ARG FROM_VERSION=0.1.3
# ... install rust, cargo pgrx, checkout tag v${FROM_VERSION}, build ...

# Stage 2: Build the NEW (current) version
FROM postgres:18.1 AS builder-new
# ... install rust, cargo pgrx, build current source ...

# Stage 3: Runtime image with BOTH versions installed
FROM postgres:18.1
COPY --from=builder-old /usr/share/postgresql/18/extension/pg_trickle--${FROM_VERSION}.sql ...
COPY --from=builder-old /usr/lib/postgresql/18/lib/pg_trickle.so  pg_trickle_old.so
COPY --from=builder-new /usr/share/postgresql/18/extension/pg_trickle--${TO_VERSION}.sql ...
COPY --from=builder-new /usr/share/postgresql/18/extension/pg_trickle--${FROM_VERSION}--${TO_VERSION}.sql ...
COPY --from=builder-new /usr/lib/postgresql/18/lib/pg_trickle.so ...
COPY --from=builder-new /usr/share/postgresql/18/extension/pg_trickle.control ...

Alternative (faster): archive-based approach

To avoid building the old version from source every time in CI:

  1. On every release tag, CI publishes a tarball of the extension files (pg_trickle-<version>-pg18-linux-amd64.tar.gz) to GitHub Releases.
  2. The upgrade Dockerfile downloads the old version’s tarball and installs it.
  3. The new version is built from source as usual.

This reduces upgrade Docker builds from ~20 min to ~12 min.

Test cases in tests/e2e_upgrade_tests.rs:

// ── True version-to-version upgrade tests ───────────────────────

/// Install old version, create stream tables with data, run ALTER
/// EXTENSION UPDATE, verify all stream tables still function.
#[tokio::test]
async fn test_upgrade_X_to_Y_stream_tables_survive() { ... }

/// Verify all new functions are callable after upgrade.
#[tokio::test]
async fn test_upgrade_X_to_Y_new_functions_exist() { ... }

/// Verify that new catalog columns exist with correct defaults.
#[tokio::test]
async fn test_upgrade_X_to_Y_catalog_schema_migrated() { ... }

/// Verify that DIFFERENTIAL refresh still works on stream tables
/// created under the old version (change buffers are compatible).
#[tokio::test]
async fn test_upgrade_X_to_Y_differential_refresh_compat() { ... }

/// Verify that the background worker restarts cleanly after the
/// .so is replaced (no crashes, no orphaned workers).
#[tokio::test]
async fn test_upgrade_X_to_Y_bgworker_restarts() { ... }

4.3 Upgrade Smoke Test for Every PR (P1 — should have)

A lightweight check that runs on every PR (not just push-to-main):

  1. Build cargo pgrx package to generate the full SQL.
  2. Run scripts/check_upgrade_completeness.sh — fails fast if upgrade script is out of sync.
  3. Does NOT build the full two-version Docker image (that is too slow for PRs).

The full two-version E2E tests run on push-to-main and daily, same as existing E2E tests.


5. Upgrade Script Authoring Process

5.1 Step-by-step workflow

When bumping the extension version (e.g., 0.2.0 → 0.3.0):

  1. Build the full SQL for both versions: ```bash

    Build new version

    cargo pgrx package –pg-config $(which pg_config)

    Save the generated SQL

    cp target/release/…/pg_trickle–0.3.0.sql /tmp/new.sql

    Check out previous tag and build

    git stash && git checkout v0.2.0 cargo pgrx package –pg-config $(which pg_config) cp target/release/…/pg_trickle–0.2.0.sql /tmp/old.sql git checkout - && git stash pop ```

  2. Diff the two full SQL scripts: bash diff /tmp/old.sql /tmp/new.sql This reveals every new/changed/removed SQL object.

  3. Author the upgrade script in sql/pg_trickle--0.2.0--0.3.0.sql, covering every difference from the diff.

  4. Run the completeness check: bash scripts/check_upgrade_completeness.sh 0.2.0 0.3.0

  5. Test locally (see §5.3 below).

  6. Add test cases to tests/e2e_upgrade_tests.rs for the new version pair.

5.2 Archiving previous version SQL

To make step 1 faster and enable the completeness check without building old versions, we archive the full SQL file for each released version:

sql/archive/
├── pg_trickle--0.1.3.sql.gz    # Full install script from v0.1.3 release
├── pg_trickle--0.2.0.sql.gz    # Full install script from v0.2.0 release
└── ...

The release workflow adds this automatically:

# In .github/workflows/release.yml
- name: Archive full SQL for future upgrade diffs
  run: |
    gzip -k "target/release/.../pg_trickle--${VERSION}.sql"
    cp "target/release/.../pg_trickle--${VERSION}.sql.gz" sql/archive/

5.3 Local testing workflow

# Quick: completeness check only (no Docker)
just check-upgrade 0.1.3 0.2.0

# Full: build upgrade Docker image and run E2E tests
just test-upgrade 0.1.3 0.2.0

Justfile targets:

# Validate upgrade script covers all new objects
check-upgrade from to:
    cargo pgrx package --pg-config $(which pg_config)
    scripts/check_upgrade_completeness.sh {{from}} {{to}}

# Build two-version Docker image and run upgrade E2E tests
test-upgrade from to:
    docker build \
      --build-arg FROM_VERSION={{from}} \
      --build-arg TO_VERSION={{to}} \
      -f tests/Dockerfile.e2e-upgrade \
      -t pg_trickle_upgrade_test:latest .
    cargo test --test e2e_upgrade_tests -- --test-threads=1

6. Change Buffer Handling During Upgrade

pgtrickle_changes.changes_<oid> tables are created dynamically per source table. They contain in-flight change records that should NOT be discarded during an upgrade.

Strategy:

  1. The upgrade script does NOT touch pgtrickle_changes.* tables directly.
  2. If the change buffer schema changes (new columns), the Rust code handles missing columns via ALTER TABLE ... ADD COLUMN IF NOT EXISTS during the first CDC trigger fire after upgrade.
  3. The Rust code compares its expected buffer schema against the actual table at trigger time and self-heals.

In-flight record safety:

  • ALTER EXTENSION UPDATE runs inside a single transaction — either all DDL succeeds or none does.
  • The background worker’s scheduler loop checks pg_trickle.enabled and will pause processing during DDL transactions (lock contention causes the worker tick to skip).
  • Recommendation: perform upgrades during low-traffic windows.

7. Rollback / Downgrade

PostgreSQL does not support automatic extension downgrades. If a downgrade is needed:

  1. Export all stream table definitions: bash pg_trickle_dump --format sql --output backup.sql Uses the standalone dump tool added in Phase 5.3. The tool also accepts --dsn, DATABASE_URL, or PG* environment variables.

  2. DROP EXTENSION pg_trickle CASCADE;destroys all stream tables

  3. Install old version: CREATE EXTENSION pg_trickle VERSION '0.1.3';

  4. Re-create stream tables from the backup: bash psql -f backup.sql


8. Upgrade Path Matrix

Every released version pair must have a direct upgrade script. PostgreSQL can chain scripts (0.1.3 → 0.2.0 → 0.3.0), but we also provide direct jumps for common paths to minimize downtime.

From To Script Status
0.1.3 0.2.0 pg_trickle--0.1.3--0.2.0.sql ✅ Done
0.2.0 0.2.1 pg_trickle--0.2.0--0.2.1.sql ✅ Done
0.2.1 0.2.2 pg_trickle--0.2.1--0.2.2.sql ✅ Done
0.2.2 0.2.3 pg_trickle--0.2.2--0.2.3.sql ✅ Done
0.2.2 0.3.0 pg_trickle--0.2.2--0.3.0.sql Not started
0.1.3 0.2.2 Chained via 0.1.3→0.2.0→0.2.1→0.2.2 Automatic
0.3.0 1.0.0 pg_trickle--0.3.0--1.0.0.sql Not started

9. Implementation Plan

Phase 1: Completeness Check Script (P0) — ✅ DONE

Goal: Never ship an upgrade script that silently drops functions again.

# Task Deliverable Status
1.1 Write scripts/check_upgrade_completeness.sh Shell script (224 lines) ✅ Done
1.2 Add just check-upgrade target to justfile Justfile entry ✅ Done
1.3 Extend script to check views, event triggers, indexes Enhanced script ✅ Done
1.4 Add CI job (runs on every PR in lint stage) upgrade-check job in ci.yml ✅ Done
1.5 Archive the v0.1.3 + v0.2.0 full SQL in sql/archive/ SQL files ✅ Done

Acceptance criteria: - just check-upgrade 0.1.3 0.2.0 passes (current state) - Adding a new #[pg_extern] without updating the upgrade script causes CI failure with a clear error message naming the missing function.

Phase 2: Two-Version Upgrade Docker Image (P0) — ✅ DONE

Goal: True E2E test that installs the old version, populates data, and upgrades.

# Task Deliverable Status
2.1 Create tests/Dockerfile.e2e-upgrade (extends base image) Dockerfile ✅ Done
2.2 Add build_e2e_upgrade_image.sh helper script Shell script ✅ Done
2.3 Publish release artifacts (extension tarballs) in release workflow SQL archive step in release.yml ✅ Done
2.4 Add just build-upgrade-image and just test-upgrade targets Justfile entries ✅ Done

Acceptance criteria: - just test-upgrade 0.1.3 0.2.0 builds an image with 0.1.3 pre-installed, starts Postgres, and makes the image available for E2E tests.

Phase 3: Upgrade E2E Test Suite (P0) — ✅ DONE

Goal: Automated tests that exercise real upgrade paths end-to-end.

# Task Deliverable Status
3.1 Upgrade tests use PGS_E2E_IMAGE + PGS_UPGRADE_FROM/TO env vars e2e_upgrade_tests.rs ✅ Done
3.2 Test L8: all new functions exist after upgrade e2e_upgrade_tests.rs ✅ Done
3.3 Test L9: existing stream tables survive and refresh e2e_upgrade_tests.rs ✅ Done
3.4 Test L10: all views queryable after upgrade e2e_upgrade_tests.rs ✅ Done
3.5 Test L11: event triggers present after upgrade e2e_upgrade_tests.rs ✅ Done
3.6 Test L12: version function reports new version e2e_upgrade_tests.rs ✅ Done
3.7 Test L13: function parity with fresh install e2e_upgrade_tests.rs ✅ Done
3.8 Test: DIFFERENTIAL refresh works on pre-upgrade STs e2e_upgrade_tests.rs Deferred (needs real old binary)
3.9 Test: chained upgrade (0.1.3 → 0.2.0 → 0.3.0) e2e_upgrade_tests.rs Deferred (0.3.0 not yet)

Acceptance criteria: - cargo test --test e2e_upgrade_tests passes all tests. - Tests run in CI on push-to-main and daily schedule.

Phase 4: CI Pipeline Integration (P1) — ✅ DONE

Goal: Upgrade checks are part of the standard CI workflow.

# Task Deliverable Status
4.1 Add upgrade-check lint job (every PR) ci.yml upgrade-check job ✅ Done
4.2 Add upgrade-e2e job (push-to-main + daily) ci.yml upgrade-e2e-tests job ✅ Done
4.3 Add release-workflow step to archive full SQL release.yml SQL archive step ✅ Done
4.4 Update CI coverage table in AGENTS.md Documentation ✅ Done

CI coverage after this phase:

Job PR Push to main Daily Manual
Upgrade completeness check
Upgrade E2E tests

Phase 5: Upgrade Documentation & Tooling (P2) — Partially Done

Goal: Users have clear upgrade instructions and a safety net.

# Task Deliverable Status
5.1 Write docs/UPGRADING.md user-facing guide Documentation ✅ Done
5.2 Add pre-upgrade version check to CREATE EXTENSION path Rust code ✅ Done (v0.2.2)
5.3 Implement pg_trickle_dump backup tool (SQL export) Rust / SQL ✅ Done (v0.2.3)
5.4 Add upgrade section to docs/FAQ.md Documentation ✅ Done (v0.2.2)
5.5 Document upgrade path in INSTALL.md Documentation ✅ Done

Phase 5.3 deliverable: pg_trickle_dump now exports replayable SQL using pgtrickle.create_stream_table(...), orders dependent stream tables topologically, prefers original_query when available, and restores non-ACTIVE status after replay.


10. Lessons Learned (Incident Log)

2026-03-04: Missing functions after ALTER EXTENSION UPDATE

Symptom: pgtrickle.change_buffer_sizes() returned “function does not exist” despite the .dylib exporting the symbol and the full install SQL containing the definition.

Root cause: The upgrade script pg_trickle--0.1.3--0.2.0.sql was a no-op placeholder (SELECT 'no-op'). The user’s database was upgraded from 0.1.3 → 0.2.0 via ALTER EXTENSION UPDATE, which ran only the upgrade script — not the full install script. Three functions new in 0.2.0 were never registered: list_sources, change_buffer_sizes, _signal_launcher_rescan.

Fix: Added the missing CREATE OR REPLACE FUNCTION statements to the upgrade script. For the running database, executed the same statements manually using $libdir/pg_trickle as the library path.

Update (2026-03-05): The completeness check script found 8 additional missing functions (dependency_tree, diamond_groups, health_check, pgt_ivm_apply_delta, pgt_ivm_handle_truncate, refresh_timeline, trigger_inventory, version). All 11 new functions are now included in the upgrade script. The completeness check passes.

Systemic fix: Phase 1 of this plan (completeness check script in CI) prevents this class of bug from recurring.


11. Upgrade Script Authoring Checklist

For every version bump, complete all items before merging:

  • [ ] Diff the old and new pgrx-generated full SQL scripts
  • [ ] Write sql/pg_trickle--X--Y.sql covering all differences
  • [ ] Use CREATE OR REPLACE FUNCTION for new/changed functions
  • [ ] Use DO $$ IF NOT EXISTS guards for ALTER TABLE ADD COLUMN
  • [ ] Use CREATE INDEX IF NOT EXISTS for new indexes
  • [ ] Use CREATE OR REPLACE VIEW for new/changed views
  • [ ] Run just check-upgrade X Y — must pass with zero findings
  • [ ] Run just test-upgrade X Y — full E2E upgrade tests pass
  • [ ] Archive the full SQL in sql/archive/
  • [ ] Bump default_version in pg_trickle.control
  • [ ] Document changes in CHANGELOG.md
  • [ ] Add entry to the upgrade path matrix (§8)

12. Open Questions

# Question Status
Q1 Should we provide direct-jump scripts (e.g., 0.1.3→1.0.0) or rely only on chaining? Direct jumps are faster but more scripts to maintain. Decide at 0.3.0
Q2 Should the extension warn at CREATE EXTENSION time if .so version ≠ SQL version (stale install)? ✅ Done (v0.2.2) — implemented as scheduler startup check
Q3 How do we handle pg_trickle.control default_version with pgrx’s @CARGO_VERSION@ macro during packaging? Currently works; verify in CI
Q4 Should upgrade E2E tests also cover CNPG (CloudNativePG) operator upgrades? Defer to post-1.0

13. Remaining Work (Prioritized)

  1. DIFFERENTIAL compat test (3.8) — Requires building a real v0.1.3 binary from source to test CDC trigger compatibility across versions.
  2. Chained upgrade test (3.9) — Blocked until v0.3.0 exists.

Upgrade validation for the released v0.2.3 path is complete: the migration chain and upgrade E2E suite have been exercised through the current release, including just check-upgrade 0.2.2 0.2.3 and just test-upgrade 0.1.3 0.2.3.


References