PLAN_PARTITIONING_SHARDING.md — PostgreSQL Partitioning & Sharding Compatibility

Status: Implemented (PT1–PT5 in v0.6.0)
Date: 2026-03-01
Updated: 2026-03-14 — PT1–PT5 implemented. E2E tests, ATTACH PARTITION detection, WAL publish_via_partition_root, foreign table info, and user-facing tutorial all shipped.
Related: PLAN_CITUS.md · REPORT_TIMESCALEDB.md · REPORT_TRIGGERS_VS_REPLICATION.md


1. Executive Summary

This report analyses how pg_trickle interacts with PostgreSQL’s native table partitioning (declarative partitioning, available since PG 10) and external sharding solutions (Citus, pg_partman). It covers three scenarios:

  1. Partitioned source tables — source tables referenced in stream table defining queries are partitioned.
  2. Partitioned stream table storage — the materialized storage table backing a stream table is itself partitioned.
  3. Distributed/sharded deployments — the database spans multiple nodes (Citus, foreign-data-wrapper sharding, read replicas).

Key findings:

  • Native partitioning of source tables works today with minor caveats. PostgreSQL 13+ propagates row-level triggers from the partitioned parent to child partitions, so CDC captures all DML transparently. The extension already logs an info message (F13) confirming this.
  • Partitioned storage tables are not supported and would require non-trivial changes to the storage table creation, MERGE/delta application, and row ID management paths.
  • Distributed sharding (Citus) is a much larger problem with a dedicated plan already in PLAN_CITUS.md. The core issues (OID locality, WAL LSN incomparability, trigger propagation) are architectural.
  • The most impactful near-term improvement is validating and documenting the native partitioning story, and adding targeted tests.

2. Background: PostgreSQL Partitioning

2.1. Declarative Partitioning (PG 10+)

PostgreSQL supports declarative partitioning via PARTITION BY:

CREATE TABLE orders (
    id         BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    customer_id INT,
    total      NUMERIC
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
CREATE TABLE orders_2026 PARTITION OF orders
    FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');

Partition strategies: RANGE, LIST, HASH. Sub-partitioning is supported.

2.2. Key Properties for pg_trickle

Property Behaviour Impact
OID Parent table has its own OID; each child partition has a separate OID CDC triggers are keyed by OID — must use parent OID
Triggers PG 13+: row-level triggers on parent fire for DML routed to any partition CDC works transparently
Triggers (PG 11–12) Triggers on parent do NOT fire for partition-routed DML CDC would miss all changes — effectively broken
TRUNCATE TRUNCATE parent truncates all partitions; fires statement-level trigger on parent Our TRUNCATE capture trigger works
DDL propagation ALTER TABLE parent ADD COLUMN propagates to all partitions DDL hooks see the parent’s OID — reinit works
Partition DDL ALTER TABLE parent ATTACH/DETACH PARTITION fires ddl_command_end Hook fires but no column change → no action needed
Primary keys PK must include all partition key columns pk_hash computation works (resolves PK via parent OID)
Statistics pg_class.reltuples on parent is the sum of child reltuples (PG 14+) Adaptive FULL/DIFFERENTIAL ratio works correctly
WAL Each partition generates its own WAL records Logical replication must include all partitions in publication
Indexes Partitioned indexes are virtual; each partition has its own No impact (we don’t read source indexes)

2.3. Partition Management Extensions

  • pg_partman — Automated partition creation/retention. Compatible: it manages DDL on child partitions, which doesn’t affect the parent’s schema. Dropping old partitions is equivalent to DROP TABLE on a child — the parent OID remains valid.
  • TimescaleDB hypertables — Transparent time-based partitioning. See REPORT_TIMESCALEDB.md for detailed analysis. Hypertables are partitioned tables with additional metadata; triggers on the parent propagate to chunks.

3. Scenario 1: Partitioned Source Tables

3.1. Current Status — Works (PG 13+)

When a user creates a stream table over a partitioned source:

SELECT pgtrickle.create_stream_table(
    'daily_totals',
    $$SELECT date_trunc('day', created_at) AS day,
             sum(total) AS total
      FROM orders
      GROUP BY 1$$,
    schedule := '5 minutes'
);

The following happens:

  1. Source resolutionextract_source_relations() resolves orders to the parent’s OID. Child partition OIDs are NOT returned because the parser sees only the parent.

  2. Relkind detectionwarn_source_table_properties() detects relkind = 'p' and emits F13 info message.

  3. CDC trigger creationcreate_change_trigger() creates the trigger on the parent table. PostgreSQL 13+ automatically propagates the trigger to all existing and future child partitions.

  4. Change capture — When DML hits any child partition, the trigger fires with NEW/OLD rows, writing to pgtrickle_changes.changes_<parent_oid>. The pg_current_wal_lsn() call in the trigger correctly reflects the WAL position of the child partition’s write.

  5. PK resolutionresolve_pk_columns() queries pg_constraint on the parent OID. For partitioned tables, the PK constraint is on the parent (and includes partition key columns). This works correctly.

  6. Column resolutionresolve_source_column_defs() queries pg_attribute on the parent OID. All partitions share the same column set. Works correctly.

  7. Full refreshSELECT ... FROM orders naturally reads all partitions via partition pruning. Works correctly.

  8. Differential refresh — The delta query reads from pgtrickle_changes.changes_<parent_oid>. Since all partition DML writes to the same buffer (keyed by parent OID), the delta is complete.

3.2. Caveats

Caveat Severity Details
PG 11–12 incompatibility N/A pg_trickle targets PG 18 only; not relevant
ATTACH PARTITION with existing data Medium ALTER TABLE orders ATTACH PARTITION orders_2027 ... does NOT fire row-level triggers for existing rows in the newly attached partition. The stream table will miss these rows until the next FULL refresh or reinitialization.
DETACH PARTITION Low ALTER TABLE orders DETACH PARTITION orders_2024 ... removes the partition. Future DML on the detached table won’t fire the parent’s trigger. Existing change buffer data for already-captured rows is fine. DDL hooks should detect the ALTER TABLE and consider reinit.
Very large partition counts Low With hundreds of partitions, the trigger is installed on each child. Trigger overhead is per-row and per-partition, but the trigger function is shared. No scalability concern observed up to ~500 partitions in PG benchmarks.
WAL mode CDC Medium For logical replication, the publication must include all partitions. CREATE PUBLICATION ... FOR TABLE orders in PG 13+ uses the publish_via_partition_root option to group all child WAL records under the parent table. wal_decoder.rs must set this option when creating publications for partitioned sources.

3.3. ATTACH PARTITION Gap

This is the most significant gap. When a partition with pre-existing data is attached to a partitioned source table:

-- This table already has 1M rows:
CREATE TABLE orders_2027 (LIKE orders INCLUDING ALL);
INSERT INTO orders_2027 SELECT ... ;

-- Attach it — no row-level triggers fire for existing rows:
ALTER TABLE orders ATTACH PARTITION orders_2027
    FOR VALUES FROM ('2027-01-01') TO ('2028-01-01');

The 1M existing rows are now visible via SELECT * FROM orders but were never captured by the CDC trigger. The stream table is stale.

Detection: The DDL event trigger (_on_ddl_end) fires for ALTER TABLE ... ATTACH PARTITION. The command_tag is ALTER TABLE and the objid is the parent table’s OID. The hook should detect that the parent is a tracked source and mark downstream STs for reinitialize.

Current behaviour: The DDL hook currently checks for column changes (ADD/ DROP/ALTER COLUMN). An ATTACH PARTITION doesn’t change columns, so it may not trigger reinit. This needs verification and potentially a targeted check for the ATTACH PARTITION subcommand.

Recommended fix: In handle_ddl_command(), when the command_tag is ALTER TABLE and the target is a partitioned source, additionally check pg_event_trigger_ddl_commands() for partition-related subcommands. If ATTACH PARTITION is detected, force needs_reinit = true.

3.4. WAL Mode: Publication Configuration

When transitioning from trigger to WAL mode for a partitioned source, the publication must be configured correctly:

-- Current code in wal_decoder.rs creates:
CREATE PUBLICATION pgtrickle_<oid> FOR TABLE <table_name>;

-- For partitioned tables, this should use:
CREATE PUBLICATION pgtrickle_<oid>
    FOR TABLE <table_name>
    WITH (publish_via_partition_root = true);

The publish_via_partition_root option (PG 13+) ensures that changes to child partitions are published under the parent table’s identity, matching the trigger-mode behavior where all changes appear in a single buffer keyed by the parent OID.

Without this option, each child partition publishes its changes under its own OID, and the decoder would need to map child OIDs back to the parent — a significant complication.


4. Scenario 2: Partitioned Stream Table Storage

4.1. Current Status — Not Supported

Stream table storage tables are created as plain heap tables:

CREATE TABLE <schema>.<name> (
    __pgt_row_id BIGINT NOT NULL,
    col1 TYPE,
    col2 TYPE,
    ...
);
CREATE UNIQUE INDEX ... ON <schema>.<name> (__pgt_row_id);

There is no mechanism to partition the storage table.

4.2. Why Users Might Want This

  • Large materialized views — When a stream table materializes billions of rows (e.g., time-series aggregations), partitioning the storage table by time could enable partition pruning on downstream queries and efficient retention management.
  • TTL / retention — Users could drop old partitions of a stream table rather than running expensive DELETE operations.
  • Query performance — Partition pruning on the storage table accelerates analytical queries that filter by the partition key.

4.3. Challenges

Challenge Severity Details
Storage table creation Medium api.rs creates the storage table via CREATE TABLE. Would need to accept an optional PARTITION BY clause from the user and generate child partitions or rely on pg_partman.
MERGE/delta application High The differential refresh uses DELETE + INSERT keyed by __pgt_row_id. With partitioning, the partition key must be part of the row or derivable from it. The MERGE target would need to be the parent table, and PostgreSQL would route rows to the correct partition. This works transparently for INSERT but DELETE requires the partition key to be in the WHERE clause for partition pruning (otherwise it scans all partitions).
Row ID uniqueness Medium __pgt_row_id is a hash, not monotonically increasing. A UNIQUE INDEX on a partitioned table requires the partition key in the index. The unique constraint would become (__pgt_row_id, <partition_key>), which weakens the uniqueness guarantee to per-partition scope.
Full refresh (TRUNCATE + INSERT) Low TRUNCATE on a partitioned table truncates all partitions. Works correctly.
Reinitialize Medium Currently drops and recreates the storage table. Would need to preserve the partition structure, or the user would need to re-specify it.
User triggers Low User triggers on the storage parent propagate to partitions. No issue.

4.4. Potential Design

If implemented, partitioned storage could be an opt-in feature:

SELECT pgtrickle.create_stream_table(
    'daily_totals',
    $$SELECT date_trunc('day', created_at) AS day,
             sum(total) AS total
      FROM orders GROUP BY 1$$,
    schedule := '5 minutes',
    partition_by := 'RANGE (day)',   -- new parameter
    partition_spec := '{
        "auto_create": true,
        "retention": "2 years",
        "interval": "1 month"
    }'
);

The extension would: 1. Create the parent storage table with PARTITION BY RANGE (day). 2. Use pg_partman or custom logic to auto-create child partitions. 3. Modify delta application to include the partition key in DELETE clauses. 4. Adjust the __pgt_row_id unique index to include the partition key. 5. Optionally manage retention (drop partitions older than threshold).

Recommendation: Defer this to a future release. The complexity is moderate and the user base requesting partitioned storage tables should be established first. Users needing partitioned storage today can create a partitioned table manually and use a full-refresh-only stream table.


5. Scenario 3: Distributed Sharding (Citus)

5.1. Summary

Distributed sharding via Citus is fundamentally different from native partitioning. While native partitioning keeps all data on a single PostgreSQL instance with transparent query routing, Citus distributes data across multiple independent PostgreSQL nodes.

A comprehensive plan exists in PLAN_CITUS.md. The key incompatibilities are:

Component Native Partitioning Citus Distributed
OIDs Parent OID is stable; child OIDs are local OIDs differ across coordinator and workers
WAL Single WAL stream Independent WAL per node
Triggers Propagate from parent to children Do NOT propagate to worker shards
SPI Single instance, full catalog access Coordinator has catalog; workers have partial
MERGE Fully supported Limited support; cross-shard joins problematic
Background workers Single scheduler One scheduler per node or coordinator-only
Advisory locks Instance-wide Node-local

5.2. Impact on pg_trickle Components

Component Native Partitioning Impact Citus Impact
CDC triggers (cdc.rs) Works — triggers propagate Broken — triggers don’t propagate to workers
Change buffers Single buffer per parent OID Need distributed or coordinator-aggregated buffers
WAL decoder (wal_decoder.rs) Needs publish_via_partition_root Need multi-node slot management
Frontier/LSN (version.rs) Single WAL, single LSN timeline Incomparable LSNs across nodes
Delta SQL (dvm/) No change needed Cross-shard join complications
MERGE application (refresh.rs) Works transparently Need INSERT ON CONFLICT + DELETE
DAG/scheduler (dag.rs, scheduler.rs) No change Need coordinator-only or multi-node coordination
DDL hooks (hooks.rs) Works — events fire on parent Only fires on coordinator
Shared memory (shmem.rs) No change Node-local; need LISTEN/NOTIFY

5.3. Recommendation

Native PostgreSQL partitioning and Citus sharding should be treated as two entirely separate features with different timelines:

  1. Native partitioning — Near-term; mostly works today. Needs testing, ATTACH PARTITION detection, and WAL publication configuration.
  2. Citus — Long-term; architectural changes needed. Follow the phased plan in PLAN_CITUS.md.

6. Scenario 4: Foreign Data Wrapper (FDW) Sharding

6.1. Overview

Some deployments use postgres_fdw to create a “sharded” view across multiple PostgreSQL instances:

CREATE FOREIGN TABLE remote_orders (...)
    SERVER remote_server OPTIONS (table_name 'orders');

Or with partitioned foreign tables (PG 14+):

CREATE TABLE orders (...) PARTITION BY RANGE (region);
CREATE FOREIGN TABLE orders_us PARTITION OF orders
    FOR VALUES IN ('US')
    SERVER us_server OPTIONS (table_name 'orders');
CREATE FOREIGN TABLE orders_eu PARTITION OF orders
    FOR VALUES IN ('EU')
    SERVER eu_server OPTIONS (table_name 'orders');

6.2. Compatibility

Aspect Status Details
Source resolution Works Foreign tables have OIDs in the local catalog
CDC triggers Broken Row-level triggers cannot be created on foreign tables (ERROR: "..." is not a table or view)
WAL mode Broken Foreign tables don’t generate local WAL
Full refresh Works SELECT * FROM foreign_table executes the remote query
Partition mix Partial A partitioned table with some local and some foreign children: triggers fire only on local partitions; foreign partition DML is invisible to CDC

6.3. Recommendation

Foreign-table sources should be restricted to FULL refresh mode only. The extension should detect relkind = 'f' (foreign table) during source resolution and emit a warning that differential refresh is not available.

For partitioned tables with a mix of local and foreign children, the safest approach is to require FULL refresh mode, since CDC cannot capture changes from foreign partitions.


7. Test Coverage Gaps

Current test coverage for partitioning scenarios:

Scenario Covered? Notes
Partitioned source — basic CRUD No Need E2E test with RANGE partitioned source
Partitioned source — differential refresh No Verify delta capture across partitions
Partitioned source — partition pruning in full refresh No Verify optimizer uses pruning
ATTACH PARTITION with existing data No Verify reinit or staleness detection
DETACH PARTITION No Verify DDL hook fires
pg_partman automatic partition creation No Verify triggers propagate to new partitions
Source with sub-partitioning No Two-level partitioning
Partitioned source — WAL mode CDC No Verify publish_via_partition_root
Foreign table source — full refresh only No Verify trigger creation error handling
Mixed local/foreign partitions No Verify correct warning or mode restriction

Recommended test additions: Create a new tests/e2e_partitioning_tests.rs file with tests covering the scenarios above.


8. Action Items

Near-Term (Current Release)

# Action Component Effort
1 Add E2E tests for partitioned source tables (basic CRUD, differential refresh) tests/ 1–2 days
2 Verify ATTACH PARTITION triggers reinit; add detection if missing src/hooks.rs 1 day
3 Set publish_via_partition_root = true in WAL publication creation for partitioned sources src/wal_decoder.rs Half day
4 Detect foreign table sources (relkind = 'f') and restrict to FULL mode src/api.rs Half day
5 Document partitioned source table support in user docs docs/ Half day

Medium-Term (Future Release)

# Action Component Effort
6 Explore partitioned storage table support (opt-in) src/api.rs, src/refresh.rs 2–3 weeks
7 Handle DETACH PARTITION explicitly in DDL hooks src/hooks.rs 1 day
8 Test with pg_partman automated partition management tests/ 1–2 days

Long-Term

# Action Component Effort
9 Citus distributed table support All modules ~6 months (see PLAN_CITUS.md)
10 FDW-sharded source support with remote CDC New module Research needed

9. Summary Matrix

Deployment Pattern Source Tables Storage Tables Status
Native partitioning (RANGE/LIST/HASH) Works (PG 13+) Not supported Ready with caveats
Sub-partitioning Works (same as above) Not supported Untested
pg_partman managed Works (triggers propagate to new partitions) Not supported Untested
TimescaleDB hypertables Works (see REPORT_TIMESCALEDB.md) Not supported Research
Citus distributed Not supported Not supported See PLAN_CITUS.md
Citus reference tables Works (coordinator-local triggers) Possible Untested
FDW foreign tables Full refresh only N/A Needs implementation
Mixed local + foreign partitions Full refresh only N/A Needs implementation

Appendix A: PostgreSQL Partition Trigger Propagation

Starting in PostgreSQL 13, row-level triggers created on a partitioned parent table are automatically cloned to all existing and newly created child partitions. The trigger fires with the child partition’s TG_TABLE_NAME and TG_RELID, but the trigger function is the same.

Key behavior details:

-- Create trigger on parent
CREATE TRIGGER cdc_trigger
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE FUNCTION capture_changes();

-- PostgreSQL automatically creates matching triggers on:
--   orders_2025, orders_2026, orders_2027, ...

-- New partition created later:
CREATE TABLE orders_2028 PARTITION OF orders
    FOR VALUES FROM ('2028-01-01') TO ('2029-01-01');
-- Trigger is automatically cloned to orders_2028

Within the trigger function for pg_trickle, the TG_RELID is the child partition’s OID, not the parent’s. However, since our trigger function is generated with the parent OID hardcoded (e.g., pg_trickle_cdc_fn_<parent_oid>), the change buffer table name is derived from the parent OID, and all partition DML is correctly routed to the same buffer table.

Appendix B: Partition Key in Row ID

For partitioned storage tables (Scenario 2), the __pgt_row_id uniqueness guarantee requires special handling:

Current:  UNIQUE INDEX ON storage (__pgt_row_id)
          → Works for unpartitioned tables

With partitioning:
          UNIQUE INDEX ON storage (__pgt_row_id, <partition_key>)
          → Uniqueness is per-partition only
          → Two rows in different partitions could have the same __pgt_row_id

Alternative:
          Use __pgt_row_id as a composite of (original_hash, partition_key_hash)
          → Globally unique but changes hash semantics

This is one of the key design decisions that would need resolution before implementing partitioned storage tables.