Full E2E Test Suite — Deep Evaluation Report

Date: 2025-03-16
Scope: 18 full-E2E-only test files (222 tests, ~11,000 lines) requiring the custom Docker image with the compiled extension
Goal: Assess coverage confidence and identify mitigations to harden the suite

Implementation Status

Updated: 2026-03-17
Branch: test-evals-full-e2e

Completed Mitigations

Priority	Item	Status	Files Changed
P0-1	WAL CDC data capture multiset assertions	✅ Done	`e2e_wal_cdc_tests.rs`
P0-2	Partition tests multiset assertions	✅ Done	`e2e_partition_tests.rs`
P0-3	DDL event post-reinit data assertions	✅ Done	`e2e_ddl_event_tests.rs`
P0-4	Circular ST convergence data assertions	✅ Done	`e2e_circular_tests.rs`
P1-1	Fix RLS superuser bypass in test	✅ Done	`e2e_rls_tests.rs`
P1-2	Add multiset to append-only fallback tests	✅ Done	`e2e_append_only_tests.rs`
P1-3	Add multiset to cascade regression tests 3 and 6	✅ Done	`e2e_cascade_regression_tests.rs`
P1-4	Add multiset to bootstrap gating refresh tests 12 and 17	✅ Done	`e2e_bootstrap_gating_tests.rs`
P2-1	Benchmark smoke assertions	✅ Done	`e2e_bench_tests.rs`
P2-2	Add multiset after ALTER QUERY	✅ Done	`e2e_alter_query_tests.rs`
P2-3	Upgrade survival multiset	✅ Done	`e2e_upgrade_tests.rs`
P2-4	Non-convergence guaranteed divergence	✅ Done	`e2e_circular_tests.rs`
P3-1	Cascade ad-hoc to multiset	✅ Done	`e2e_cascade_regression_tests.rs`
P3-2	DELETE/UPDATE in bootstrap gating	✅ Done	`e2e_bootstrap_gating_tests.rs`
P3-3	Standardize bgworker multiset	✅ Done	`e2e_bgworker_tests.rs`

P0-1 Details (WAL CDC)

Added assert_st_matches_query to four tests:
- test_wal_cdc_captures_insert: verifies all inserted rows decoded correctly
- test_wal_cdc_captures_update: verifies update reflected via WAL pipeline
- test_wal_cdc_captures_delete: verifies only kept rows remain
- test_wal_fallback_on_missing_slot: verifies no data loss after fallback

P0-2 Details (Partitions)

Added assert_st_matches_query to six tests:
- test_partition_range_full_refresh: row-level correctness for RANGE + FULL
- test_partition_range_differential_refresh: correctness after I/U/D across partitions
- test_partition_list_source: aggregated result correctness for LIST partition
- test_partition_hash_source: no row loss/corruption for HASH partition
- test_partition_with_aggregation: full GROUP BY result over both partitions
- test_partition_differential_with_aggregation: GROUP BY result after cross-partition INSERT

P0-3 Details (DDL Events)

Added post-reinit data assertions to five tests:
- test_function_change_marks_st_for_reinit: refreshes after replacement, verifies new function body applies
- test_add_column_on_source_st_still_functional: multiset after ADD COLUMN refresh
- test_add_column_unused_st_survives_refresh: multiset verifies unused column excluded
- test_drop_unused_column_st_survives: multiset after DROP COLUMN refresh
- test_alter_column_type_triggers_reinit: refreshes after type change, verifies correct data

P0-4 Details (Circular)

Added to test_circular_monotone_cycle_converges:
- Row count assertion: ≥6 pairs for transitive closure of 3-node chain
- Existence assertion: pair (1,4) must exist, which requires 2+ fixpoint iterations
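
The two assertions above can be reproduced on a small in-memory model. The following is a minimal sketch, not the test's actual code. It assumes the chain is 1→2→3→4 (three edges, the smallest chain consistent with both the ≥6 count and the pair (1,4)) and a naive fixpoint loop in which each pass extends every known pair by one base edge. The names `step` and `transitive_closure` are hypothetical.

```rust
use std::collections::BTreeSet;

type Pair = (i32, i32);

/// One fixpoint pass: extend every known path (a, b) by one base edge (b, c).
fn step(closure: &BTreeSet<Pair>, edges: &[Pair]) -> BTreeSet<Pair> {
    let mut next = closure.clone();
    for &(a, b) in closure {
        for &(from, to) in edges {
            if b == from {
                next.insert((a, to));
            }
        }
    }
    next
}

/// Iterate until no new pairs appear. Returns the closure and the number of
/// passes that added at least one pair.
fn transitive_closure(edges: &[Pair]) -> (BTreeSet<Pair>, usize) {
    let mut closure: BTreeSet<Pair> = edges.iter().copied().collect();
    let mut productive_passes = 0;
    loop {
        let next = step(&closure, edges);
        if next == closure {
            return (closure, productive_passes);
        }
        closure = next;
        productive_passes += 1;
    }
}

fn main() {
    // Assumed edge list (hypothetical): chosen so that (1, 4) is reachable.
    let edges = [(1, 2), (2, 3), (3, 4)];
    let base: BTreeSet<Pair> = edges.iter().copied().collect();

    let after_one_pass = step(&base, &edges);
    let (closure, passes) = transitive_closure(&edges);

    println!("after one pass: {:?}", after_one_pass);
    println!("closure: {:?} ({} productive passes)", closure, passes);
}
```

Under these assumptions the closure has exactly six pairs, and (1,4) is still absent after the first pass, so a refresh that stopped after one iteration would fail the existence assertion.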

P1-1 Details (RLS)

Fixed test_rls_on_stream_table_filters_reads:
- Uses db.pool.begin() + SET LOCAL ROLE rls_reader in a transaction
- Asserts count = 2 (only tenant_id=10 rows visible) as restricted role
- Existing superuser assertion count = 4 retained

P1-2 Details (Append-Only)

Added assert_st_matches_query to three tests:
- test_append_only_fallback_on_delete: verifies row absent after DELETE + MERGE fallback
- test_append_only_fallback_on_update: verifies no stale old-value rows remain
- test_alter_enable_append_only: verifies correct data after INSERT via append-only path

P1-3 Details (Cascade Regression)

Added assert_st_matches_query to two tests:
- test_st_on_st_cascade_propagates_delete: compares order_report against its defining query post-DELETE
- test_three_layer_cascade_insert_propagates: compares big_categories against category_flags WHERE is_big = true post-INSERT

P1-4 Details (Bootstrap Gating)

Added assert_st_matches_query to two tests:
- test_manual_refresh_works_through_full_lifecycle: verifies all 3 rows correct after full gate/ungate/re-gate cycle
- test_manual_refresh_not_blocked_by_gate: verifies both rows correct after gated manual refresh

Remaining Work

Priority	Item	Status
P2-1	Add smoke correctness check to benchmarks (32 tests)	Not started
P2-2	Add ALTER QUERY + DML cycle tests	Not started
P2-3	Add upgrade chain data validation	Not started
P2-4	Add non-convergence test with guaranteed divergence	Not started
P3-1	Consolidate cascade value checks to multiset	Not started
P3-2	Add DELETE/UPDATE to bootstrap gating tests	Not started
P3-3	Standardise bgworker test assertions	Not started

Implementation Status
Executive Summary
Test Infrastructure
Per-File Analysis
Cross-Cutting Findings
Priority Mitigations
Appendix: Coverage Matrix

Executive Summary

The full E2E test suite consists of 222 test functions across 18 files (~11,000 lines). These tests require the custom Docker image built from tests/Dockerfile.e2e with the compiled extension, background worker, shared_preload_libraries, and GUC support. They run via just test-e2e (CI: push to main + daily schedule + manual dispatch; skipped on PRs).

Confidence level: MODERATE (≈65%)

Strength Distribution

Verdict	Files	Tests	% of Total
STRONG	4	40	18%
ADEQUATE	9	122	55%
WEAK	5	60	27%

Files Using `assert_st_matches_query` (Multiset Comparison)

File	Calls	Tests w/ Multiset
`e2e_differential_gaps_tests`	39	13/13 (100%)
`e2e_multi_cycle_tests`	21	6/9 (67%)
`e2e_guc_variation_tests`	10	8/13 (62%)
`e2e_dag_autorefresh_tests`	8	4/5 (80%)
`e2e_bgworker_tests`	2	2/9 (22%)
`e2e_user_trigger_tests`	2	2/11 (18%)
`e2e_alter_query_tests`	1	1/15 (7%)
`e2e_upgrade_tests`	1	1/14 (7%)
10 files with ZERO	0	0/138 (0%)
TOTAL	84	37/222 (17%)

83% of full-E2E tests do NOT use multiset comparison for data correctness.
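
Why count-based or aggregate-based assertions are not enough can be shown without a database. The sketch below is an in-memory illustration only, assuming hypothetical (id, amount) rows; it is not how assert_st_matches_query is implemented. Two result sets have the same row count and the same SUM, yet a multiset diff reports the mismatched rows.

```rust
use std::collections::HashMap;

type Row = (i32, i64); // hypothetical (id, amount) row

/// Count occurrences of each row so duplicates are respected.
fn to_multiset(rows: &[Row]) -> HashMap<Row, i64> {
    let mut counts = HashMap::new();
    for row in rows {
        *counts.entry(*row).or_insert(0) += 1;
    }
    counts
}

/// Rows whose multiplicity differs between the two sides, with the signed
/// difference (actual minus expected). Empty means the multisets are equal.
fn multiset_diff(expected: &[Row], actual: &[Row]) -> Vec<(Row, i64)> {
    let mut delta = to_multiset(actual);
    for (row, n) in to_multiset(expected) {
        *delta.entry(row).or_insert(0) -= n;
    }
    let mut diff: Vec<(Row, i64)> = delta.into_iter().filter(|(_, n)| *n != 0).collect();
    diff.sort();
    diff
}

fn main() {
    let expected: [Row; 3] = [(1, 100), (2, 200), (3, 300)];
    // Amounts for ids 2 and 3 are swapped: same count, same total.
    let actual: [Row; 3] = [(1, 100), (2, 300), (3, 200)];

    let sum = |rows: &[Row]| rows.iter().map(|r| r.1).sum::<i64>();
    assert_eq!(expected.len(), actual.len()); // a count check passes
    assert_eq!(sum(&expected), sum(&actual)); // a SUM check passes

    // Only the row-level comparison exposes the corruption.
    println!("diff: {:?}", multiset_diff(&expected, &actual));
}
```

The diff lists each wrong row with a +1 (unexpected) or -1 (missing) multiplicity, which is the kind of signal a count-only test cannot produce.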

Strengths

Area	Assessment
UDA + nested OR differential gaps	Exceptional — 13/13 tests with multiset, full DML cycles
Multi-cycle cumulative correctness	Strong — 5+ DML cycles with multiset at each checkpoint
DAG autorefresh cascades	Strong — 3-4 layer topologies with multiset at all layers
GUC variation correctness	Strong — 8 GUC configurations validated with multiset
DDL event detection	Good — 14 tests covering ADD/DROP/ALTER column, function changes, RENAME
Bootstrap gating lifecycle	Good — 18 tests covering full gate → ungate → re-gate cycle

Weaknesses

Severity	Finding	Impact
CRITICAL	10 files (138 tests) have ZERO multiset comparison	Data corruption undetectable in partition, RLS, WAL CDC, circular, DDL event, append-only, bootstrap gating, cascade regression, bench, and ergonomics tests
HIGH	Partition tests rely on `db.count()` only	All 5 partition types (RANGE/LIST/HASH + aggregation) unverified for row correctness
HIGH	WAL CDC data capture tests use count only	WAL INSERT/UPDATE/DELETE correctness never verified at row level
HIGH	Circular ST data correctness never verified	Cycle convergence could produce wrong data; only metadata (scc_id, status) checked
MEDIUM	Cascade regression tests miss multiset on 3-layer chains	Test 6 (3-layer) only counts; tests 2, 7 use partial data checks
MEDIUM	Benchmark tests (32) have zero correctness assertions	Performance measured on potentially incorrect results
MEDIUM	RLS tests don’t verify row-level filtering	Test 3 runs as superuser (bypasses RLS); no restricted-user query
LOW	Ergonomics tests are metadata-only	By design — API contract tests, not data tests

Test Infrastructure

Full E2E Docker Image

Docker image: Built from tests/Dockerfile.e2e, includes:
- PostgreSQL 18.x with the compiled pg_trickle extension
- shared_preload_libraries = 'pg_trickle' configured
- Background worker active
- All GUCs available

Test harness: tests/e2e/mod.rs provides TestDb with:
- create_st() / refresh_st() / drop_st(): extension function wrappers
- assert_st_matches_query(st_name, query): EXCEPT-based multiset comparison that auto-discovers columns, handles json→text casts, and filters internal __pgt_* columns. Supports EXCEPT/INTERSECT set-operation visibility filters.
- wait_for_scheduler(): polls until background worker completes a refresh
- Full sqlx::PgPool access for arbitrary SQL
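
To make the comparison strategy concrete, here is a minimal sketch of how such a check could assemble its SQL. The harness's real query is not shown in this report, so everything below is an assumption: the `Column` struct, the function names, the example table names, and the choice of EXCEPT ALL (plain EXCEPT removes duplicates, so EXCEPT ALL is what yields multiset semantics in PostgreSQL). The json cast reflects that PostgreSQL's json type has no equality operator for set operations to use.

```rust
/// A column as it might be discovered from the catalog (hypothetical shape).
struct Column {
    name: &'static str,
    type_name: &'static str,
}

/// Build the select list: skip internal `__pgt_*` columns and cast json to
/// text so that set operations can compare the values.
fn select_list(columns: &[Column]) -> String {
    columns
        .iter()
        .filter(|c| !c.name.starts_with("__pgt_"))
        .map(|c| match c.type_name {
            "json" => format!("\"{0}\"::text AS \"{0}\"", c.name),
            _ => format!("\"{}\"", c.name),
        })
        .collect::<Vec<_>>()
        .join(", ")
}

/// SQL that counts rows present on one side but not the other. A result of
/// zero means the stream table and the query agree as multisets.
/// Sketch only: identifiers are not escaped beyond simple double-quoting.
fn mismatch_count_sql(st_name: &str, query: &str, columns: &[Column]) -> String {
    let cols = select_list(columns);
    format!(
        "SELECT count(*) FROM (\
         (SELECT {cols} FROM {st} EXCEPT ALL SELECT {cols} FROM ({q}) AS q) \
         UNION ALL \
         (SELECT {cols} FROM ({q}) AS q EXCEPT ALL SELECT {cols} FROM {st})\
         ) AS diff",
        cols = cols,
        st = st_name,
        q = query
    )
}

fn main() {
    // Hypothetical columns and names, for illustration only.
    let columns = [
        Column { name: "id", type_name: "int4" },
        Column { name: "payload", type_name: "json" },
        Column { name: "__pgt_row_id", type_name: "int8" },
    ];
    let sql = mismatch_count_sql(
        "public.order_totals",
        "SELECT id, payload FROM orders",
        &columns,
    );
    println!("{}", sql);
}
```

Checking both directions matters: the first EXCEPT ALL finds extra or stale rows in the stream table, and the second finds rows the refresh failed to produce.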

Why These Tests Need the Full Image

These 18 files test capabilities that require the compiled extension binary:
- Background worker / scheduler (bgworker, dag_autorefresh)
- GUC variables (guc_variation, bootstrap_gating)
- DDL event triggers (ddl_event)
- WAL-based CDC with logical replication (wal_cdc)
- Extension upgrade paths (upgrade)
- Row-level security interaction (rls)
- Partition ATTACH/DETACH triggers (partition)
- Circular dependency / SCC detection (circular)
- Append-only optimization (append_only)
- User-defined trigger interaction (user_trigger)
- CDC benchmarks (bench)

Per-File Analysis

1. `e2e_alter_query_tests.rs` — 578 lines, 15 tests

Purpose: Validates ALTER QUERY operations (changing a stream table’s defining query in-place).

Test	What It Validates	Assertion Quality
`test_alter_query_same_schema`	Same-schema query change with WHERE clause	✅ STRONG — `assert_st_matches_query`
`test_alter_query_same_schema_differential`	ALTER on DIFFERENTIAL mode ST	⚠️ Count only
`test_alter_query_add_column`	Adding a column to the query	⚠️ Spot-checks one value
`test_alter_query_remove_column`	Removing a column	⚠️ Column existence only
`test_alter_query_type_change_compatible`	INT → BIGINT type change	⚠️ Status + count
`test_alter_query_type_change_incompatible`	INT → TEXT triggers rebuild	⚠️ OID changed, count only
`test_alter_query_change_sources`	Change to different source tables	⚠️ Dependency count only
`test_alter_query_remove_source`	Remove a source dependency	⚠️ Dependency check
`test_alter_query_pgt_count_transition`	Flat → aggregate query transition	⚠️ Count only
`test_alter_query_with_mode_change`	Simultaneous query + mode change	⚠️ Status + count
`test_alter_query_invalid_query`	Invalid query rejected	✅ Error path
`test_alter_query_cycle_detection`	Cyclic deps rejected	✅ Error path
`test_alter_query_view_inlining`	Views inlined in catalog	⚠️ Catalog check
`test_alter_query_oid_stable_same_schema`	OID preserved for same-schema ALTER	✅ OID comparison
`test_alter_query_catalog_updated`	Catalog query updated	✅ Query text comparison

Verdict: ADEQUATE

Gaps:
- Only 1/15 tests uses multiset comparison
- After ALTER to aggregate/join queries, data correctness not verified
- No ALTER + DML cycle (INSERT → ALTER → refresh → verify)

2. `e2e_append_only_tests.rs` — 342 lines, 10 tests

Purpose: Validates the append-only optimization (INSERT-only fast path) and fallback to MERGE on UPDATE/DELETE.

Test	What It Validates	Assertion Quality
`test_append_only_basic_insert_path`	Flag set, row count correct	⚠️ Count only
`test_append_only_data_correctness`	Multi-cycle correctness	⚠️ SUM aggregate only
`test_append_only_fallback_on_delete`	DELETE triggers fallback to MERGE	⚠️ Flag check + count
`test_append_only_fallback_on_update`	UPDATE triggers fallback	⚠️ Spot-checks one value
`test_alter_enable_append_only`	ALTER to enable append_only	⚠️ Flag + count
`test_append_only_rejected_for_full_mode`	FULL mode rejects append_only	✅ Error validation
`test_append_only_rejected_for_immediate_mode`	IMMEDIATE mode rejects	✅ Error validation
`test_append_only_rejected_for_keyless_source`	Keyless table rejects	✅ Error validation
`test_alter_append_only_rejected_for_full_mode`	ALTER rejects on FULL	✅ Error validation
`test_append_only_no_data_cycle`	No-data cycle is idempotent	⚠️ Count only

Verdict: ADEQUATE

Key gap: Zero multiset comparisons. After fallback from append-only to MERGE, data correctness should be verified with assert_st_matches_query. Test 2 uses SUM for basic verification but can’t detect wrong individual rows.

3. `e2e_bench_tests.rs` — 2,156 lines, 32 tests (all `#[ignore]`)

Purpose: Performance benchmarks measuring refresh latency across query types (scan, filter, aggregate, join, window, lateral, CTE, UNION), sizes (10K–100K rows), and change rates (1%–50%).

All 32 tests are #[ignore]-gated and timer-based. They measure TPS, p50/p99 latency, and overhead percentages.

Test Category	Count	Assertion Type
Scan benchmarks	9	⚠️ Timing only
Filter/aggregate/join/window benchmarks	12	⚠️ Timing only
No-data refresh latency	1	⚠️ avg < 10ms target
Index overhead	1	⚠️ Overhead %
CDC trigger overhead	2	⚠️ Timing comparison
Statement vs row CDC	2	⚠️ Timing comparison
Concurrent writers	1	⚠️ Throughput
Full matrix sweeps	4	⚠️ Timing aggregation

Verdict: WEAK (by design — benchmarks, not correctness tests)

Gap: No data correctness assertions anywhere. Row counts are logged but never asserted. If a DVM bug causes incorrect results, benchmarks will still report normal timing.

Recommendation: Add a smoke-test assertion at the end of each benchmark variant: after the final cycle, call assert_st_matches_query once. This adds negligible overhead to the benchmark but catches correctness regressions.
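
The recommendation can be sketched as a small wrapper that keeps the correctness check outside every timed region. This is an illustrative shape, not code from e2e_bench_tests.rs: `bench_with_smoke_check` is a hypothetical helper, and the Vec stands in for a stream table. In a real benchmark the `refresh` closure would call the refresh wrapper and `verify` would call assert_st_matches_query.

```rust
use std::time::{Duration, Instant};

/// Run `cycles` timed refreshes, then run `verify` exactly once after the
/// last cycle. The check sits outside the timed region, so it cannot skew
/// the latency samples.
fn bench_with_smoke_check<S>(
    state: &mut S,
    cycles: usize,
    mut refresh: impl FnMut(&mut S, usize),
    verify: impl FnOnce(&S) -> Result<(), String>,
) -> Result<Vec<Duration>, String> {
    let mut samples = Vec::with_capacity(cycles);
    for cycle in 0..cycles {
        let start = Instant::now();
        refresh(state, cycle);
        samples.push(start.elapsed());
    }
    verify(state)?;
    Ok(samples)
}

fn main() {
    // Stand-in state: each "refresh" appends one row to the table.
    let mut table: Vec<u64> = Vec::new();
    let result = bench_with_smoke_check(
        &mut table,
        5,
        |t, cycle| t.push(cycle as u64),
        |t| {
            let expected: Vec<u64> = (0..5).collect();
            if *t == expected {
                Ok(())
            } else {
                Err(format!("mismatch: {:?}", t))
            }
        },
    );
    match result {
        Ok(samples) => println!("{} samples, final state verified", samples.len()),
        Err(e) => eprintln!("correctness regression: {}", e),
    }
}
```

Returning an error instead of the samples means a benchmark run with wrong data reports no timing at all, rather than plausible numbers for incorrect results.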

4. `e2e_bgworker_tests.rs` — 570 lines, 9 tests

Purpose: Validates the background worker / scheduler: extension loading, GUC registration, auto-refresh, differential mode, history records, catalog metadata updates.

Test	What It Validates	Assertion Quality
`test_extension_loads_with_shared_preload`	Extension present in pg_extension	✅ Setup validation
`test_gucs_registered`	8 GUC defaults correct	✅ 8 SHOW comparisons
`test_gucs_can_be_altered`	GUCs changeable via ALTER SYSTEM	✅ 5 ALTER + SHOW
`test_auto_refresh_within_schedule`	Scheduler fires within threshold	⚠️ Count only
`test_auto_refresh_differential_mode`	Differential auto-refresh correct	✅ STRONG — `assert_st_matches_query`
`test_scheduler_writes_refresh_history`	History records created	⚠️ History count
`test_auto_refresh_differential_with_cdc`	CDC + differential auto-refresh	✅ STRONG — `assert_st_matches_query`
`test_scheduler_refreshes_multiple_healthy_sts`	Multiple STs refreshed in one tick	⚠️ Count checks
`test_auto_refresh_updates_catalog_metadata`	Timestamps and error counts updated	⚠️ Metadata checks

Verdict: ADEQUATE

Strengths: Tests 5 and 7 use multiset comparison for real correctness. GUC validation thorough.

Gaps: Tests 4 and 8 (auto-refresh count, multiple STs) should use multiset.

5. `e2e_bootstrap_gating_tests.rs` — 637 lines, 18 tests

Purpose: Validates the bootstrap gating feature (source gates that block scheduler refreshes during initial data loads).

Test	What It Validates	Assertion Quality
`test_gate_source_inserts_gate_record`	Gate record created	⚠️ Metadata
`test_source_gates_returns_gated_source`	Function returns gated source	⚠️ Metadata
`test_ungate_source_clears_gate`	Ungate sets gated=false	⚠️ Metadata
`test_gate_source_is_idempotent`	Double-gate produces one record	⚠️ Count
`test_regate_after_ungate`	Re-gate after ungate works	⚠️ Metadata
`test_gate_source_nonexistent_table_errors`	Nonexistent table → error	✅ Error path
`test_source_gates_empty_by_default`	No gates initially	⚠️ Count
`test_multiple_sources_gated`	Multiple sources can be gated	⚠️ Count
`test_idempotent_gate_refreshes_timestamp`	Double-gate refreshes gated_at	⚠️ Timestamp
`test_idempotent_gate_preserves_state`	Double-gate preserves state	⚠️ Metadata
`test_regate_lifecycle_clears_ungated_at`	Re-gate clears ungated_at	⚠️ Metadata
`test_manual_refresh_works_through_full_lifecycle`	Manual refresh through gate cycle	⚠️ Count (1→2→3)
`test_bootstrap_gate_status_returns_expected_columns`	Status function columns	⚠️ Column check
`test_bootstrap_gate_status_ungated_duration`	Duration for ungated sources	⚠️ Metadata
`test_bootstrap_gate_status_affected_stream_tables`	Affected STs listed	⚠️ String contains
`test_bootstrap_gate_status_empty_by_default`	No gate status initially	⚠️ Count
`test_manual_refresh_not_blocked_by_gate`	Manual refresh bypasses gates	⚠️ Count
`test_scheduler_logs_skip_when_source_gated`	Scheduler SKIPs gated sources	✅ History action/status

Verdict: ADEQUATE

Gaps: Zero multiset comparisons. Tests 12 and 17 (manual refresh) should verify data content, not just count increments.

6. `e2e_cascade_regression_tests.rs` — 796 lines, 8 tests

Purpose: Regression tests for ST-on-ST cascade behavior: propagation of INSERT/UPDATE/DELETE through chained stream tables, zero-row refresh timestamp stability, and correct dependency type tracking.

Test	What It Validates	Assertion Quality
`test_cdc_triggers_not_counted_as_user_triggers`	CDC trigger exclusion in detection query	✅ Before/after logic
`test_st_on_st_cascade_propagates_insert`	INSERT cascades through ST chain	✅ Value comparison (300→450)
`test_st_on_st_cascade_propagates_delete`	DELETE cascades through ST chain	⚠️ EXISTS check only
`test_zero_row_differential_preserves_data_timestamp`	0-row refresh doesn’t bump timestamp	✅ STRONG — timestamp equality regression
`test_no_spurious_cascade_after_noop_upstream_refresh`	No-op upstream doesn’t cascade	✅ STRONG — timestamp stability
`test_three_layer_cascade_insert_propagates`	3-layer INSERT cascade	⚠️ Count only
`test_three_layer_cascade_update_propagates`	3-layer UPDATE cascade	✅ Category value comparison
`test_st_on_st_dependency_is_stream_table_type`	Dependency recorded as STREAM_TABLE	✅ Type string comparison

Verdict: ADEQUATE to STRONG

Strengths: Tests 2, 4, 5, 7 have genuine data validation (value comparisons, timestamp equality). Regression-focused.

Gaps:
- Zero use of assert_st_matches_query; tests do ad-hoc data checks
- Test 3 (DELETE cascade) only checks EXISTS, not full data
- Test 6 (3-layer INSERT) only checks count

7. `e2e_circular_tests.rs` — 562 lines, 6 tests

Purpose: Validates circular/cyclic stream table dependencies using SCC (strongly connected component) detection, monotonicity checks, convergence, and drop cleanup.

Test	What It Validates	Assertion Quality
`test_circular_monotone_cycle_converges`	Monotone cycle creation + SCC ID	⚠️ Metadata only
`test_circular_nonmonotone_cycle_rejected`	Non-monotone cycle rejected	✅ Error message
`test_circular_convergence_records_iterations`	Iteration count recorded	⚠️ iterations ≥ 1 (loose)
`test_circular_nonconvergence_error_status`	Max iterations → ERROR	⚠️ Status check (timing-sensitive)
`test_circular_drop_member_clears_scc_id`	Drop member clears SCC IDs	⚠️ Metadata
`test_circular_default_rejects_cycles`	allow_circular=false rejects	✅ Error message

Verdict: WEAK

Critical gap: Zero multiset comparisons. All 6 tests validate only metadata (scc_id, status, iteration count) — none verify that the cyclic stream tables actually contain correct data after convergence. A cycle that converges to the wrong fixed point would pass all tests.

8. `e2e_dag_autorefresh_tests.rs` — 449 lines, 5 tests

Purpose: Validates automatic scheduler-driven refresh through multi-layer DAG topologies.

Test	What It Validates	Assertion Quality
`test_autorefresh_3_layer_cascade`	3-layer cascade auto-refresh	✅ STRONG — `assert_st_matches_query` at all 3 layers
`test_autorefresh_diamond_cascade`	Diamond topology auto-refresh	✅ STRONG — multiset on L2
`test_autorefresh_calculated_schedule`	CALCULATED schedule triggers	✅ STRONG — multiset after L1 refresh
`test_autorefresh_no_spurious_3_layer`	No spurious cascades on no-op	✅ Timestamp stability
`test_autorefresh_staggered_schedules`	Staggered schedules converge	✅ STRONG — multiset at all 3 layers

Verdict: STRONG

Exemplary file. 4/5 tests use assert_st_matches_query for full multiset comparison at every layer of the DAG. Test 4 (no-spurious) appropriately uses timestamp stability rather than data comparison.

9. `e2e_ddl_event_tests.rs` — 608 lines, 14 tests

Purpose: Validates DDL event trigger reactions: what happens to stream tables when source tables are altered (ADD/DROP/ALTER column, RENAME, DROP table, function changes, index creation).

Test	What It Validates	Assertion Quality
`test_drop_source_fires_event_trigger`	DROP source → ST error/cleanup	⚠️ Status/count
`test_alter_source_fires_event_trigger`	ALTER source → ST remains	⚠️ Count only
`test_drop_st_storage_by_sql`	DROP storage → catalog cleanup	⚠️ Count only
`test_rename_source_table`	RENAME source → refresh fails	✅ Error path
`test_function_change_marks_st_for_reinit`	Function change → needs_reinit	⚠️ Flag check
`test_drop_function_marks_st_for_reinit`	DROP function → needs_reinit	⚠️ Flag check
`test_add_column_on_source_st_still_functional`	ADD column (unused) → ST OK	⚠️ Count only
`test_add_column_unused_st_survives_refresh`	ADD + UPDATE → ST refreshes	⚠️ Count + spot value
`test_drop_unused_column_st_survives`	DROP column (unused) → ST OK	⚠️ Status + count
`test_alter_column_type_triggers_reinit`	ALTER TYPE → needs_reinit	⚠️ Flag check
`test_create_index_on_source_is_benign`	CREATE INDEX → no reinit	⚠️ Flag + count
`test_drop_source_with_multiple_downstream_sts`	DROP with 2+ downstream STs	⚠️ Status checks
`test_block_source_ddl_guc_prevents_alter`	block_source_ddl=on blocks ALTER	✅ Error + DML works
`test_add_column_on_joined_source_st_survives`	ADD column on joined source	⚠️ Status + count

Verdict: WEAK

Critical gap: Zero multiset comparisons across all 14 tests. After DDL changes (ADD/DROP/ALTER column, function replacement), stream table data is never verified. Tests confirm metadata flags (needs_reinit, status) but not whether the data is correct after the DDL-triggered reinit/refresh.

10. `e2e_differential_gaps_tests.rs` — 526 lines, 13 tests

Purpose: Validates DVM differential refresh for features that previously had gaps: user-defined aggregates (UDAs) and nested OR with EXISTS sublinks.

Test	What It Validates	Assertion Quality
`test_uda_simple_differential`	UDA INSERT/DELETE/UPDATE cycles	✅ STRONG — multiset after each DML
`test_uda_combined_with_builtin`	UDA + COUNT/SUM together	✅ STRONG — multiset
`test_uda_auto_mode_resolves_to_differential`	AUTO mode resolves correctly	✅ STRONG — mode + multiset
`test_uda_multiple_in_same_query`	Multiple UDAs in one query	✅ STRONG — multiset
`test_nested_or_two_exists`	OR with 2 EXISTS sublinks	✅ STRONG — multiset after each DML
`test_nested_or_mixed_and_or_under_or`	OR(a OR (b AND EXISTS))	✅ STRONG — multiset
`test_nested_or_cdc_cycle`	Complex OR+EXISTS + full CDC cycle	✅ STRONG — multiset after I/U/D
`test_nested_or_demorgan_not_and`	De Morgan NOT(AND+sublink)	✅ STRONG — multiset after I/U/D
`test_nested_or_demorgan_and_prefix`	AND prefix + NOT(AND+sublink)	✅ STRONG — multiset
`test_uda_with_filter_clause`	UDA with FILTER(WHERE …)	✅ STRONG — multiset
`test_uda_with_order_by_in_agg`	UDA with ORDER BY in aggregate	✅ STRONG — multiset
`test_uda_schema_qualified`	Schema-qualified UDA	✅ STRONG — multiset
`test_uda_insert_delete_update_full_cycle`	Full lifecycle: I→U→D→revival	✅ STRONG — multiset after each of 6 ops

Verdict: STRONG — EXEMPLARY

All 13 tests use assert_st_matches_query for full multiset comparison. Full DML cycles (INSERT, UPDATE, DELETE) with verification at each step. This is the gold standard for the test suite.

11. `e2e_guc_variation_tests.rs` — 430 lines, 13 tests

Purpose: Validates that non-default GUC configurations produce correct results.

Test	What It Validates	Assertion Quality
`test_guc_prepared_statements_off`	prepared_statements=OFF	✅ STRONG — multiset
`test_guc_merge_planner_hints_off`	merge_planner_hints=OFF	✅ STRONG — multiset
`test_guc_cleanup_use_truncate_off`	cleanup_use_truncate=OFF	✅ STRONG — multiset
`test_guc_merge_work_mem_mb_custom`	merge_work_mem_mb=16	✅ STRONG — multiset
`test_guc_block_source_ddl_on`	block_source_ddl=ON prevents DDL	✅ STRONG — error + multiset
`test_guc_differential_max_change_ratio_zero`	max_change_ratio=0.0	✅ STRONG — mode + multiset
`test_guc_combined_non_default`	Multiple GUCs at once	✅ STRONG — multiset
`test_guc_max_grouping_set_branches_rejects_over_limit`	CUBE limit exceeded	✅ Error validation
`test_guc_max_grouping_set_branches_allows_within_limit`	CUBE within limit	⚠️ Creation only
`test_guc_max_grouping_set_branches_raised_allows_large_cube`	Raised CUBE limit	⚠️ Creation only
`test_guc_foreign_table_polling_off_rejects_differential`	Foreign table polling rejected	✅ Error validation
`test_guc_foreign_table_polling_full_mode_no_guc_needed`	Foreign table FULL mode	⚠️ Creation only
`test_guc_foreign_table_polling_on_allows_differential`	Foreign table polling enabled	✅ STRONG — multiset after I/D

Verdict: STRONG

8/13 tests use multiset comparison. The 5 without it are boundary/error tests where creation success/failure is the primary assertion. Minor gap: CUBE limit tests only verify creation, not query result correctness.

12. `e2e_multi_cycle_tests.rs` — 534 lines, 9 tests

Purpose: Validates cumulative correctness across multiple refresh cycles with different DML operations and cache behaviors.

Test	What It Validates	Assertion Quality
`test_multi_cycle_aggregate_differential`	5 cycles: I→U→D→mixed→no-op	✅ STRONG — multiset after each
`test_multi_cycle_join_differential`	4 JOIN cycles with left/right DML	✅ STRONG — multiset after each
`test_multi_cycle_window_differential`	5 INSERT + 2 DELETE cycles	✅ STRONG — multiset after each
`test_multi_cycle_prepared_statement_cache`	7 cycles, cache survives	✅ STRONG — multiset after each
`test_prepared_statements_cleared_after_cache_invalidation`	Cache invalidated on ALTER	⚠️ Scalar total + cache count
`test_multi_cycle_group_elimination_revival`	Group elimination + revival	✅ STRONG — multiset after each
`test_ec16_function_body_change_marks_reinit`	Function change → reinit + correct data	✅ Explicit sum validation (60→70→108)
`test_ec16_function_change_full_refresh_recovery`	Function change recovery	✅ Explicit sum validation (215→836)
`test_ec16_no_functions_unaffected`	Unchanged STs unaffected	⚠️ Flag + count

Verdict: STRONG

6/9 tests use multiset comparison with multi-step DML cycles. The EC-16 tests use explicit sum validation which is adequate for verifying new function logic is applied.

13. `e2e_partition_tests.rs` — 554 lines, 9 tests

Purpose: Validates stream tables built on partitioned source tables (RANGE, LIST, HASH) and on foreign tables via postgres_fdw.

Test	What It Validates	Assertion Quality
`test_partition_range_full_refresh`	RANGE partition + FULL	⚠️ Count only
`test_partition_range_differential_refresh`	RANGE + INSERT/UPDATE/DELETE cycle	⚠️ Count checks
`test_partition_list_source`	LIST partition	⚠️ Count only
`test_partition_hash_source`	HASH partition	⚠️ Count only
`test_partition_attach_triggers_reinit`	ATTACH → needs_reinit	⚠️ Flag + count
`test_partition_detach_triggers_reinit`	DETACH → needs_reinit	⚠️ Flag + count
`test_foreign_table_full_refresh_works`	Foreign table via postgres_fdw	⚠️ Count only
`test_partition_with_aggregation`	Partitioned + GROUP BY	⚠️ Scalar sum
`test_partition_differential_with_aggregation`	Partitioned + GROUP BY + INSERT	⚠️ Scalar sum

Verdict: WEAK

Zero multiset comparisons. All 9 tests rely on db.count() or scalar aggregate checks. Test 2 has a full INSERT/UPDATE/DELETE cycle but never verifies the actual row content.

14. `e2e_phase4_ergonomics_tests.rs` — 577 lines, 20 tests

Purpose: Validates API ergonomics: manual refresh history, quick_health view, create_if_not_exists(), schedule defaults, removed GUCs, ALTER warnings.

Test Group	Count	What It Validates	Assertion Quality
ERG-D (refresh history)	3	`initiated_by='MANUAL'`, status/end_time	⚠️ Metadata
ERG-E (quick_health)	3	View returns correct status	⚠️ Metadata
COR-2 (create_if_not_exists)	3	Idempotent creation	⚠️ Count/status
ERG-T1 (schedule defaults)	5	‘calculated’ default, NULL rejection	✅ Error + metadata
ERG-T2 (removed GUCs)	2	Old GUCs properly missing	✅ Error validation
ERG-T3 (ALTER warnings)	4	Warnings emitted on mode/query changes	⚠️ Notice text

Verdict: ADEQUATE (by design — API contract tests, not data tests)

These tests are appropriately metadata-focused. They test the API surface, not data correctness. No multiset comparison needed.

15. `e2e_rls_tests.rs` — 453 lines, 9 tests

Purpose: Validates Row-Level Security interaction with stream tables: RLS on source, RLS on ST, change buffer security, trigger SECURITY DEFINER, and DDL event detection for RLS changes.

Test	What It Validates	Assertion Quality
`test_rls_on_source_does_not_filter_stream_table`	RLS on source → ST sees all rows	⚠️ Count only
`test_rls_on_source_differential_mode`	RLS + DIFFERENTIAL + INSERT cycle	⚠️ Count only
`test_rls_on_stream_table_filters_reads`	RLS policy on ST (superuser)	⚠️ Count only
`test_rls_on_stream_table_immediate_mode`	IMMEDIATE + RLS on ST	⚠️ Count only
`test_change_buffer_rls_disabled`	relrowsecurity=false on buffer	⚠️ Boolean check
`test_ivm_trigger_functions_security_definer`	Triggers are SECURITY DEFINER	⚠️ Boolean + search_path
`test_enable_rls_on_source_triggers_reinit`	ENABLE RLS → needs_reinit	⚠️ Flag check
`test_disable_rls_on_source_triggers_reinit`	DISABLE RLS → needs_reinit	⚠️ Flag check
`test_force_rls_on_source_triggers_reinit`	FORCE RLS → needs_reinit	⚠️ Flag check

Verdict: WEAK

Zero multiset comparisons. All tests use count or flag assertions.

Significant gap: Test 3 (test_rls_on_stream_table_filters_reads) claims to test RLS filtering but runs as superuser, who bypasses RLS by default. The test should query as a restricted role to verify that RLS actually filters rows.

16. `e2e_upgrade_tests.rs` — 871 lines, 14 tests (7 active, 7 `#[ignore]`)

Purpose: Validates extension upgrade paths: schema stability, round-trip (DROP + CREATE), version consistency, and upgrade chain survival.

Test	What It Validates	Assertion Quality
`test_upgrade_catalog_schema_stability`	31 expected columns present	✅ STRONG — column list
`test_upgrade_catalog_indexes_present`	Expected indexes exist	⚠️ EXISTS checks
`test_upgrade_drop_recreate_roundtrip`	DROP CASCADE + CREATE round-trip	✅ STRONG — `assert_st_matches_query`
`test_upgrade_extension_version_consistency`	Version matches	✅ String comparison
`test_upgrade_dependencies_schema_stability`	Dependencies schema stable	⚠️ Column list
`test_upgrade_event_triggers_installed`	Event triggers exist	⚠️ EXISTS
`test_upgrade_monitoring_views_present`	Views queryable	⚠️ Queryability
`test_upgrade_chain_new_functions_exist`	(#[ignore]) Functions callable	⚠️ Existence
`test_upgrade_chain_stream_tables_survive`	(#[ignore]) STs survive upgrade	⚠️ Count only
`test_upgrade_chain_views_queryable`	(#[ignore]) Views work post-upgrade	⚠️ Queryability
`test_upgrade_chain_event_triggers_present`	(#[ignore]) Triggers exist	⚠️ EXISTS
`test_upgrade_chain_version_consistency`	(#[ignore]) Version correct	⚠️ String
`test_upgrade_chain_function_parity_with_fresh_install`	(#[ignore]) Function count matches	⚠️ Count
`test_upgrade_schema_additions_from_sql`	All SQL scripts parsed + verified	✅ STRONG — regex-based

Verdict: ADEQUATE

Strength: Test 3 (round-trip) uses assert_st_matches_query. Test 14 (SQL script verification) is comprehensive.

Gap: The 7 #[ignore] upgrade chain tests only use count/existence — none verify data correctness post-upgrade.

17. `e2e_user_trigger_tests.rs` — 649 lines, 11 tests

Purpose: Validates user-defined trigger interaction with stream table refresh: audit triggers, GUC control, BEFORE trigger modification, and MERGE vs explicit DML path selection.

Test	What It Validates	Assertion Quality
`test_explicit_dml_insert`	Audit on INSERT: NEW captured	⚠️ Audit field-level
`test_explicit_dml_update`	Audit on UPDATE: OLD/NEW captured	⚠️ Audit field-level
`test_explicit_dml_delete`	Audit on DELETE: OLD captured	⚠️ Audit field-level
`test_explicit_dml_no_op_skip`	IS DISTINCT FROM prevents no-op trigger	⚠️ Count check
`test_no_trigger_uses_merge`	No triggers → MERGE path + correct data	✅ STRONG — `assert_st_matches_query`
`test_trigger_audit_trail`	Mixed I/U/D + audit + data correctness	✅ STRONG — multiset + audit counts
`test_guc_off_suppresses_triggers`	GUC ‘off’ → audit empty	⚠️ Audit emptiness
`test_guc_auto_detects_triggers`	GUC ‘auto’ → triggers fire	⚠️ Audit count
`test_guc_on_alias_detects_triggers`	Deprecated ‘on’ alias works	⚠️ Audit count
`test_full_refresh_suppresses_triggers`	FULL refresh → no row triggers	⚠️ Audit emptiness
`test_before_trigger_modifies_new`	BEFORE trigger modifies NEW value	⚠️ Scalar value

Verdict: ADEQUATE to STRONG

Tests 5 and 6 use multiset comparison — test 6 is especially good, combining audit trail validation with data correctness.

18. `e2e_wal_cdc_tests.rs` — 729 lines, 17 tests

Purpose: Validates WAL-based CDC (logical replication): mode transitions, INSERT/UPDATE/DELETE capture, fallback to triggers, cleanup on DROP, keyless table handling, and health checks.

Test	What It Validates	Assertion Quality
`test_wal_auto_is_default_cdc_mode`	Default GUC = ‘auto’	⚠️ String
`test_wal_level_is_logical`	Container has wal_level=logical	⚠️ String
`test_explicit_wal_override_transitions_even_with_global_trigger`	Force WAL despite trigger GUC	⚠️ Mode check
`test_explicit_trigger_override_blocks_wal_transition`	Force TRIGGER prevents WAL	⚠️ Mode check
`test_wal_transition_lifecycle`	TRIGGER→TRANSITIONING→WAL + slot/pub	⚠️ Mode + infrastructure
`test_wal_cdc_captures_insert`	INSERT captured via WAL	⚠️ Count only
`test_wal_cdc_captures_update`	UPDATE captured via WAL	⚠️ Count + scalar
`test_wal_cdc_captures_delete`	DELETE captured via WAL	⚠️ Count only
`test_trigger_mode_no_wal_transition`	cdc_mode=‘trigger’ stays trigger	⚠️ Mode check
`test_wal_fallback_on_missing_slot`	Slot dropped → fallback + data survives	⚠️ Mode + count
`test_wal_cleanup_on_drop`	DROP ST → slot + pub cleaned	⚠️ Infrastructure
`test_wal_keyless_table_stays_on_triggers`	Keyless → stays trigger	⚠️ Mode check
`test_ec18_check_cdc_health_shows_trigger_for_stuck_auto`	EC-18: keyless auto → TRIGGER	⚠️ Health check
`test_ec18_health_check_ok_with_trigger_auto_sources`	EC-18: no errors for trigger auto	⚠️ Count
`test_ec34_check_cdc_health_detects_missing_slot`	EC-34: missing slot alert + fallback	⚠️ Alert + mode + count
`test_ec19_wal_keyless_without_replica_identity_full_rejected`	Keyless + no RIF rejected	✅ Error validation
`test_ec19_wal_keyless_with_replica_identity_full_accepted`	Keyless + RIF accepted	⚠️ Mode check

Verdict: ADEQUATE for CDC mode transitions, WEAK for WAL data correctness

Critical gap: Zero multiset comparisons. Tests 6–8 (INSERT/UPDATE/DELETE via WAL CDC) only verify count or scalar values — they never verify the actual captured data matches the source. A WAL decoding bug that produces wrong column values would pass all tests.

Cross-Cutting Findings

Finding 1: Multiset Comparison Usage is Bimodal

The suite splits sharply into two camps:

Files with strong multiset coverage (≥60%):
- e2e_differential_gaps_tests: 13/13 (100%)
- e2e_dag_autorefresh_tests: 4/5 (80%)
- e2e_multi_cycle_tests: 6/9 (67%)
- e2e_guc_variation_tests: 8/13 (62%)

Files with weak/no multiset coverage (≤22%):
- e2e_ddl_event_tests: 0/14 (0%)
- e2e_circular_tests: 0/6 (0%)
- e2e_partition_tests: 0/9 (0%)
- e2e_rls_tests: 0/9 (0%)
- e2e_wal_cdc_tests: 0/17 (0%)
- e2e_append_only_tests: 0/10 (0%)
- e2e_bootstrap_gating_tests: 0/18 (0%)
- e2e_bench_tests: 0/32 (0%)
- e2e_cascade_regression_tests: 0/8 (0%), though it uses ad-hoc value checks
- e2e_bgworker_tests: 2/9 (22%)

This suggests the multiset pattern was adopted partway through development. Files written earlier or focused on infrastructure tend to lack it.
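The percentages above are plain ratios of multiset-asserting tests to total tests per file. The following sketch reproduces that arithmetic and the two-camp split. `multiset_pct`, `Camp`, and `camp` are hypothetical names used only for illustration; the 60% and 22% thresholds are the ones quoted in this finding.

```rust
/// Share of tests in a file that use a multiset comparison, rounded to the
/// nearest whole percent. Returns None for a file with no tests.
fn multiset_pct(multiset_tests: u32, total_tests: u32) -> Option<u32> {
    if total_tests == 0 {
        return None;
    }
    // Integer rounding: add half the divisor before dividing.
    Some((multiset_tests * 100 + total_tests / 2) / total_tests)
}

/// The two camps described in Finding 1, plus a bucket for anything between.
#[derive(Debug, PartialEq)]
enum Camp {
    Strong,
    Weak,
    Between,
}

/// Classifies a percentage using the thresholds quoted in the text.
fn camp(pct: u32) -> Camp {
    if pct >= 60 {
        Camp::Strong
    } else if pct <= 22 {
        Camp::Weak
    } else {
        Camp::Between
    }
}
```

With the figures listed above, no file lands in `Camp::Between`, which is what makes the distribution bimodal.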

Finding 2: Count-Only Tests Create False Confidence

62 tests use db.count() as their primary data assertion. This catches:
- ✅ Missing rows (count too low)
- ✅ Duplicate rows (count too high)

But cannot catch:
- ❌ Wrong column values
- ❌ Wrong row composition (right count, wrong data)
- ❌ NULL corruption
- ❌ Type coercion bugs

For example, a partition test that verifies count = 3 would pass even if all three rows have incorrect values derived from the wrong partition.
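The gap can be shown without a database. The sketch below compares a count-only check with an in-memory multiset check on the same data. `Row`, `row_count_matches`, `multiset_matches`, and `sample_rows` are hypothetical names for illustration, and nothing here describes how `assert_st_matches_query` itself is implemented.

```rust
use std::collections::HashMap;

/// Hypothetical row shape: (id, val). `None` models SQL NULL.
type Row = (i64, Option<String>);

/// Count-only check: true whenever both sides have the same number of rows.
fn row_count_matches(actual: &[Row], expected: &[Row]) -> bool {
    actual.len() == expected.len()
}

/// Multiset check: every distinct row must occur the same number of times
/// on both sides, regardless of order.
fn multiset_matches(actual: &[Row], expected: &[Row]) -> bool {
    fn tally(rows: &[Row]) -> HashMap<&Row, usize> {
        let mut counts = HashMap::new();
        for row in rows {
            *counts.entry(row).or_insert(0) += 1;
        }
        counts
    }
    tally(actual) == tally(expected)
}

/// Three expected rows, and a corrupted result with the same row count:
/// one wrong value and one value replaced by NULL.
fn sample_rows() -> (Vec<Row>, Vec<Row>) {
    let expected = vec![
        (1, Some("a".to_string())),
        (2, Some("b".to_string())),
        (3, Some("c".to_string())),
    ];
    let corrupted = vec![
        (1, Some("a".to_string())),
        (2, Some("WRONG".to_string())),
        (3, None),
    ];
    (expected, corrupted)
}
```

On the sample data, `row_count_matches` accepts the corrupted result while `multiset_matches` rejects it, which is the false-confidence case this finding describes.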

Finding 3: WAL CDC Data Path is Unvalidated

The 17 WAL CDC tests thoroughly validate mode transitions (TRIGGER → WAL), infrastructure (slots, publications), and fallback behavior. But the actual data path — whether WAL-decoded INSERTs/UPDATEs/DELETEs produce correct stream table content — is verified with counts only.

This is a significant blind spot because WAL decoding involves complex binary parsing of the replication stream, and a subtle bug could produce wrong values that pass all count assertions.

Finding 4: DDL Event Tests Missing Post-Reinit Validation

When a DDL change (ALTER COLUMN TYPE, function replacement, RLS change) marks a stream table as needs_reinit, the tests verify:
- ✅ The needs_reinit flag is set
- ⚠️ The reinit can execute (sometimes)
- ❌ The data after reinit is correct (never)

This means the DDL detection works, but whether the recovery path produces correct data is untested at the full E2E level.

Finding 5: RLS Test Has a Superuser Bypass Flaw

test_rls_on_stream_table_filters_reads intends to verify that RLS filters rows when querying a stream table. However, it appears to run its queries as a superuser, and superusers always bypass RLS (table owners also bypass it unless FORCE ROW LEVEL SECURITY is set). The test should:
1. Create a restricted role
2. Enable RLS on the stream table
3. Query as the restricted role
4. Verify filtered results

Finding 6: Benchmark Tests as Silent Correctness Regression Vector

The 32 benchmark tests (#[ignore]) exercise all major query types (scan, filter, aggregate, join, window, lateral, CTE, UNION) with real DML cycles and multi-cycle refreshes. Yet none assert data correctness. These tests are actually exercising the most complex code paths in the DVM engine — adding a single assert_st_matches_query call at the end of each benchmark would be extremely high-value with negligible performance impact.

Priority Mitigations

P0 — Critical (Data Integrity Gaps)

P0-1: Add Multiset Comparison to WAL CDC Data Tests

Tests 6–8 (captures_insert, captures_update, captures_delete) should verify data correctness after WAL-captured changes:

// Current (WEAK):
let count: i64 = db.count("wal_st").await;
assert_eq!(count, 3);

// Proposed (STRONG):
db.assert_st_matches_query("wal_st", "SELECT id, val FROM wal_source").await;

Also add multiset to test 10 (fallback) and test 15 (EC-34 missing slot).

Impact: 5 tests converted from weak to strong. Validates the entire WAL decoding → change buffer → differential refresh pipeline.

P0-2: Add Multiset to Partition Tests

All non-foreign-table tests should use assert_st_matches_query:

// For each partition type (RANGE, LIST, HASH):
db.assert_st_matches_query("part_st", "SELECT id, val FROM part_source").await;

// For aggregation tests:
db.assert_st_matches_query("part_agg_st",
    "SELECT region, SUM(amount) FROM part_sales GROUP BY region"
).await;

Impact: 7 tests converted. Validates partition pruning doesn’t corrupt results.

P0-3: Add Multiset to DDL Event Post-Reinit Tests

After setting needs_reinit and triggering reinit, verify data:

// After function change + reinit:
db.refresh_st("fn_st").await; // triggers reinit
db.assert_st_matches_query("fn_st", "SELECT id, my_func(val) FROM source").await;

// After ALTER COLUMN TYPE + reinit:
db.refresh_st("col_st").await;
db.assert_st_matches_query("col_st", "SELECT id, val::new_type FROM source").await;

Impact: 4–6 tests improved. Validates that DDL recovery produces correct data.

P0-4: Add Data Verification to Circular ST Tests

After cycle convergence, verify actual data content:

db.assert_st_matches_query("cyc_a",
    "SELECT DISTINCT src, dst FROM expected_transitive_closure"
).await;

Impact: 2 tests improved. Validates convergence correctness, not just convergence detection.
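The expected rows for such a check have to come from somewhere other than the cycle under test. One option is to compute the closure independently in the test and load it into the comparison table. The sketch below does this in memory; `Edge` and `transitive_closure` are hypothetical names, and the `(u32, u32)` edge shape is an assumption about the test data.

```rust
use std::collections::BTreeSet;

/// Hypothetical edge shape: (src, dst).
type Edge = (u32, u32);

/// Computes the transitive closure of `edges` by repeating a join step
/// until a round adds no new pair.
fn transitive_closure(edges: &[Edge]) -> BTreeSet<Edge> {
    let mut closure: BTreeSet<Edge> = edges.iter().copied().collect();
    loop {
        // Join every known path (a, b) with every base edge (b, d).
        let mut new_pairs = Vec::new();
        for &(a, b) in &closure {
            for &(c, d) in edges {
                if b == c && !closure.contains(&(a, d)) {
                    new_pairs.push((a, d));
                }
            }
        }
        if new_pairs.is_empty() {
            return closure; // fixpoint reached
        }
        closure.extend(new_pairs);
    }
}
```

Because the result is a set, it matches the `SELECT DISTINCT` form of the expected query above.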

P1 — High (Coverage Hardening)

P1-1: Fix RLS Superuser Bypass in Test

Add a restricted role and query as that role:

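// Assumes RLS is already enabled on rls_st with a policy that applies to rls_reader.
db.execute("CREATE ROLE rls_reader").await;
db.execute("GRANT SELECT ON rls_st TO rls_reader").await;
// SET ROLE is per-session: the count below must run on the same connection.
db.execute("SET ROLE rls_reader").await;
let count: i64 = db.count("rls_st").await;
assert_eq!(count, expected_filtered_count);
db.execute("RESET ROLE").await;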

Impact: Validates actual RLS filtering, not just that RLS is enabled.

P1-2: Add Multiset to Append-Only Fallback Tests

After fallback from append-only to MERGE:

db.assert_st_matches_query("ao_st", "SELECT id, val FROM ao_source").await;

Impact: 3 tests improved. Validates fallback produces correct data.

P1-3: Add Multiset to Cascade Regression Tests

Tests 3 and 6 (DELETE cascade, 3-layer INSERT) should use multiset:

// 3-layer cascade:
db.assert_st_matches_query("l3_st",
    "SELECT id, val * 2 + 10 FROM base_source"
).await;

Impact: 2 tests improved.

P1-4: Add Multiset to Bootstrap Gating Refresh Tests

Tests 12 and 17 (manual refresh through gate lifecycle):

db.assert_st_matches_query("gated_st", "SELECT id, val FROM gated_source").await;

Impact: 2 tests improved.

P2 — Medium (Completeness)

P2-1: Add Smoke Correctness Check to Benchmarks

At the end of each benchmark variant, add one assert_st_matches_query:

// After final benchmark cycle:
db.assert_st_matches_query(&st_name, &defining_query).await;

This adds ~50ms per benchmark but catches DVM correctness regressions during performance testing.

Impact: 32 tests gain correctness assertion. Extremely high value.
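If the added time should not appear in the reported numbers, one option is to stop the clock before the check runs. The sketch below shows the shape of that pattern with synchronous closures. `bench_with_smoke_check` is a hypothetical helper, not part of the suite, and the real benchmarks are async, so this illustrates only the ordering.

```rust
use std::time::{Duration, Instant};

/// Runs `work` under a timer, then runs `verify` after the clock has stopped,
/// so the correctness check does not count toward the reported duration.
fn bench_with_smoke_check<W, V>(work: W, verify: V) -> Result<Duration, String>
where
    W: FnOnce(),
    V: FnOnce() -> Result<(), String>,
{
    let start = Instant::now();
    work(); // timed region: DML cycles and refreshes
    let elapsed = start.elapsed();
    verify()?; // untimed: correctness check after the final cycle
    Ok(elapsed)
}
```

A failed check returns `Err` instead of a duration, so a benchmark run cannot report a timing for an incorrect result.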

P2-2: Add ALTER QUERY + DML Cycle Tests

e2e_alter_query_tests needs tests that:
1. Create ST, populate with data
2. ALTER QUERY to join/aggregate
3. Refresh
4. Verify with assert_st_matches_query

Currently, ALTER tests verify schema changes succeed but not data correctness for complex query transformations.

P2-3: Add Upgrade Chain Data Validation

The 7 #[ignore] upgrade chain tests should add assert_st_matches_query after verifying STs survive the upgrade:

// After upgrade:
db.assert_st_matches_query("pre_upgrade_st",
    "SELECT id, val FROM pre_upgrade_source"
).await;

P2-4: Add Non-Convergence Test with Guaranteed Divergence

test_circular_nonconvergence_error_status should use DML that guarantees divergence (e.g., monotonically increasing counts) rather than relying on timing.
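The difference between a step that saturates and one that always grows can be modelled in a few lines. The sketch below is a toy model, not the extension's scheduler: `Outcome`, `run_until_fixpoint`, `bounded_step`, and `divergent_step` are hypothetical names, and the state is a plain set of integers standing in for stream table rows.

```rust
use std::collections::BTreeSet;

/// Outcome of a bounded fixpoint run.
#[derive(Debug, PartialEq)]
enum Outcome {
    Converged { rounds: usize },
    HitLimit { rounds: usize },
}

/// Applies `step` until the state stops changing or `max_rounds` is reached.
fn run_until_fixpoint<F>(mut state: BTreeSet<u64>, step: F, max_rounds: usize) -> Outcome
where
    F: Fn(&BTreeSet<u64>) -> BTreeSet<u64>,
{
    for round in 1..=max_rounds {
        let next = step(&state);
        if next == state {
            return Outcome::Converged { rounds: round };
        }
        state = next;
    }
    Outcome::HitLimit { rounds: max_rounds }
}

/// Saturating step: adds n + 1 for every n below `cap`, so it reaches a fixpoint.
fn bounded_step(cap: u64) -> impl Fn(&BTreeSet<u64>) -> BTreeSet<u64> {
    move |s: &BTreeSet<u64>| {
        let mut next = s.clone();
        next.extend(s.iter().filter(|&&n| n < cap).map(|&n| n + 1));
        next
    }
}

/// Divergent step: always adds max + 1, so the state grows in every round.
fn divergent_step(s: &BTreeSet<u64>) -> BTreeSet<u64> {
    let mut next = s.clone();
    let max = s.iter().next_back().copied().unwrap_or(0);
    next.insert(max + 1);
    next
}
```

The divergent step hits the round limit for any limit, independent of how fast each round runs. That is the property the test needs from its DML.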

P3 — Low (Polish)

P3-1: Consolidate Cascade Value Checks to Multiset

e2e_cascade_regression_tests uses ad-hoc value comparisons (amount “450”, categories [“X”, “Y”]). Replace with assert_st_matches_query for consistency with the rest of the suite.

P3-2: Add DELETE/UPDATE to Bootstrap Gating Tests

Current gating tests only INSERT. Add UPDATE and DELETE during the gate → ungate → re-gate lifecycle.

P3-3: Standardize bgworker Test Assertions

Tests 4 and 8 (auto-refresh within schedule, multiple STs) use count only. Add multiset comparison for consistency.

Appendix: Coverage Matrix

Full E2E Files: Summary Table

File	Lines	Tests	Multiset Calls	Multiset %	DML Cycle?	Verdict
`e2e_differential_gaps_tests`	526	13	39	100%	✅ Full I/U/D	STRONG
`e2e_dag_autorefresh_tests`	449	5	8	80%	✅ Insert cycle	STRONG
`e2e_multi_cycle_tests`	534	9	21	67%	✅ Full I/U/D	STRONG
`e2e_guc_variation_tests`	430	13	10	62%	✅ Insert/delete	STRONG
`e2e_cascade_regression_tests`	796	8	0	0%*	✅ I/U/D	ADEQUATE
`e2e_bgworker_tests`	570	9	2	22%	✅ Insert	ADEQUATE
`e2e_user_trigger_tests`	649	11	2	18%	✅ Full I/U/D	ADEQUATE
`e2e_alter_query_tests`	578	15	1	7%	⚠️ Limited	ADEQUATE
`e2e_upgrade_tests`	871	14	1	7%	⚠️ Round-trip	ADEQUATE
`e2e_bootstrap_gating_tests`	637	18	0	0%	⚠️ Insert only	ADEQUATE
`e2e_phase4_ergonomics_tests`	577	20	0	N/A	❌ Metadata	ADEQUATE
`e2e_append_only_tests`	342	10	0	0%	⚠️ Insert + fallback	ADEQUATE
`e2e_ddl_event_tests`	608	14	0	0%	⚠️ DDL only	WEAK
`e2e_wal_cdc_tests`	729	17	0	0%	⚠️ Single DML	WEAK
`e2e_partition_tests`	554	9	0	0%	⚠️ Limited I/U/D	WEAK
`e2e_circular_tests`	562	6	0	0%	❌ No DML verify	WEAK
`e2e_rls_tests`	453	9	0	0%	⚠️ Insert only	WEAK
`e2e_bench_tests`	2,156	32	0	0%	✅ Multi-cycle	WEAK
TOTAL	~11,021	222	84	17%	—	—

* e2e_cascade_regression_tests uses ad-hoc value checks instead of assert_st_matches_query.

Assertion Type Distribution

Assertion Type	Test Count	%
`assert_st_matches_query` (multiset)	37	17%
Explicit value comparison	12	5%
Error path validation	22	10%
Metadata / flag / status	68	31%
Count only (`db.count()`)	62	28%
Timing / benchmark	32	14%
Total	222	—

Feature Coverage by Test File

Feature	Test File(s)	Coverage Level
Differential refresh (core)	differential_gaps, multi_cycle	✅ Strong
DAG cascade + autorefresh	dag_autorefresh	✅ Strong
GUC configurability	guc_variation	✅ Strong
ALTER QUERY operations	alter_query	⚠️ Adequate
Background worker / scheduler	bgworker	⚠️ Adequate
Bootstrap gating	bootstrap_gating	⚠️ Adequate
User-defined triggers	user_trigger	⚠️ Adequate
Extension upgrade paths	upgrade	⚠️ Adequate
ST-on-ST cascades	cascade_regression	⚠️ Adequate
Append-only optimization	append_only	⚠️ Adequate
API ergonomics	phase4_ergonomics	⚠️ Adequate (metadata)
WAL-based CDC	wal_cdc	❌ Weak (data path)
Partitioned tables	partition	❌ Weak
DDL event reactions	ddl_event	❌ Weak (post-reinit)
Circular dependencies	circular	❌ Weak
Row-Level Security	rls	❌ Weak
Performance benchmarks	bench	❌ Weak (no correctness)
