Contents
layout: default title: Declarative Partitioning
nav_order: 15
Spec: Declarative Partitioning Support
Status: implemented first pass
Risk tier: CAUTION
Primary goal: make sorted_heap operationally useful for very large logical
tables by supporting partition-scoped maintenance and observability.
Problem
sorted_heap currently works on concrete physical relations. That is a good
fit for one table at a time, but huge logical tables are commonly managed as
PostgreSQL declarative partitioned tables.
Today, the extension does not expose a first-class contract for:
- compacting all sorted_heap leaves under a partitioned parent;
- merging all sorted_heap leaves under a partitioned parent;
- rebuilding zone maps across a partition tree;
- inspecting per-leaf sorted_heap state from the parent;
- explaining how
SortedHeapScanshould behave after PostgreSQL partition pruning.
Fresh probe on PostgreSQL 18 showed the current split:
CREATE TABLE leaf PARTITION OF parent ... USING sorted_heapworks.sorted_heap_compact(leaf)works.- A direct query against the leaf can use
Custom Scan (SortedHeapScan). - Initial probe: the same predicate through the partitioned parent did not use
SortedHeapScan; it planned a leafSeq Scanwhen index/bitmap scans were disabled, or a disabled PKIndex Scanwhenenable_seqscan = off. - First-pass fix: the planner hook now accepts PostgreSQL partition child
member relations (
RELOPT_OTHER_MEMBER_REL) in addition to direct base relations. RegressionSH23-3verifies that a parent query reachesSortedHeapScanon the pruned sorted_heap leaf.
This makes the huge-table story incomplete: full-table rewrite compaction is operationally expensive, while partition-scoped compaction is the natural answer.
Non-Goals
The first partitioning pass will not:
- implement a global in-place compactor;
- compact multiple leaves atomically as one logical operation;
- transparently dispatch arbitrary filtered
sorted_hnswKNN through normalWHEREclauses; - automatically dispatch GraphRAG across a partitioned parent;
- introduce a global cross-partition HNSW index;
- silently support foreign partitions.
Those can be separate specs after the storage maintenance contract is stable.
Definitions
- Parent: a PostgreSQL declarative partitioned table.
- Leaf: a concrete relation under the partition tree that stores tuples.
- Supported leaf: a leaf whose table access method is
sorted_heapand whose relation has a primary key. - Unsupported leaf: a leaf with another table AM, a foreign table, or a relation kind that does not have sorted_heap storage metadata. A sorted_heap leaf without a primary key is also unsupported for maintenance helpers, because concrete compact/merge/rebuild need PK metadata.
Proposed User Contract
Leaf tables
Concrete partition leaves may use sorted_heap as their table access method.
Each leaf owns its own:
- block-0 meta page;
- zone map;
- sorted-prefix state;
- compact/merge lifecycle;
sorted_hnswindexes.
PostgreSQL’s normal partition pruning chooses which leaves participate in a
query. Once a sorted_heap leaf is selected as a base relation, the existing
SortedHeapScan planner hook should add a zone-map-pruned path for that leaf.
Current verified coverage:
- Direct leaf queries use
SortedHeapScan. - Covered parent query shape now uses
SortedHeapScanafter partition pruning. - Pure parent-level
ORDER BY embedding <=> query LIMIT kcan use PostgreSQL’s normalMerge Appendover leafsorted_hnswordered index scans. - Explicit
sorted_hnsw_partition_search(...)supports route-first vector search over all leaves or a selected leaf set, with global exact rerank over local candidate pools. It now executes the per-leaf search through thesorted_hnswIndex AM in C, requires a valid leaf-localsorted_hnswindex on every selected sorted_heap leaf, verifies that explicit selected leaves belong to the requested parent, and fails closed if that contract is not met. - Covered parent query shapes now include equality, range,
IN, literalANY(array), and generic prepared runtime-bound predicates. - Arbitrary transparent parent-dispatched ANN/GraphRAG planner support is not part of the first-pass contract.
Parent maintenance
Parent-level maintenance should be explicit and fail-closed by default.
Implemented first-pass API:
sorted_heap_compact_partitions(parent regclass, fail_on_unsupported boolean default true)
-> setof partition maintenance rows
sorted_heap_merge_partitions(parent regclass, fail_on_unsupported boolean default true)
-> setof partition maintenance rows
sorted_heap_rebuild_zonemap_partitions(parent regclass, fail_on_unsupported boolean default true)
-> setof partition maintenance rows
sorted_heap_partition_maintenance_plan(parent regclass, operation text default 'compact')
-> setof partition maintenance plan rows
Result columns:
parent_relid oid
leaf_relid oid
leaf_name text
operation_name text
status text
message text
elapsed_ms double precision
Plan result columns:
parent_relid oid
leaf_relid oid
leaf_name text
operation_name text
status text -- would_run | blocked
message text
lock_mode text
tablespace_oid oid
tablespace_name name
tablespace_location text
relation_size_bytes bigint
estimated_temp_bytes bigint
Default behavior:
- recurse through the full partition tree;
- operate only on supported leaves;
- preflight and error if any leaf is unsupported or lacks a primary key;
- return one row per processed leaf;
- never operate on foreign partitions.
Optional policy:
fail_on_unsupported=falsereports unsupported leaves asskippedand operates on supported sorted_heap leaves. This is explicit at the call site and covered by regression.
Parent observability
Implemented first-pass API:
sorted_heap_partition_status(parent regclass) -> setof sorted_heap_partition_status
Result columns:
parent_relid oid
leaf_relid oid
leaf_name text
relkind "char"
am_name name
is_sorted_heap boolean
has_primary_key boolean
zone_map_valid boolean
zone_map_sorted boolean
sorted_prefix_pages integer
zone_map_entries integer
overflow_pages integer
relation_size_bytes bigint
This function should not mutate data. It should report unsupported leaves instead of erroring, unless the input is not a partitioned table or concrete sorted_heap table.
Existing APIs
Existing concrete-relation functions keep their current semantics:
sorted_heap_compact(regclass)sorted_heap_merge(regclass)sorted_heap_rebuild_zonemap(regclass)sorted_heap_zonemap_stats(regclass)
Resolved design choice:
- The concrete-relation functions remain concrete-only.
- Partitioned parents use explicit parent helpers that return per-leaf result rows. This avoids surprising users who expect a single-relation operation and lets callers observe partial success/failure per leaf.
Acceptance Tests
P1. Leaf-level scan pruning under a partitioned parent
Create a partitioned parent with two sorted_heap leaves. Compact both leaves. Query the parent with a partition key predicate and a primary-key range predicate.
Expected:
- PostgreSQL prunes irrelevant partitions;
- selected sorted_heap leaf uses
Custom Scan (SortedHeapScan)even when the query is issued against the parent; - result count is correct.
Current status:
- Covered by
SH23-3regression for a range-partitioned parent with sorted_heap leaves, partition key equality, and primary-key range predicate. - Also covered for multi-leaf parent queries with partition-key range,
IN, and literalANY(array)predicates that reach both leaves and produce at least twoSortedHeapScanchild plans. - Generic prepared parent queries are covered with
plan_cache_mode = force_generic_planfor equality andANY($1)predicates; the plans use runtime bounds and execution returns the expected counts.
P2. Parent status reports all leaves
Create a partitioned parent with two sorted_heap leaves and compact one leaf.
Expected:
sorted_heap_partition_status(parent)returns two rows;- compacted leaf reports
zone_map_valid = true; - uncompacted leaf reports its actual state;
- relation sizes and leaf names are populated.
Current status:
- Covered by
SH23regression for both a concrete sorted_heap table and a two-leaf sorted_heap parent.
P3. Parent compact operates leaf-by-leaf
Create a partitioned parent with two sorted_heap leaves. Insert unsorted data into both leaves.
Expected:
sorted_heap_compact_partitions(parent)returns two success rows;- both leaves report valid/sorted zone maps afterward;
- parent query still returns all rows correctly.
Current status:
- Covered by
SH23-2regression.
P4. Unsupported leaf fails closed
Create a partitioned parent with one sorted_heap leaf and one heap leaf. Also cover sorted_heap leaves that lack a primary key, because concrete maintenance requires PK metadata.
Expected:
- parent status reports both leaves and identifies the heap leaf as unsupported;
- parent compact helper errors before silently skipping unsupported storage;
- sorted_heap leaves without a primary key fail during preflight, before any leaf rewrite starts.
Current status:
- Covered by
SH23-5andSH23-6regressions. The explicitfail_on_unsupported=falsemode is also covered and returns askippedrow for the heap leaf or the no-PK sorted_heap leaf.
P5. Empty partition parent
Create a partitioned parent without concrete leaf partitions.
Expected:
- status helper returns zero rows;
- maintenance helper returns zero rows;
- the storage-less parent is not treated as an unsupported concrete leaf.
Current status:
- Covered by
SH23-7regression.
P6. Nested partition tree
Create a parent with an intermediate partitioned child and concrete sorted_heap leaves below it.
Expected:
- status and maintenance helpers recurse to concrete leaves;
- intermediate partitioned nodes are not treated as storage relations.
Current status:
- Covered by
SH23-8regression. Status and compact helpers recurse through an intermediate partitioned node and operate only on concrete sorted_heap leaves.
P7. Lock behavior is documented
Run parent compact helper while a concurrent transaction holds a lock on one leaf.
Expected:
- behavior is deterministic and documented;
- no silent partial success is hidden from the result stream.
Current status:
- Lock and temporary disk-space behavior is documented in
docs/api.mdanddocs/limitations.md. - Manual smoke
make test-partition-lockverifies that a held leaf lock blockssorted_heap_compact_partitions(...)vialock_timeout, and that the same helper processes both leaves after the lock is released.
Implementation Sketch
Discovery:
- use PostgreSQL partition APIs to collect leaf OIDs from a parent;
- preserve deterministic leaf order by OID or partition bound order;
- validate each leaf relation kind and table AM.
Maintenance:
- for compact/merge/rebuild, call the existing concrete-relation internal implementation per leaf where possible;
- keep one leaf as the unit of rewrite and locking;
- return one result row per leaf.
Observability:
- reuse the existing meta-page read path from
sorted_heap_zonemap_stats; - expose row columns instead of a formatted text blob.
Planner validation:
- add SQL regression using
EXPLAIN (COSTS OFF)on a partitioned parent; - assert
SortedHeapScanappears for selected sorted_heap leaves; - accept PostgreSQL partition-member child relations in the hook path
(
RELOPT_OTHER_MEMBER_RELviaIS_SIMPLE_REL) so parent-dispatched leaf scans can receive the same zone-map-pruned path as direct leaf scans.
Resolved Decisions
- Parent helpers are
FUNCTIONs returning rows, because callers need a machine-readable per-leaf result stream. sorted_heap_partition_maintenance_plan(...)is read-only and reports all blockers instead of stopping at the first unsupported leaf. It is the operator-facing dry-run path for lock/headroom review, including portable tablespace identity for external free-space monitoring.- Helpers preflight unsupported leaves by default and fail before work starts.
Explicit
fail_on_unsupported=falsereturnsskippedrows for unsupported leaves. - Parent maintenance processes one leaf at a time instead of locking all leaves up front. This keeps temporary disk-space requirements leaf-scoped.
- Parent-query
SortedHeapScanis handled in the generic planner hook by accepting simple partition-member relations. - GraphRAG parent fanout stays explicit for
0.13: callers register concrete leaves/shards, inspect the route plan, and call the routed GraphRAG APIs. The routed wrappers perform a global merge over selected shard-local result sets; they do not treat a declarative partition parent as an implicit GraphRAG relation.
Remaining Open Questions
Resolved in first-pass hardening:
- Parent maintenance validation: the dry-run plan now collects all blockers; mutating helpers still fail fast to avoid partial work.
- Route-first HNSW safety: explicit
leaf_relidsare validated against the requested parent, and every selected sorted_heap leaf must have a valid leaf-localsorted_hnswindex. - Tablespace identity: maintenance-plan rows include portable tablespace OID, name, and location fields. Actual free-byte checks stay external because PostgreSQL has no portable SQL-level filesystem free-space metric.
- Attach/detach/default-partition lifecycle: SH23-9 covers default partition detach/attach, standalone range partition attach/detach, status traversal refresh, detached sorted_heap standalone status, and helper traversal across the dynamic leaf set.
Still open for the next phase:
- Should
sorted_hnsw_partition_search(...)move from PL/pgSQL to C to reduce route-first helper overhead on small partitions? This is now gated by Partitioned HNSW C Helper Gate. - Should a future transparent parent-dispatched GraphRAG/ANN planner path ever
exist, or should parent fanout remain helper/API-level only? The
0.13contract intentionally chooses helper/API-level routing.
Definition of Done
The first implementation is complete when:
- partition leaf scan pruning is regression-tested;
- parent status function is implemented and regression-tested;
- parent compact/merge/rebuild helpers are implemented and regression-tested;
- unsupported leaf behavior is fail-closed and tested;
- docs describe free-space and locking behavior per leaf;
- existing non-partition tests still pass.
Current completion state:
- Done: parent status, compact helper, fail-closed unsupported leaf behavior,
explicit skip mode, merge/rebuild wrappers, concrete-table status/maintenance
compatibility, nested partition traversal, covered parent-query
SortedHeapScanfor single-leaf, multi-leaf range, and generic prepared runtime-bound shapes, lock/free-space documentation, optional lock-wait smoke, upgrade-path SQL, route-first partitioned HNSW helper, and explicit routed GraphRAG parent/shard fanout policy. - Open: any future transparent parent-dispatched GraphRAG/ANN planner path as a separate spec.