---
layout: default
title: Huge-Table Compaction
---

Spec: Huge-Table Compaction Operating Model

Status: implemented, first pass
Risk tier: CAUTION
Primary goal: make the disk-space and locking contract explicit for large sorted_heap deployments.

Problem

sorted_heap gets its read performance from physical locality plus zone maps. Eventually, write-heavy tables need maintenance to restore sorted order and fresh pruning metadata. The maintenance verbs are intentionally rewrite-based:

  • sorted_heap_compact(regclass) rewrites the relation in globally sorted PK order and rebuilds the zone map.
  • sorted_heap_merge(regclass) detects the already-sorted prefix, sorts the unsorted tail, and writes a new relation.
  • online variants reduce blocking time with trigger-based change capture, but they still build a replacement relation before the final swap.

That means compaction is not an in-place defragmenter. Operators need temporary disk headroom for the rewrite unit.
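As a concrete sketch of the rewrite-and-swap shape (the leaf name events_2024_06 is hypothetical; the function signature follows the contract above):

```sql
-- Rewrites the whole relation in PK order and rebuilds the zone map.
-- The old and new relations coexist until the final swap, so roughly
-- the table's current total size must be free on disk before starting.
SELECT sorted_heap_compact('events_2024_06'::regclass);
```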

Current Contracts

Concrete table operations

| Operation | Rewrite unit | Main lock shape | Disk headroom |
| --- | --- | --- | --- |
| sorted_heap_compact | whole relation | AccessExclusiveLock for the operation | replacement relation + rebuilt indexes |
| sorted_heap_merge | whole relation | AccessExclusiveLock for the operation | replacement relation + rebuilt indexes |
| sorted_heap_compact_online | whole relation | long copy under weaker lock, brief final AccessExclusiveLock | replacement relation + log table + rebuilt indexes |
| sorted_heap_merge_online | whole relation | same online shape as compact | replacement relation + log table + rebuilt indexes |

sorted_heap_merge may be faster than full compact when the sorted prefix is large, but it still produces a new relation. It is a CPU/sort optimization, not a disk-headroom optimization.

Partitioned parent operations

Partition helpers make the leaf the rewrite unit:

SELECT *
FROM sorted_heap_compact_partitions('events_parent'::regclass);

SELECT *
FROM sorted_heap_merge_partitions('events_parent'::regclass);

Contract:

  • recurse through concrete leaves;
  • preflight and skip unsupported leaves and sorted_heap leaves that lack a primary key;
  • process supported leaves one at a time;
  • return one result row per processed or skipped leaf;
  • reduce temporary disk headroom from “whole logical parent” to “current leaf”;
  • do not provide one global all-leaves transaction.

If a later leaf fails, earlier leaves remain processed. This is intentional: the helper is an operational wrapper over concrete relation maintenance, not a distributed transaction protocol.
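Because partial progress is possible, a run should inspect the per-leaf result rows rather than assume all-or-nothing behavior. A minimal sketch, assuming the result set carries a per-leaf status column (the column name `status` and its values are illustrative, not part of the documented contract):

```sql
-- Surface leaves that were skipped or failed during a partition-wide run.
SELECT *
FROM sorted_heap_compact_partitions('events_parent'::regclass)
WHERE status <> 'processed';  -- 'status' is a hypothetical column name
```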

Recommended Huge-Table Strategy

Use declarative partitioning to bound the maintenance unit.

Typical partition keys:

  • time range for event/time-series tables;
  • tenant / knowledge-base / shard id for multi-tenant retrieval;
  • coarse lifecycle segment for hot/cold data.
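A time-range layout might look like the following: this is standard PostgreSQL declarative partitioning; the table and column names are illustrative, and sorted_heap-specific storage options are omitted.

```sql
-- Monthly leaves bound the rewrite unit to one month of data.
-- Note: the primary key of a partitioned table must include the
-- partition key column.
CREATE TABLE events_parent (
    id         bigint NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_06 PARTITION OF events_parent
    FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');
```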

Recommended flow:

  1. Keep each leaf at a size where one rewrite fits operational headroom.
  2. Bulk load or ingest into leaves.
  3. Run sorted_heap_compact_partitions(...) or sorted_heap_merge_partitions(...) during maintenance windows.
  4. Before the run, inspect sorted_heap_partition_maintenance_plan(parent, 'compact') or sorted_heap_partition_maintenance_plan(parent, 'merge') to see all blockers, the per-leaf rewrite headroom estimate, and the tablespace that should be checked by external free-space monitoring.
  5. Inspect state with sorted_heap_partition_status(parent).
  6. For live systems, prefer online concrete operations on the hot leaf when blocking time matters more than total runtime.
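Steps 3 through 5 above can be sketched as one maintenance-window session (the parent name is hypothetical; function names follow the contracts in this spec):

```sql
-- Dry run: blockers, per-leaf rewrite headroom estimate, tablespace identity.
SELECT *
FROM sorted_heap_partition_maintenance_plan('events_parent'::regclass, 'compact');

-- Leaf-by-leaf rewrite; one result row per processed or skipped leaf.
SELECT *
FROM sorted_heap_compact_partitions('events_parent'::regclass);

-- Confirm post-run state.
SELECT *
FROM sorted_heap_partition_status('events_parent'::regclass);
```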

Sizing Heuristic

For a concrete leaf, reserve enough free space for:

replacement heap relation
+ rebuilt secondary indexes
+ transient online log table, for online variants
+ normal PostgreSQL WAL/checkpoint headroom

This spec does not give a universal percentage because row width, index count, TOAST use, WAL settings, and filesystem behavior dominate. The actionable rule is simpler: make partitions small enough that one leaf rewrite is safe.
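A rough pre-check of the dominant term (replacement heap plus rebuilt indexes) can lean on stock PostgreSQL size functions; the leaf name is hypothetical:

```sql
-- pg_total_relation_size includes the heap, TOAST, and all indexes, which
-- approximates the space the replacement relation and rebuilt indexes need.
-- Compare against filesystem free space, leaving WAL/checkpoint margin.
SELECT pg_size_pretty(pg_total_relation_size('events_2024_06'::regclass))
       AS rewrite_headroom_estimate;
```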

Non-Goals

  • No promise of in-place compaction.
  • No segment-level rewrite inside one relation yet.
  • No global all-partition atomic maintenance.
  • No automatic repartitioning or partition-size advisor.

Future Segment-Level Compaction

Segment-level compaction could reduce disk headroom inside a single large relation, but it needs a separate storage design:

  • crash-safe mapping from old segment pages to new segment pages;
  • index TID update or indirection semantics;
  • zone-map and sorted-prefix updates that remain valid across partial rewrite;
  • WAL/recovery story for interrupted segment swaps;
  • planner/executor behavior while old and new segments coexist.

Until those are specified and tested, partition-scoped rewrite is the supported large-table operating model.

Acceptance Tests

Already covered:

  • SH23 verifies partition parent helpers, nested leaves, unsupported leaves, and no-PK leaves.
  • make test-partition-lock verifies that a held leaf lock blocks the parent compact, and that the helper succeeds after the lock is released.
  • Existing dump/restore, TOAST, crash, and concurrent online tests cover the underlying concrete rewrite paths.

Future tests:

  • benchmark partitioned maintenance over multiple differently sized leaves;
  • verify failure reporting when a later leaf cannot acquire its lock;
  • continue validating the dry-run estimate against operator feedback: the SQL plan now reports tablespace identity, while actual free bytes remain an external filesystem/platform metric;
  • attach/detach/default-partition lifecycle regression is covered by SH23-9.

Quadrumvirate Notes

Cassandra:

  • Likely failure mode: users infer “compact” means in-place. The docs must say rewrite-and-swap plainly.
  • Likely failure mode: online compact is mistaken for lower disk usage. It lowers blocking, not rewrite headroom.

Daedalus:

  • Reframe from “how do we compact a huge table” to “how do we bound the rewrite unit”. Partitioning is the current answer.

Maieutic:

  • Assumption: one logical table must be one physical rewrite unit. Refuted by PostgreSQL declarative partitioning and leaf-scoped helpers.
  • Assumption: segment-level compaction is a small extension of merge. It is not; it changes crash recovery and index/TID semantics.

Adversary:

  • Any future in-place/segment claim must specify index updates and crash recovery before implementation.
  • Any operator-facing helper must expose partial failures explicitly.