pgmnemo Scientific-Technical Release Process

Effective: 2026-05-10
Authority: Founder directive, formalized by engineering + research working group
Applies to: Every minor and major release (v0.X.Y where X or Y increments)


1. Mandate

Every release of pgmnemo must be backed by:

  1. Full benchmark reports on all mandatory benchmarks (§2)
  2. Statistical significance analysis on every claimed improvement (§3)
  3. Working Group (WG) review and sign-off (§4)
  4. A written decision document: Ship or Hold (§5)

No version tag is cut until all four are complete.


2. Benchmark Mandate

2.1 Required Benchmarks (every minor + major release)

Benchmark | Dataset | Metric focus | Notes
LoCoMo | snap-research/locomo (locomo10.json, pinned SHA) | recall@5/10/25/50, MRR, per-category | Run all 1982+ questions, never a sample
LongMemEval-S | xiaowu0162/longmemeval-cleaned (longmemeval_s_cleaned.json, pinned SHA) | recall@1/5/10/20, MRR, per-qtype | n ≥ 500

2.2 Additional Benchmarks (when applicable)

Benchmark | Trigger
HippoRAG | Graph-based retrieval changes
MemoryBank | Episodic memory architecture changes
Custom domain eval | A new vertical-specific feature is added

2.3 Dataset Pinning

  • Every run records dataset_sha256 of the raw file
  • If upstream dataset changes between versions, the deviation is documented in the Methodology Changes section of the release notes
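The recorded dataset_sha256 can be computed with a short helper along these lines (a sketch; the function name and streaming read are illustrative, not part of any pgmnemo script):

```python
import hashlib

def dataset_sha256(path: str) -> str:
    """Compute the SHA-256 of a raw dataset file, streaming in 1 MiB chunks
    so large benchmark files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```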

3. Statistical Reporting Requirements

3.1 All Metrics Reported

Every benchmark run reports all of the following metrics — no cherry-picking:

  • recall@1, recall@5, recall@10, recall@20 (LongMemEval); recall@5, recall@10, recall@25, recall@50 (LoCoMo)
  • MRR (Mean Reciprocal Rank)
  • NDCG (Normalized Discounted Cumulative Gain) — where ground-truth supports it
  • Per-category / per-qtype breakdowns
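As a reference for how the headline metrics are defined, a minimal sketch, assuming each query has been reduced to the 1-based rank of its first correct hit (None for a miss):

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose first relevant result appears at rank <= k.
    `ranks` holds the 1-based rank of the first correct hit per query, or None."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)

def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank over all queries, misses count as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)
```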

3.2 Confidence Intervals

  • 95% Wilson confidence intervals on all proportion metrics
  • Reported as [lo, hi] alongside the point estimate
  • Never report a point estimate without a CI
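A minimal sketch of the Wilson score interval required above (helper name and signature are illustrative):

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a proportion k/n (z = 1.96 for 95%)."""
    if n == 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

Unlike the naive normal interval, the Wilson interval stays inside [0, 1] and behaves sensibly at extreme proportions (e.g. 0/10 still yields a nonzero upper bound).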

3.3 Pairwise Significance Tests

For every metric, compare current version vs. immediately previous version:

  • Two-proportion z-test (recall@k values are proportions):

      p_pool = (k1 + k2) / (n1 + n2)
      SE = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
      z = (p2 - p1) / SE
      p_two_tailed = 2 * (1 - Phi(|z|))

  • Paired t-test or Wilcoxon signed-rank test for MRR (continuous values) when per-query scores are available

  • Report: delta, z, p_raw, p_corrected, significant (yes/no)
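The z-test can be sketched as follows, pooling successes over trials for the null-hypothesis proportion (helper name illustrative; scripts/significance_test.py remains the authoritative implementation):

```python
from math import erf, sqrt

def two_prop_ztest(k1, n1, k2, n2):
    """Two-proportion z-test: does p2 = k2/n2 differ from p1 = k1/n1?
    Returns (z, two-tailed p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)              # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF
    return z, 2 * (1 - phi(abs(z)))
```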

3.4 Multiple Comparisons Correction

  • Apply Holm-Bonferroni correction across all pairwise tests in a single report
  • Report both p_raw and p_corrected
  • A result is “significant” only if p_corrected < 0.05
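The step-down correction can be sketched as follows (illustrative helper, not the project script; the running maximum enforces the monotonicity of Holm-adjusted p-values):

```python
def holm_correct(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction.
    Returns (p_corrected, significant) lists in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    corrected = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, next by m-1, and so on.
        running_max = max(running_max, (m - rank) * p_values[i])
        corrected[i] = min(1.0, running_max)
    significant = [c < alpha for c in corrected]
    return corrected, significant
```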

3.5 Effect Size

  • Cohen’s h for proportion differences:

      h = 2 * arcsin(sqrt(p2)) - 2 * arcsin(sqrt(p1))

    Interpretation: |h| < 0.2 small, 0.2–0.5 medium, > 0.5 large

  • Report alongside every z-test result
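A sketch of Cohen's h and its interpretation bands (helper names illustrative):

```python
from math import asin, sqrt

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size between two proportions (arcsine transform)."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

def interpret_h(h: float) -> str:
    """Band per the thresholds above: |h| < 0.2 small, 0.2-0.5 medium, > 0.5 large."""
    a = abs(h)
    if a < 0.2:
        return "small"
    if a <= 0.5:
        return "medium"
    return "large"
```

Note that a statistically significant delta on a large n can still be a small effect; reporting h alongside the z-test makes that distinction visible.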

3.6 Tooling

Use scripts/significance_test.py for all statistical computations. Input: two metrics.json files. Output: full comparison table with CIs, z-scores, p-values, Holm corrections, and Cohen’s h.


4. Working Group (WG) Review Gate

4.1 WG Composition

Role | Responsibility
PI (Principal Investigator) | Final ship/hold authority
Chief Architect | Validity of implementation and methodology
StatAnalyst | Independent re-derivation of all statistical claims
ResSup (Research Supervisor) | Threat-to-validity assessment, benchmark integrity

4.2 Review Process

  1. Author produces draft benchmark report using benchmarks/REPORT_TEMPLATE.md
  2. scripts/significance_test.py run against current vs. previous metrics.json — output appended to report
  3. Draft circulated to WG at least 48 hours before proposed tag date
  4. StatAnalyst independently re-derives key statistics (different seed if simulation involved)
  5. Each WG member signs off in the report’s WG Sign-off section
  6. All four signatures required before tag is cut

4.3 Quorum Exception

If one WG member is unavailable, PI may grant a 3-of-4 quorum exception, documented in the report.


5. Decision Matrix: Ship vs. Hold

Primary metric (recall@10) | Secondary metric (MRR) | Decision | Rationale
Significant improvement (p_corr < 0.05) | Significant improvement | SHIP | Clear win
Significant improvement | Non-significant / neutral | SHIP with caveat | Lead metric improved; note MRR stability
Non-significant | Significant improvement | CONDITIONAL SHIP | Assess whether MRR alone justifies the claim; see §5.1
Non-significant | Non-significant | HOLD or SHIP as no-claim | Ship only if no performance claims are made; document as “neutral”
Any metric regressed significantly | (any) | HOLD | Regression must be resolved or explicitly accepted with rationale
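The matrix above can be expressed as a small decision function (a hypothetical helper, not part of the release tooling; each boolean means "significant after Holm correction"):

```python
def ship_decision(primary_sig_up: bool, secondary_sig_up: bool,
                  any_sig_regression: bool) -> str:
    """Map the §5 decision matrix to a verdict string."""
    if any_sig_regression:
        return "HOLD"                        # regressions trump everything
    if primary_sig_up and secondary_sig_up:
        return "SHIP"
    if primary_sig_up:
        return "SHIP with caveat"
    if secondary_sig_up:
        return "CONDITIONAL SHIP"            # subject to §5.1 criteria
    return "HOLD or SHIP as no-claim"
```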

5.1 Conditional Ship Criteria

A feature may ship with “Conditional” status when the primary recall metric is non-significant but one or more secondary metrics are significant, provided ALL of the following hold:

  1. No primary metric regressed significantly
  2. The feature has a strong theoretical justification for MRR gain without recall gain
  3. The release notes accurately reflect which metrics are/are not significant
  4. WG unanimous agreement (no quorum exception)

5.2 Prohibited Claims

  • Never claim improvement on a metric where p_corrected ≥ 0.05
  • Never report only the best-performing metric subset
  • “~X pp improvement” claims require citing the specific metric, CI, and p-value

6. Public Release Notes Structure

Each release’s public-facing notes must have these exact sections:

Significant Improvements

Only metrics where p_corrected < 0.05 after Holm-Bonferroni correction.
Format: metric: +Xpp (95% CI [lo, hi], p=Y, h=Z)
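The format line above can be rendered with a sketch like this (function name and rounding choices are illustrative, not prescribed):

```python
def format_improvement(metric: str, delta_pp: float, ci: tuple,
                       p: float, h: float) -> str:
    """Render one 'Significant Improvements' line in the §6 format."""
    lo, hi = ci
    return (f"{metric}: +{delta_pp:.1f}pp "
            f"(95% CI [{lo:.3f}, {hi:.3f}], p={p:.4f}, h={h:.2f})")
```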

Marginal / Non-Significant Changes

Metrics that changed within statistical noise.
Format: metric: +Xpp (95% CI, p=Y ns)

Regressions

Any metric that worsened, whether significant or not.
Format: metric: -Xpp (95% CI, p=Y)

Methodology Changes

Any deviation from the previous run’s methodology (embedder, dataset version, retrieval formula, etc.).

Benchmark Integrity

Dataset SHA256s, run environment, wall-clock time, device.


7. Versioning of This Process

This document is versioned alongside pgmnemo. Breaking changes to the process require:

  • Founder approval
  • A new section in this document, dated and initialed
  • Retroactive tagging of any releases that used the prior process


Appendix A: Process Checklist (per release)

[ ] All mandatory benchmarks run on final code
[ ] metrics.json files produced with pinned dataset SHA256
[ ] significance_test.py run: current vs. previous metrics.json
[ ] Draft report written using REPORT_TEMPLATE.md
[ ] Draft shared with WG ≥48h before tag
[ ] StatAnalyst independent re-derivation complete
[ ] All 4 WG signatures collected (or documented quorum exception)
[ ] Decision matrix applied, ship/hold documented
[ ] Public release notes follow §6 structure
[ ] No prohibited claims (§5.2) in any public-facing text
[ ] Git tag cut only after all above