pgmnemo Scientific-Technical Release Process

Effective: 2026-05-10
Authority: Founder directive, formalized by engineering + research working group
Applies to: Every minor and major release (v0.X.Y where X or Y increments)


1. Mandate

Every release of pgmnemo must be backed by:

  1. Full benchmark reports on all mandatory benchmarks (§2)
  2. Statistical significance analysis on every claimed improvement (§3)
  3. Working Group (WG) review and sign-off (§4)
  4. A written decision document: Ship or Hold (§5)

No version tag is cut until all four are complete.


2. Benchmark Mandate

2.1 Required Benchmarks (every minor + major release)

Benchmark | Dataset | Metric focus | Notes
LoCoMo | snap-research/locomo (locomo10.json, pinned SHA) | recall@5/10/25/50, MRR, per-category | Run all 1982+ questions, never a sample
LongMemEval-S | xiaowu0162/longmemeval-cleaned (longmemeval_s_cleaned.json, pinned SHA) | recall@1/5/10/20, MRR, per-qtype | n ≥ 500

2.2 Additional Benchmarks (when applicable)

Benchmark | Trigger
HippoRAG | Graph-based retrieval changes
MemoryBank | Episodic memory architecture changes
Custom domain eval | A new vertical-specific feature is added

2.3 Dataset Pinning

  • Every run records dataset_sha256 of the raw file
  • If upstream dataset changes between versions, the deviation is documented in the Methodology Changes section of the release notes
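The recorded dataset_sha256 can be computed with a short helper along these lines (a sketch; the function name and streaming read are illustrative, not part of any pgmnemo script):

```python
import hashlib

def dataset_sha256(path: str) -> str:
    """Compute the SHA-256 of a raw dataset file, streaming in 1 MiB chunks
    so large benchmark files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```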

3. Statistical Reporting Requirements

3.1 All Metrics Reported

Every benchmark run reports all of the following metrics — no cherry-picking:

  • recall@1, recall@5, recall@10, recall@20 (LongMemEval); recall@5, recall@10, recall@25, recall@50 (LoCoMo)
  • MRR (Mean Reciprocal Rank)
  • NDCG (Normalized Discounted Cumulative Gain) — where ground-truth supports it
  • Per-category / per-qtype breakdowns
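As a reference for how the headline metrics are defined, a minimal sketch, assuming each query has been reduced to the 1-based rank of its first correct hit (None for a miss):

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose first relevant result appears at rank <= k.
    `ranks` holds the 1-based rank of the first correct hit per query, or None."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)

def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank over all queries, misses count as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)
```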

3.2 Confidence Intervals

  • 95% Wilson confidence intervals on all proportion metrics
  • Reported as [lo, hi] alongside the point estimate
  • Never report a point estimate without a CI
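A minimal sketch of the Wilson score interval required above (helper name and signature are illustrative):

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a proportion k/n (z = 1.96 for 95%)."""
    if n == 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

Unlike the naive normal interval, the Wilson interval stays inside [0, 1] and behaves sensibly at extreme proportions (e.g. 0/10 still yields a nonzero upper bound).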

3.3 Pairwise Significance Tests

For every metric, compare current version vs. immediately previous version:

  • Two-proportion z-test (recall@k values are proportions):

      p_pool = (k1 + k2) / (n1 + n2)
      SE = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
      z = (p2 - p1) / SE
      p_two_tailed = 2 * (1 - Phi(|z|))

  • Paired t-test or Wilcoxon signed-rank test for MRR (continuous values) when per-query scores are available

  • Report: delta, z, p_raw, p_corrected, significant (yes/no)
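The z-test can be sketched as follows, pooling successes over trials for the null-hypothesis proportion (helper name illustrative; scripts/significance_test.py remains the authoritative implementation):

```python
from math import erf, sqrt

def two_prop_ztest(k1, n1, k2, n2):
    """Two-proportion z-test: does p2 = k2/n2 differ from p1 = k1/n1?
    Returns (z, two-tailed p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)              # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF
    return z, 2 * (1 - phi(abs(z)))
```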

3.4 Multiple Comparisons Correction

  • Apply Holm-Bonferroni correction across all pairwise tests in a single report
  • Report both p_raw and p_corrected
  • A result is “significant” only if p_corrected < 0.05
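The step-down correction can be sketched as follows (illustrative helper, not the project script; the running maximum enforces the monotonicity of Holm-adjusted p-values):

```python
def holm_correct(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction.
    Returns (p_corrected, significant) lists in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    corrected = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, next by m-1, and so on.
        running_max = max(running_max, (m - rank) * p_values[i])
        corrected[i] = min(1.0, running_max)
    significant = [c < alpha for c in corrected]
    return corrected, significant
```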

3.5 Effect Size

  • Cohen’s h for proportion differences:

      h = 2 * arcsin(sqrt(p2)) - 2 * arcsin(sqrt(p1))

    Interpretation: |h| < 0.2 small, 0.2–0.5 medium, > 0.5 large

  • Report alongside every z-test result
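A sketch of Cohen's h and its interpretation bands (helper names illustrative):

```python
from math import asin, sqrt

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size between two proportions (arcsine transform)."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

def interpret_h(h: float) -> str:
    """Band per the thresholds above: |h| < 0.2 small, 0.2-0.5 medium, > 0.5 large."""
    a = abs(h)
    if a < 0.2:
        return "small"
    if a <= 0.5:
        return "medium"
    return "large"
```

Note that a statistically significant delta on a large n can still be a small effect; reporting h alongside the z-test makes that distinction visible.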

3.6 Tooling

Use scripts/significance_test.py for all statistical computations. Input: two metrics.json files. Output: full comparison table with CIs, z-scores, p-values, Holm corrections, and Cohen’s h.


4. Working Group (WG) Review Gate

4.1 WG Composition

Role | Responsibility
PI (Principal Investigator) | Final ship/hold authority
Chief Architect | Validity of implementation and methodology
StatAnalyst | Independent re-derivation of all statistical claims
ResSup (Research Supervisor) | Threat-to-validity assessment, benchmark integrity

4.2 Review Process

  1. Author produces draft benchmark report using benchmarks/REPORT_TEMPLATE.md
  2. scripts/significance_test.py run against current vs. previous metrics.json — output appended to report
  3. Draft circulated to WG at least 48 hours before proposed tag date
  4. StatAnalyst independently re-derives key statistics (different seed if simulation involved)
  5. Each WG member signs off in the report’s WG Sign-off section
  6. All four signatures required before tag is cut

4.3 Quorum Exception

If one WG member is unavailable, PI may grant a 3-of-4 quorum exception, documented in the report.


5. Decision Matrix: Ship vs. Hold

Primary metric (recall@10) | Secondary metric (MRR) | Decision | Rationale
Significant improvement (p_corr < 0.05) | Significant improvement | SHIP | Clear win
Significant improvement | Non-significant / neutral | SHIP with caveat | Lead metric improved; note MRR stability
Non-significant | Significant improvement | CONDITIONAL SHIP | Assess whether MRR alone justifies the claim; see §5.1
Non-significant | Non-significant | HOLD or SHIP as no-claim | Ship only if no performance claims are made; document as “neutral”
Any metric regressed significantly | (any) | HOLD | Regression must be resolved or explicitly accepted with rationale
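The matrix above can be expressed as a small decision function (a hypothetical helper, not part of the release tooling; each boolean means "significant after Holm correction"):

```python
def ship_decision(primary_sig_up: bool, secondary_sig_up: bool,
                  any_sig_regression: bool) -> str:
    """Map the §5 decision matrix to a verdict string."""
    if any_sig_regression:
        return "HOLD"                        # regressions trump everything
    if primary_sig_up and secondary_sig_up:
        return "SHIP"
    if primary_sig_up:
        return "SHIP with caveat"
    if secondary_sig_up:
        return "CONDITIONAL SHIP"            # subject to §5.1 criteria
    return "HOLD or SHIP as no-claim"
```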

5.1 Conditional Ship Criteria

A feature may ship with “Conditional” status when the primary recall metric is non-significant but one or more secondary metrics are significant, provided ALL of the following hold:

  1. No primary metric regressed significantly
  2. The feature has a strong theoretical justification for MRR gain without recall gain
  3. The release notes accurately reflect which metrics are/are not significant
  4. WG unanimous agreement (no quorum exception)

5.2 Prohibited Claims

  • Never claim improvement on a metric where p_corrected ≥ 0.05
  • Never report only the best-performing metric subset
  • “~X pp improvement” claims require citing the specific metric, CI, and p-value

6. Public Release Notes Structure

Each release’s public-facing notes must have these exact sections:

Significant Improvements

Only metrics where p_corrected < 0.05 after Holm-Bonferroni correction.
Format: metric: +Xpp (95% CI [lo, hi], p=Y, h=Z)
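The format line above can be rendered with a sketch like this (function name and rounding choices are illustrative, not prescribed):

```python
def format_improvement(metric: str, delta_pp: float, ci: tuple,
                       p: float, h: float) -> str:
    """Render one 'Significant Improvements' line in the §6 format."""
    lo, hi = ci
    return (f"{metric}: +{delta_pp:.1f}pp "
            f"(95% CI [{lo:.3f}, {hi:.3f}], p={p:.4f}, h={h:.2f})")
```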

Marginal / Non-Significant Changes

Metrics that changed within statistical noise.
Format: metric: +Xpp (95% CI, p=Y ns)

Regressions

Any metric that worsened, whether significant or not.
Format: metric: -Xpp (95% CI, p=Y)

Methodology Changes

Any deviation from the previous run’s methodology (embedder, dataset version, retrieval formula, etc.).

Benchmark Integrity

Dataset SHA256s, run environment, wall-clock time, device.


7. Versioning of This Process

This document is versioned alongside pgmnemo. Breaking changes to the process require:

  • Founder approval
  • A new section in this document, dated and initialed
  • Retroactive tagging of any releases that used the prior process


Appendix A: Process Checklist (per release)

[ ] All mandatory benchmarks run on final code
[ ] metrics.json files produced with pinned dataset SHA256
[ ] significance_test.py run: current vs. previous metrics.json
[ ] Draft report written using REPORT_TEMPLATE.md
[ ] Draft shared with WG ≥48h before tag
[ ] StatAnalyst independent re-derivation complete
[ ] All 4 WG signatures collected (or documented quorum exception)
[ ] Decision matrix applied, ship/hold documented
[ ] Public release notes follow §6 structure
[ ] No prohibited claims (§5.2) in any public-facing text
[ ] Git tag cut only after all above