pgmnemo Scientific-Technical Release Process
Effective: 2026-05-10
Authority: Founder directive, formalized by engineering + research working group
Applies to: Every minor and major release (v0.X.Y where X or Y increments)
1. Mandate
Every release of pgmnemo must be backed by:
- Full benchmark reports on all mandatory benchmarks (§2)
- Working Group (WG) review and sign-off (§4)
- Statistical significance analysis on every claimed improvement (§3)
- A written decision document: Ship or Hold (§5)
No version tag is cut until all four are complete.
2. Benchmark Mandate
2.1 Required Benchmarks (every minor + major release)
| Benchmark | Dataset | Metric focus | Notes |
|---|---|---|---|
| LoCoMo | snap-research/locomo (locomo10.json, pinned SHA) | recall@5/10/25/50, MRR, per-category | Run full 1982+ questions, not sampled |
| LongMemEval-S | xiaowu0162/longmemeval-cleaned (longmemeval_s_cleaned.json, pinned SHA) | recall@1/5/10/20, MRR, per-qtype | n≥500 |
2.2 Additional Benchmarks (when applicable)
| Benchmark | Trigger |
|---|---|
| HippoRAG | If graph-based retrieval changes |
| MemoryBank | If episodic memory architecture changes |
| Custom domain eval | If new vertical-specific feature is added |
2.3 Dataset Pinning
- Every run records `dataset_sha256` of the raw file
- If upstream dataset changes between versions, the deviation is documented in the Methodology Changes section of the release notes
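A minimal sketch of how the pinned digest could be computed and checked. The function names and the chunked-read approach are illustrative assumptions, not the project's actual tooling; only the `dataset_sha256` field name comes from this document.

```python
import hashlib
from pathlib import Path


def dataset_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the raw file at `path`."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in chunks so large dataset files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_pin(path: str, pinned_sha256: str) -> bool:
    """True if the file on disk matches the pinned digest (hypothetical helper)."""
    return dataset_sha256(path) == pinned_sha256.strip().lower()
```

A run would record the returned digest next to its metrics, and a mismatch against the pin would be the trigger for a Methodology Changes entry.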
3. Statistical Reporting Requirements
3.1 All Metrics Reported
Every benchmark run reports all of the following metrics — no cherry-picking:
- recall@1, recall@5, recall@10, recall@20 (LongMemEval); recall@5, recall@10, recall@25, recall@50 (LoCoMo)
- MRR (Mean Reciprocal Rank)
- NDCG (Normalized Discounted Cumulative Gain) — where ground-truth supports it
- Per-category / per-qtype breakdowns
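As an illustration of the two headline metrics, the sketch below computes recall@k and MRR from ranked retrieval results. It assumes recall@k is scored per question as a hit (1 if any gold item is in the top k, else 0), which matches the later treatment of recall@k as a proportion; the function names and input shape are hypothetical, not taken from the project's scripts.

```python
from typing import Sequence


def hit_at_k(ranked: Sequence[str], gold: Sequence[str], k: int) -> int:
    """1 if any gold item appears in the top-k results, else 0."""
    gold_set = set(gold)
    return int(any(item in gold_set for item in ranked[:k]))


def reciprocal_rank(ranked: Sequence[str], gold: Sequence[str]) -> float:
    """1/rank of the first gold item (1-indexed), or 0.0 if none is retrieved."""
    gold_set = set(gold)
    for rank, item in enumerate(ranked, start=1):
        if item in gold_set:
            return 1.0 / rank
    return 0.0


def summarize(queries: Sequence[tuple], ks: Sequence[int]) -> dict:
    """Aggregate recall@k (hit counts and proportions) and MRR over all queries.

    Each query is a (ranked_ids, gold_ids) pair.
    """
    n = len(queries)
    if n == 0:
        raise ValueError("at least one query is required")
    report = {"n": n}
    for k in ks:
        hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in queries)
        # Keep the raw hit count: the CI and z-test need k and n, not just k/n.
        report[f"recall@{k}"] = {"hits": hits, "value": hits / n}
    report["mrr"] = sum(reciprocal_rank(r, g) for r, g in queries) / n
    return report
```

Storing the hit count alongside the proportion keeps the inputs for the confidence intervals and significance tests available without re-running the benchmark.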
3.2 Confidence Intervals
- 95% Wilson confidence intervals on all proportion metrics
- Reported as `[lo, hi]` alongside the point estimate
- Never report a point estimate without a CI
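For reference, the Wilson score interval can be computed with the standard library alone. This is a sketch of the textbook formula, not the contents of `scripts/significance_test.py`; the function name and the hard-coded 95% critical value are assumptions.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.959963984540054) -> tuple:
    """Wilson score interval [lo, hi] for k successes out of n trials.

    The default z is the two-sided 95% critical value of the standard normal.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = k / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

For example, 50 hits out of 100 questions gives roughly `[0.404, 0.596]`. Unlike the simple normal-approximation interval, the Wilson interval stays inside `[0, 1]` and behaves sensibly when the proportion is near 0 or 1.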
3.3 Pairwise Significance Tests
For every metric, compare current version vs. immediately previous version:
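- Two-proportion z-test (for recall@k, which are proportions), where `k1` hits out of `n1` questions belong to the previous version and `k2` out of `n2` to the current one:
  - `p1 = k1 / n1`, `p2 = k2 / n2`
  - `p_pool = (k1 + k2) / (n1 + n2)`
  - `SE = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))`
  - `z = (p2 - p1) / SE`
  - `p_two_tailed = 2 * (1 - Phi(|z|))`, where `Phi` is the standard normal CDF
- Paired t-test or Wilcoxon signed-rank for MRR (continuous values) if per-query scores are available
- Report: `delta`, `z`, `p_raw`, `p_corrected`, `significant` (yes/no)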
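The z-test can be sketched in a few lines of standard-library Python. The version below uses the standard pooled proportion, `(k1 + k2) / (n1 + n2)`, and treats subscript 1 as the previous version and 2 as the current one; the function names and the returned dictionary keys are illustrative assumptions rather than the project's actual interface.

```python
import math


def phi(x: float) -> float:
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_proportion_z_test(k1: int, n1: int, k2: int, n2: int) -> dict:
    """Two-sided z-test for the difference between proportions k2/n2 and k1/n1."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0.0:
        # Both versions scored 0% or 100%: no variance, so no evidence of a difference.
        return {"delta": p2 - p1, "z": 0.0, "p_raw": 1.0}
    z = (p2 - p1) / se
    p_raw = 2 * (1 - phi(abs(z)))
    return {"delta": p2 - p1, "z": z, "p_raw": p_raw}
```

As a worked case, 300/500 hits on the previous version against 330/500 on the current one gives a delta of 6 pp, a z of about 1.96, and a raw p-value just under 0.05, which may well stop being significant once the multiple-comparisons correction is applied.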
3.4 Multiple Comparisons Correction
- Apply Holm-Bonferroni correction across all pairwise tests in a single report
- Report both `p_raw` and `p_corrected`
- A result is “significant” only if `p_corrected < 0.05`
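A compact sketch of the Holm-Bonferroni step-down adjustment, returning adjusted p-values in the original input order. The function name is an assumption; the procedure itself is the standard one (multiply the i-th smallest p-value by `m - i`, enforce monotonicity, cap at 1).

```python
def holm_bonferroni(p_values: list) -> list:
    """Return Holm-adjusted p-values, aligned with the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must never decrease as the raw p-values grow.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

For raw p-values `[0.01, 0.04, 0.03, 0.005]` this yields approximately `[0.03, 0.06, 0.06, 0.02]`, so only the first and last tests would count as significant at the 0.05 threshold.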
3.5 Effect Size
- Cohen’s h for proportion differences: `h = 2 * arcsin(sqrt(p2)) - 2 * arcsin(sqrt(p1))`
- Interpretation (Cohen’s conventional benchmarks): |h| around 0.2 is small, 0.5 medium, 0.8 large; values well below 0.2 are negligible
- Report alongside every z-test result
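The effect size is a one-line computation; the sketch below is only a direct transcription of the formula, with `p1` taken as the previous version's proportion and `p2` as the current one (an assumption consistent with the sign convention of the z-test).

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for the change from proportion p1 to proportion p2."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))
```

Moving recall from 0.60 to 0.66, for instance, gives an h of roughly 0.12. A change like that can be statistically detectable at large n while still being a small effect, which is why h is reported next to the p-value rather than instead of it.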
3.6 Tooling
Use scripts/significance_test.py for all statistical computations. Input: two metrics.json files. Output: full comparison table with CIs, z-scores, p-values, Holm corrections, and Cohen’s h.
4. Working Group (WG) Review Gate
4.1 WG Composition
| Role | Responsibility |
|---|---|
| PI (Principal Investigator) | Final ship/hold authority |
| Chief Architect | Validity of implementation and methodology |
| StatAnalyst | Independent re-derivation of all statistical claims |
| ResSup (Research Supervisor) | Threat-to-validity assessment, benchmark integrity |
4.2 Review Process
- Author produces draft benchmark report using `benchmarks/REPORT_TEMPLATE.md`
- `scripts/significance_test.py` run against current vs. previous `metrics.json`; output appended to report
- Draft circulated to WG at least 48 hours before proposed tag date
- StatAnalyst independently re-derives key statistics (different seed if simulation involved)
- Each WG member signs off in the report’s WG Sign-off section
- All four signatures required before tag is cut
4.3 Quorum Exception
If one WG member is unavailable, PI may grant a 3-of-4 quorum exception, documented in the report.
5. Decision Matrix: Ship vs. Hold
| Primary metric (recall@10) | Secondary metric (MRR) | Decision | Rationale |
|---|---|---|---|
| Significant improvement (p_corr < 0.05) | Significant improvement | SHIP | Clear win |
| Significant improvement | Non-significant / neutral | SHIP with caveat | Lead metric improved; note MRR stability |
| Non-significant | Significant improvement | CONDITIONAL SHIP | Must assess whether MRR alone justifies claim; see §5.1 |
| Non-significant | Non-significant | HOLD or SHIP as no-claim | Ship only if no performance claims made; document as “neutral” |
| Any metric regressed significantly | — | HOLD | Regression must be resolved or explicitly accepted with rationale |
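The matrix can be read as a small decision function. The sketch below is one possible encoding, assuming the three boolean inputs have already been derived from `p_corrected` and the sign of each delta, and assuming the regression row takes precedence over all others; the function name is hypothetical.

```python
def ship_decision(
    primary_sig_gain: bool,
    secondary_sig_gain: bool,
    any_sig_regression: bool,
) -> str:
    """Map significance outcomes to the decision labels used in the matrix.

    primary_sig_gain:   recall@10 improved with p_corrected < 0.05
    secondary_sig_gain: MRR improved with p_corrected < 0.05
    any_sig_regression: any metric worsened with p_corrected < 0.05
    """
    # Assumption: a significant regression overrides every other row.
    if any_sig_regression:
        return "HOLD"
    if primary_sig_gain and secondary_sig_gain:
        return "SHIP"
    if primary_sig_gain:
        return "SHIP with caveat"
    if secondary_sig_gain:
        return "CONDITIONAL SHIP"
    return "HOLD or SHIP as no-claim"
```

The function only returns the matrix label. The judgment calls that follow a "CONDITIONAL SHIP" or an accepted regression remain with the WG and are recorded in the decision document.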
5.1 Conditional Ship Criteria
A feature may ship with “Conditional” status when the primary recall metric is non-significant but secondary metrics are significant, provided ALL of the following hold:
- No primary metric regressed significantly
- The feature has a strong theoretical justification for MRR gain without recall gain
- The release notes accurately reflect which metrics are/are not significant
- WG unanimous agreement (no quorum exception)
5.2 Prohibited Claims
- Never claim improvement on a metric where `p_corrected ≥ 0.05`
- Never report only the best-performing metric subset
- “~X pp improvement” claims require citing the specific metric, CI, and p-value
6. Public Release Notes Structure
Each release’s public-facing notes must have these exact sections:
Significant Improvements
Only metrics where p_corrected < 0.05 after Holm-Bonferroni correction.
Format: metric: +Xpp (95% CI [lo, hi], p=Y, h=Z)
Marginal / Non-Significant Changes
Metrics that changed within statistical noise.
Format: metric: +Xpp (95% CI, p=Y ns)
Regressions
Any metric that worsened, whether significant or not.
Format: metric: -Xpp (95% CI, p=Y)
Methodology Changes
Any deviation from the previous run’s methodology (embedder, dataset version, retrieval formula, etc.).
Benchmark Integrity
Dataset SHA256s, run environment, wall-clock time, device.
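One way to route each metric into the first three sections is sketched below. It assumes `delta` is current minus previous as a fraction, that any negative delta belongs under Regressions regardless of significance, and that a zero delta counts as non-significant; the function name and the `alpha` parameter are hypothetical.

```python
def release_section(delta: float, p_corrected: float, alpha: float = 0.05) -> str:
    """Pick the release-notes section for one metric comparison."""
    if delta < 0:
        # Regressions are listed whether or not they are significant.
        return "Regressions"
    if delta > 0 and p_corrected < alpha:
        return "Significant Improvements"
    return "Marginal / Non-Significant Changes"


def group_by_section(results: dict) -> dict:
    """Group {metric: (delta, p_corrected)} into {section: [metric, ...]}."""
    sections = {
        "Significant Improvements": [],
        "Marginal / Non-Significant Changes": [],
        "Regressions": [],
    }
    for metric, (delta, p_corrected) in results.items():
        sections[release_section(delta, p_corrected)].append(metric)
    return sections
```

Generating the section assignment from the comparison table, rather than by hand, makes it harder for a non-significant gain to end up under Significant Improvements.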
7. Versioning of This Process
This document is versioned alongside pgmnemo. Breaking changes to the process require:
- Founder approval
- A new section in this document dated and initialed
- Retroactive tagging of any releases that used the prior process
Appendix A: Process Checklist (per release)
[ ] All mandatory benchmarks run on final code
[ ] metrics.json files produced with pinned dataset SHA256
[ ] significance_test.py run: current vs. previous metrics.json
[ ] Draft report written using REPORT_TEMPLATE.md
[ ] Draft shared with WG ≥48h before tag
[ ] StatAnalyst independent re-derivation complete
[ ] All 4 WG signatures collected (or documented quorum exception)
[ ] Decision matrix applied, ship/hold documented
[ ] Public release notes follow §6 structure
[ ] No prohibited claims (§5.2) in any public-facing text
[ ] Git tag cut only after all above