pgmnemo Canonical Recall Benchmark Protocol

Protocol version: 1.0.0
Frozen: 2026-05-10
Status: CANONICAL — do not modify without bumping version and updating HISTORY.md

Release notes citing a recall improvement MUST reference this document as: “pgmnemo Recall Benchmark Protocol v1.0.0 (benchmarks/PROTOCOL.md)”


1. Purpose

This document defines the one canonical procedure for measuring recall quality of pgmnemo. Any deviation from this protocol must be (a) labelled a deviation in the results artefact, and (b) logged in benchmarks/HISTORY.md before publication.


2. Registered Corpora

2.1 LongMemEval

| Field | Value |
| --- | --- |
| Paper | Wu et al. ICLR 2025 — arXiv:2410.10813 |
| Dataset | xiaowu0162/longmemeval-cleaned, file longmemeval_s_cleaned.json |
| Split | Test split only (500 items) |
| sha256 | d6f21ea9d60a0d56f34a05b609c79c88a451d2ae03597821ea3d5a9678c3a442 |
| License | See dataset repository |
| Download | git clone https://github.com/xiaowu0162/LongMemEval "$LONGMEMEVAL_DATA_DIR" |
| Corpus unit | One item = one multi-session conversation haystack (~47.7 sessions/item) |

Query taxonomy (n=500):

| Question type | N |
| --- | --- |
| single-session-user | 70 |
| multi-session | 133 |
| single-session-preference | 30 |
| temporal-reasoning | 133 |
| knowledge-update | 78 |
| single-session-assistant | 56 |
| Total | 500 |

2.2 LoCoMo

| Field | Value |
| --- | --- |
| Paper | Maharana et al. ACL 2024 — arXiv:2402.17753 |
| Dataset | snap-research/locomo, file locomo10.json |
| Split | Full eval set (10 conversations, 1986 questions) |
| License | See dataset repository |
| Download | huggingface-cli download snap-research/locomo --local-dir "$LOCOMO_DATA_DIR" |
| Corpus unit | Session-level — one segment per dialog session (not per turn). See §2.2.1. |

2.2.1 Session-level granularity rule (MANDATORY)

Corpus must be extracted at session granularity: one text segment per dialog session, formed by concatenating all turns within that session. This yields ~272 segments for locomo10.json (10 conversations × ~27 sessions each).

DO NOT extract at turn granularity. Turn-level extraction was a methodology bug (deprecated run locomo/results/v0.2.1_20260509/); it inflates corpus size to 5882 segments and depresses recall@10 by ~43pp vs. the paper-class result. See benchmarks/HISTORY.md (2026-05-09 entry) for the full correction record.

Evidence reference normalisation: strip the turn suffix from evidence IDs before matching (e.g. "D1:3" → "D1"). All 1982 questions with evidence must resolve to at least one corpus segment; verify 100% oracle coverage before any run.
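The granularity and normalisation rules above can be sketched as follows. This is an illustrative sketch, not the actual runner code: the `conversation` schema (a mapping of session ID to an ordered list of `{"speaker", "text"}` turns) and all helper names are assumptions.

```python
def normalize_evidence_id(evidence_id: str) -> str:
    """Strip the turn suffix so 'D1:3' matches the session segment 'D1'."""
    return evidence_id.split(":", 1)[0]

def build_session_segments(conversation: dict) -> dict:
    """One segment per dialog session (MANDATORY per §2.2.1): concatenate
    all turns of a session, in order, into a single text segment."""
    return {
        session_id: "\n".join(f"{t['speaker']}: {t['text']}" for t in turns)
        for session_id, turns in conversation.items()
    }

def oracle_coverage(questions: list, segments: dict) -> float:
    """Fraction of evidence-bearing questions whose evidence resolves to
    at least one corpus segment. Must be 1.0 before any run."""
    with_evidence = [q for q in questions if q.get("evidence")]
    hits = sum(
        any(normalize_evidence_id(e) in segments for e in q["evidence"])
        for q in with_evidence
    )
    return hits / len(with_evidence)
```

A run should abort unless `oracle_coverage(...)` returns exactly 1.0 over all 1982 evidence-bearing questions.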


3. Embedding Sources

3.1 Canonical embedders

| Benchmark | Canonical embedder | Dimensions | Source |
| --- | --- | --- | --- |
| LongMemEval | BAAI/bge-m3 | 1024 | Hugging Face |
| LoCoMo | facebook/dragon-plus | 768 (zero-padded to 1024 in pgvector) | Hugging Face |

LongMemEval deviation note: The Wu et al. paper uses NovaSearch/stella_en_1.5B_v5. bge-m3 is a permanent protocol-level substitution (not a per-run deviation) because Stella V5 modeling_qwen.py is incompatible with transformers ≥5.8.0. The substitution is documented in benchmarks/longmemeval/ADDENDA/LONGMEMEVAL_EMBEDDER_BGE_M3.md and in benchmarks/HISTORY.md (2026-05-09). Claims based on this protocol must disclose this substitution.
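The LoCoMo row notes that dragon-plus vectors are zero-padded from 768 to 1024 dimensions to fit the shared pgvector column. A minimal sketch; the helper name is illustrative:

```python
import numpy as np

def pad_to_dim(vec: np.ndarray, target_dim: int = 1024) -> np.ndarray:
    """Zero-pad a 768-d dragon-plus embedding to the 1024-d pgvector column.

    Appending zeros to both stored and query vectors leaves dot products
    and norms unchanged, so cosine similarity between padded vectors equals
    cosine similarity between the originals.
    """
    if vec.shape[-1] > target_dim:
        raise ValueError(f"vector dim {vec.shape[-1]} exceeds target {target_dim}")
    return np.pad(vec, (0, target_dim - vec.shape[-1]))
```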

3.2 Truncation

| Parameter | Value |
| --- | --- |
| max_seq_length | 512 tokens (bge-m3 default cap) |
| batch_size | 8 (MPS-safe) |
| PYTHONHASHSEED | 42 |
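These parameters can be pinned in one place in the harness. A sketch only: `EMBED_CONFIG` and `apply_protocol_env` are illustrative names, not part of the actual runner.

```python
import os

# Protocol-pinned embedding parameters from §3.2 (illustrative container).
EMBED_CONFIG = {
    "model_name": "BAAI/bge-m3",
    "max_seq_length": 512,  # truncation cap, in tokens
    "batch_size": 8,        # MPS-safe batch size
}

def apply_protocol_env() -> None:
    """Pin PYTHONHASHSEED=42 for subprocesses.

    Caveat: to affect the current interpreter's own hash randomisation,
    PYTHONHASHSEED must be set before Python starts (e.g. in the shell,
    as in §7); setting it here only propagates to child processes.
    """
    os.environ["PYTHONHASHSEED"] = "42"
```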

4. Recall Metric Definition

4.1 Primary metrics

| Metric | Definition |
| --- | --- |
| recall@k | Fraction of questions for which at least one ground-truth evidence segment appears in the top-k retrieved results. Binary per question. |
| MRR | Mean Reciprocal Rank over all questions. 1/rank of the first relevant result; 0 if not in top-k (k=50 for MRR). |
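The two definitions above can be sketched directly (illustrative helpers, not the runner's API):

```python
def recall_at_k(retrieved: list, relevant: set, k: int = 10) -> int:
    """Binary per question: 1 if any ground-truth evidence segment
    appears in the top-k retrieved results, else 0."""
    return int(any(seg in relevant for seg in retrieved[:k]))

def mrr(per_question: list, k: int = 50) -> float:
    """Mean Reciprocal Rank over (retrieved, relevant) pairs:
    1/rank of the first relevant result, 0 if none appears in the top-k."""
    total = 0.0
    for retrieved, relevant in per_question:
        for rank, seg in enumerate(retrieved[:k], start=1):
            if seg in relevant:
                total += 1.0 / rank
                break
    return total / len(per_question)
```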

4.2 Retrieval function

SELECT *
FROM pgmnemo.recall_lessons(
    embedding       := $query_embedding,  -- float4[] dim=1024
    k               := $recall_k,         -- protocol default: 10
    query_text      := $query_text,       -- for BM25 component
    project_id      := $project_uuid
)
ORDER BY score DESC
LIMIT $recall_k;

Active scoring components: cosine similarity (HNSW) + BM25 (FTS) + recency decay + importance weight + graph proximity.

4.3 GUC state required

SET pgmnemo.gate_strict = 'warn';   -- provenance gate: warn, not block
SET pgmnemo.tenant_id   = '<bench_uuid>';
SET pgmnemo.recency_weight = 0.10;  -- protocol default (calibration result)

Record actual GUC values in metrics.json["guc_state"] per run.
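Capturing the GUC state can be a small pure step in the harness. A sketch: `show_fn(name)` stands in for whatever executes `SHOW <name>` against the benchmark connection (e.g. a thin wrapper over a psycopg cursor); the helper name and shape are illustrative, not pgmnemo API.

```python
# GUCs the protocol requires in metrics.json["guc_state"] (§4.3).
REQUIRED_GUCS = (
    "pgmnemo.gate_strict",
    "pgmnemo.tenant_id",
    "pgmnemo.recency_weight",
)

def record_guc_state(show_fn, metrics: dict) -> dict:
    """Copy the actual session GUC values into metrics["guc_state"],
    so the published artefact reflects what was really set, not the
    protocol defaults."""
    metrics["guc_state"] = {name: show_fn(name) for name in REQUIRED_GUCS}
    return metrics
```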


5. Include / Exclude Rules for Unverified Results

A result is VERIFIED only if ALL of the following gates pass. Otherwise it is UNVERIFIED and MUST NOT be cited in release notes:

| Gate | Requirement |
| --- | --- |
| Dataset integrity | sha256sum <corpus_archive> matches §2 value |
| Version pin | SELECT pgmnemo.version() matches metrics.json["pgmnemo_version"] |
| Seed recorded | PYTHONHASHSEED=42 set; value in metrics.json["seed"] |
| Oracle coverage | LoCoMo: 100% of evidence items resolve to ≥1 corpus segment |
| Corpus granularity | LoCoMo: session-level extraction confirmed (segments ≈ 272, not ~5882) |
| Artefacts present | metrics.json, report.md, raw_retrievals.jsonl all committed |
| BLOCKED absent | No BLOCKED.md in the results directory |

A result with a BLOCKED.md present is BLOCKED and must carry that label if referenced at all.


6. Acceptable Variance Band

| Metric | Benchmark | Acceptable run-to-run variance |
| --- | --- | --- |
| recall@10 | LongMemEval | ± 0.005 (95% CI half-width ~0.019; run variance << CI) |
| recall@10 | LoCoMo | ± 0.010 |
| MRR | LongMemEval | ± 0.010 |
| MRR | LoCoMo | ± 0.015 |

Variance exceeding these bands must be investigated before a result is declared canonical. Typical causes: corpus extraction granularity bug (§2.2.1), embedding model substitution, pgmnemo GUC drift, PostgreSQL planner variance on cold vs. warm HNSW index.

Baseline numbers for v0.2.1 (protocol v1.0.0):

| Benchmark | recall@10 | recall@10 95% CI | MRR | MRR 95% CI |
| --- | --- | --- | --- | --- |
| LongMemEval | 0.933 | (0.914, 0.952) | 0.855 | (0.829, 0.882) |
| LoCoMo | 0.795 | | 0.548 | |
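As a rough plausibility check on the CI figures above: a normal-approximation (Wald) binomial half-width for recall@10 = 0.933 at n = 500 (LongMemEval) comes out near 0.022, the same order as the stated (0.914, 0.952) interval; the published interval may use a different estimator (e.g. Wilson or bootstrap), so an exact match is not expected.

```python
import math

def binomial_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """95% half-width of the normal-approximation (Wald) CI for a
    binomial proportion: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)

# LongMemEval: recall@10 = 0.933 over 500 questions -> ~0.022
hw = binomial_ci_halfwidth(0.933, 500)
```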

7. Canonical Run Procedure (summary)

The full step-by-step procedure with exact commands is in benchmarks/README.md §5. This section gives the canonical command sequence; the README is authoritative on parameters.

# 1. Install pgmnemo at the exact tag
git clone <repo> pgmnemo && cd pgmnemo && git checkout <VERSION_TAG>
make && sudo make install

# 2. Create benchmark DB
createdb pgmnemo_bench
psql pgmnemo_bench -c "CREATE EXTENSION IF NOT EXISTS vector; CREATE EXTENSION IF NOT EXISTS pgmnemo;"

# 3. Set environment
export PYTHONHASHSEED=42
export PGMNEMO_DSN="postgresql://user:pass@host:5432/pgmnemo_bench"

# 4. LongMemEval
cd benchmarks/longmemeval && python runner.py --version <VERSION_TAG> --dry-run  # must exit 0
python runner.py --version <VERSION_TAG>

# 5. LoCoMo
cd ../locomo && bash run_locomo.sh <VERSION_TAG> results/<VERSION_TAG>_$(date +%Y%m%d)

# 6. Verify outputs — each results/ dir must contain:
#    metrics.json  report.md  raw_retrievals.jsonl
#    No BLOCKED.md present

8. Citation in Release Notes

When a release note cites a recall improvement, use this template:

Recall improvement measured per pgmnemo Recall Benchmark Protocol v1.0.0
(benchmarks/PROTOCOL.md). Corpus: [LongMemEval | LoCoMo]. Embedder: [name].
Result: recall@10 [value] (v[prev] → v[new]). Full run artefacts:
benchmarks/[bench]/results/[version_date]/

Do not cite a recall number without the protocol version reference. Do not cite a result with a BLOCKED.md marker.


9. Protocol Versioning

| Version | Date | Change |
| --- | --- | --- |
| 1.0.0 | 2026-05-10 | Initial frozen protocol; baseline from v0.2.1 runs |

To amend this protocol:

  1. Bump version (semver: breaking change = major, methodology addition = minor, typo = patch).
  2. Add a row to the table above.
  3. Add an entry to benchmarks/HISTORY.md.
  4. Re-run both benchmarks under the new protocol and update §6 baseline numbers.
  5. Update README.md Benchmarks section to cite the new version.

10. References

@article{wu2024longmemeval,
  title   = {{LongMemEval}: Benchmarking Chat Assistants on Long-Term Interactive Memory},
  author  = {Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yuwei and
             Chang, Kai-Wei and Yu, Dong},
  year    = {2024},
  journal = {arXiv preprint arXiv:2410.10813}
}

@article{maharana2024locomo,
  title   = {Evaluating Very Long-Term Conversational Memory of {LLM} Agents},
  author  = {Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and
             Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
  year    = {2024},
  journal = {arXiv preprint arXiv:2402.17753}
}