pgmnemo Canonical Recall Benchmark Protocol

Protocol version: 1.0.0
Frozen: 2026-05-10
Status: CANONICAL — do not modify without bumping version and updating HISTORY.md

Release notes citing a recall improvement MUST reference this document as: “pgmnemo Recall Benchmark Protocol v1.0.0 (benchmarks/PROTOCOL.md)”


1. Purpose

This document defines the one canonical procedure for measuring recall quality of pgmnemo. Any deviation from this protocol must be (a) labelled a deviation in the results artefact, and (b) logged in benchmarks/HISTORY.md before publication.


2. Registered Corpora

2.1 LongMemEval

| Field | Value |
| --- | --- |
| Paper | Wu et al. ICLR 2025 — arXiv:2410.10813 |
| Dataset | xiaowu0162/longmemeval-cleaned, file longmemeval_s_cleaned.json |
| Split | Test split only (500 items) |
| sha256 | d6f21ea9d60a0d56f34a05b609c79c88a451d2ae03597821ea3d5a9678c3a442 |
| License | See dataset repository |
| Download | git clone https://github.com/xiaowu0162/LongMemEval "$LONGMEMEVAL_DATA_DIR" |
| Corpus unit | One item = one multi-session conversation haystack (~47.7 sessions/item) |

Query taxonomy (n=500):

| Question type | N |
| --- | --- |
| single-session-user | 70 |
| multi-session | 133 |
| single-session-preference | 30 |
| temporal-reasoning | 133 |
| knowledge-update | 78 |
| single-session-assistant | 56 |
| Total | 500 |

2.2 LoCoMo

| Field | Value |
| --- | --- |
| Paper | Maharana et al. ACL 2024 — arXiv:2402.17753 |
| Dataset | snap-research/locomo, file locomo10.json |
| Split | Full eval set (10 conversations, 1986 questions) |
| License | See dataset repository |
| Download | huggingface-cli download snap-research/locomo --local-dir "$LOCOMO_DATA_DIR" |
| Corpus unit | Session-level — one segment per dialog session (not per turn). See §2.2.1. |

2.2.1 Session-level granularity rule (MANDATORY)

Corpus must be extracted at session granularity: one text segment per dialog session, formed by concatenating all turns within that session. This yields ~272 segments for locomo10.json (10 conversations × ~27 sessions each).

DO NOT extract at turn granularity. Turn-level extraction was a methodology bug (deprecated run locomo/results/v0.2.1_20260509/); it inflates corpus size to 5882 segments and depresses recall@10 by ~43pp vs. the paper-class result. See benchmarks/HISTORY.md (2026-05-09 entry) for the full correction record.

Evidence reference normalisation: strip the turn suffix from evidence IDs before matching (e.g. "D1:3" → "D1"). All 1982 questions with evidence must resolve to at least one corpus segment; verify 100% oracle coverage before any run.
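The granularity and normalisation rules above can be sketched as follows. This is an illustrative sketch, not the actual runner code: the `conversation` schema (a mapping of session ID to an ordered list of `{"speaker", "text"}` turns) and all helper names are assumptions.

```python
def normalize_evidence_id(evidence_id: str) -> str:
    """Strip the turn suffix so 'D1:3' matches the session segment 'D1'."""
    return evidence_id.split(":", 1)[0]

def build_session_segments(conversation: dict) -> dict:
    """One segment per dialog session (MANDATORY per §2.2.1): concatenate
    all turns of a session, in order, into a single text segment."""
    return {
        session_id: "\n".join(f"{t['speaker']}: {t['text']}" for t in turns)
        for session_id, turns in conversation.items()
    }

def oracle_coverage(questions: list, segments: dict) -> float:
    """Fraction of evidence-bearing questions whose evidence resolves to
    at least one corpus segment. Must be 1.0 before any run."""
    with_evidence = [q for q in questions if q.get("evidence")]
    hits = sum(
        any(normalize_evidence_id(e) in segments for e in q["evidence"])
        for q in with_evidence
    )
    return hits / len(with_evidence)
```

A run should abort unless `oracle_coverage(...)` returns exactly 1.0 over all 1982 evidence-bearing questions.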


3. Embedding Sources

3.1 Canonical embedders

| Benchmark | Canonical embedder | Dimensions | Source |
| --- | --- | --- | --- |
| LongMemEval | BAAI/bge-m3 | 1024 | Hugging Face |
| LoCoMo | facebook/dragon-plus | 768 (zero-padded to 1024 in pgvector) | Hugging Face |

LongMemEval deviation note: The Wu et al. paper uses NovaSearch/stella_en_1.5B_v5. bge-m3 is a permanent protocol-level substitution (not a per-run deviation) because Stella V5 modeling_qwen.py is incompatible with transformers ≥5.8.0. The substitution is documented in benchmarks/longmemeval/ADDENDA/LONGMEMEVAL_EMBEDDER_BGE_M3.md and in benchmarks/HISTORY.md (2026-05-09). Claims based on this protocol must disclose this substitution.
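The LoCoMo row notes that dragon-plus vectors are zero-padded from 768 to 1024 dimensions to fit the shared pgvector column. A minimal sketch; the helper name is illustrative:

```python
import numpy as np

def pad_to_dim(vec: np.ndarray, target_dim: int = 1024) -> np.ndarray:
    """Zero-pad a 768-d dragon-plus embedding to the 1024-d pgvector column.

    Appending zeros to both stored and query vectors leaves dot products
    and norms unchanged, so cosine similarity between padded vectors equals
    cosine similarity between the originals.
    """
    if vec.shape[-1] > target_dim:
        raise ValueError(f"vector dim {vec.shape[-1]} exceeds target {target_dim}")
    return np.pad(vec, (0, target_dim - vec.shape[-1]))
```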

3.2 Truncation

| Parameter | Value |
| --- | --- |
| max_seq_length | 512 tokens (bge-m3 default cap) |
| batch_size | 8 (MPS-safe) |
| PYTHONHASHSEED | 42 |
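These parameters can be pinned in one place in the harness. A sketch only: `EMBED_CONFIG` and `apply_protocol_env` are illustrative names, not part of the actual runner.

```python
import os

# Protocol-pinned embedding parameters from §3.2 (illustrative container).
EMBED_CONFIG = {
    "model_name": "BAAI/bge-m3",
    "max_seq_length": 512,  # truncation cap, in tokens
    "batch_size": 8,        # MPS-safe batch size
}

def apply_protocol_env() -> None:
    """Pin PYTHONHASHSEED=42 for subprocesses.

    Caveat: to affect the current interpreter's own hash randomisation,
    PYTHONHASHSEED must be set before Python starts (e.g. in the shell,
    as in §7); setting it here only propagates to child processes.
    """
    os.environ["PYTHONHASHSEED"] = "42"
```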

4. Recall Metric Definition

4.1 Primary metrics

| Metric | Definition |
| --- | --- |
| recall@k | Fraction of questions for which at least one ground-truth evidence segment appears in the top-k retrieved results. Binary per question. |
| MRR | Mean Reciprocal Rank over all questions. 1/rank of the first relevant result; 0 if not in top-k (k=50 for MRR). |
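The two definitions above can be sketched directly (illustrative helpers, not the runner's API):

```python
def recall_at_k(retrieved: list, relevant: set, k: int = 10) -> int:
    """Binary per question: 1 if any ground-truth evidence segment
    appears in the top-k retrieved results, else 0."""
    return int(any(seg in relevant for seg in retrieved[:k]))

def mrr(per_question: list, k: int = 50) -> float:
    """Mean Reciprocal Rank over (retrieved, relevant) pairs:
    1/rank of the first relevant result, 0 if none appears in the top-k."""
    total = 0.0
    for retrieved, relevant in per_question:
        for rank, seg in enumerate(retrieved[:k], start=1):
            if seg in relevant:
                total += 1.0 / rank
                break
    return total / len(per_question)
```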

4.2 Retrieval function

SELECT *
FROM pgmnemo.recall_lessons(
    embedding       := $query_embedding,  -- float4[] dim=1024
    k               := $recall_k,         -- protocol default: 10
    query_text      := $query_text,       -- for BM25 component
    project_id      := $project_uuid
)
ORDER BY score DESC
LIMIT $recall_k;

Active scoring components: cosine similarity (HNSW) + BM25 (FTS) + recency decay + importance weight + graph proximity.

4.3 GUC state required

SET pgmnemo.gate_strict = 'warn';   -- provenance gate: warn, not block
SET pgmnemo.tenant_id   = '<bench_uuid>';
SET pgmnemo.recency_weight = 0.10;  -- protocol default (calibration result)

Record actual GUC values in metrics.json["guc_state"] per run.
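Capturing the GUC state can be a small pure step in the harness. A sketch: `show_fn(name)` stands in for whatever executes `SHOW <name>` against the benchmark connection (e.g. a thin wrapper over a psycopg cursor); the helper name and shape are illustrative, not pgmnemo API.

```python
# GUCs the protocol requires in metrics.json["guc_state"] (§4.3).
REQUIRED_GUCS = (
    "pgmnemo.gate_strict",
    "pgmnemo.tenant_id",
    "pgmnemo.recency_weight",
)

def record_guc_state(show_fn, metrics: dict) -> dict:
    """Copy the actual session GUC values into metrics["guc_state"],
    so the published artefact reflects what was really set, not the
    protocol defaults."""
    metrics["guc_state"] = {name: show_fn(name) for name in REQUIRED_GUCS}
    return metrics
```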


5. Include / Exclude Rules for Unverified Results

A result is VERIFIED only if ALL of the following gates pass. Otherwise it is UNVERIFIED and MUST NOT be cited in release notes:

| Gate | Requirement |
| --- | --- |
| Dataset integrity | sha256sum <corpus_archive> matches §2 value |
| Version pin | SELECT pgmnemo.version() matches metrics.json["pgmnemo_version"] |
| Seed recorded | PYTHONHASHSEED=42 set; value in metrics.json["seed"] |
| Oracle coverage | LoCoMo: 100% of evidence items resolve to ≥1 corpus segment |
| Corpus granularity | LoCoMo: session-level extraction confirmed (segments ≈ 272, not ~5882) |
| Artefacts present | metrics.json, report.md, raw_retrievals.jsonl all committed |
| BLOCKED absent | No BLOCKED.md in the results directory |

A result with a BLOCKED.md present is BLOCKED and must carry that label if referenced at all.


6. Acceptable Variance Band

| Metric | Benchmark | Acceptable run-to-run variance |
| --- | --- | --- |
| recall@10 | LongMemEval | ± 0.005 (95% CI half-width ~0.019; run variance << CI) |
| recall@10 | LoCoMo | ± 0.010 |
| MRR | LongMemEval | ± 0.010 |
| MRR | LoCoMo | ± 0.015 |

Variance exceeding these bands must be investigated before a result is declared canonical. Typical causes: corpus extraction granularity bug (§2.2.1), embedding model substitution, pgmnemo GUC drift, PostgreSQL planner variance on cold vs. warm HNSW index.

Baseline numbers for v0.2.1 (protocol v1.0.0):

| Benchmark | recall@10 | recall@10 95% CI | MRR | MRR 95% CI |
| --- | --- | --- | --- | --- |
| LongMemEval | 0.933 | (0.914, 0.952) | 0.855 | (0.829, 0.882) |
| LoCoMo | 0.795 | | 0.548 | |
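As a rough plausibility check on the CI figures above: a normal-approximation (Wald) binomial half-width for recall@10 = 0.933 at n = 500 (LongMemEval) comes out near 0.022, the same order as the stated (0.914, 0.952) interval; the published interval may use a different estimator (e.g. Wilson or bootstrap), so an exact match is not expected.

```python
import math

def binomial_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """95% half-width of the normal-approximation (Wald) CI for a
    binomial proportion: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)

# LongMemEval: recall@10 = 0.933 over 500 questions -> ~0.022
hw = binomial_ci_halfwidth(0.933, 500)
```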

7. Canonical Run Procedure (summary)

The full step-by-step procedure with exact commands is in benchmarks/README.md §5. This section gives the canonical command sequence; the README is authoritative on parameters.

# 1. Install pgmnemo at the exact tag
git clone <repo> pgmnemo && cd pgmnemo && git checkout <VERSION_TAG>
make && sudo make install

# 2. Create benchmark DB
createdb pgmnemo_bench
psql pgmnemo_bench -c "CREATE EXTENSION IF NOT EXISTS vector; CREATE EXTENSION IF NOT EXISTS pgmnemo;"

# 3. Set environment
export PYTHONHASHSEED=42
export PGMNEMO_DSN="postgresql://user:pass@host:5432/pgmnemo_bench"

# 4. LongMemEval
cd benchmarks/longmemeval && python runner.py --version <VERSION_TAG> --dry-run  # must exit 0
python runner.py --version <VERSION_TAG>

# 5. LoCoMo
cd ../locomo && bash run_locomo.sh <VERSION_TAG> results/<VERSION_TAG>_$(date +%Y%m%d)

# 6. Verify outputs — each results/ dir must contain:
#    metrics.json  report.md  raw_retrievals.jsonl
#    No BLOCKED.md present

8. Citation in Release Notes

When a release note cites a recall improvement, use this template:

Recall improvement measured per pgmnemo Recall Benchmark Protocol v1.0.0
(benchmarks/PROTOCOL.md). Corpus: [LongMemEval | LoCoMo]. Embedder: [name].
Result: recall@10 [value] (v[prev] → v[new]). Full run artefacts:
benchmarks/[bench]/results/[version_date]/

Do not cite a recall number without the protocol version reference. Do not cite a result with a BLOCKED.md marker.


9. Protocol Versioning

| Version | Date | Change |
| --- | --- | --- |
| 1.0.0 | 2026-05-10 | Initial frozen protocol; baseline from v0.2.1 runs |

To amend this protocol:

  1. Bump version (semver: breaking change = major, methodology addition = minor, typo = patch).
  2. Add a row to the table above.
  3. Add an entry to benchmarks/HISTORY.md.
  4. Re-run both benchmarks under the new protocol and update §6 baseline numbers.
  5. Update README.md Benchmarks section to cite the new version.

10. References

@article{wu2024longmemeval,
  title   = {{LongMemEval}: Benchmarking Chat Assistants on Long-Term Interactive Memory},
  author  = {Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yuwei and
             Chang, Kai-Wei and Yu, Dong},
  year    = {2024},
  journal = {arXiv preprint arXiv:2410.10813}
}

@article{maharana2024locomo,
  title   = {Evaluating Very Long-Term Conversational Memory of {LLM} Agents},
  author  = {Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and
             Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
  year    = {2024},
  journal = {arXiv preprint arXiv:2402.17753}
}