Benchmarks

§ A

What is LongMemEval?

LongMemEval is an open benchmark from Di Wu et al. (UCLA and Tencent AI Lab), designed to probe whether AI memory systems can actually recall information from long multi-session conversations. It ships 500 hand-crafted questions grounded in synthesized conversation histories that span many sessions and tens of thousands of tokens of context.

Two variants matter for comparing memory systems:

Oracle — the gold-standard sessions relevant to each question are included in the corpus. This isolates how well a system retrieves from a known-good pool. A system that can't hit high recall on Oracle isn't retrieving well at all.

S-variant (Short) — the harder variant. Relevant sessions are mixed into a much larger noise pool of irrelevant chat. A system has to distinguish signal from plausible but wrong context. This is closer to real-world conditions.

Each variant is evaluated two ways: retrieval recall@k (did the right session(s) surface in the top-k results?) and end-to-end QA accuracy (given the retrieved context, did the answering model produce the right answer?). Retrieval is where memory infrastructure competes; QA is where retrieval + generation + context structure all matter.

Benchmark source: github.com/xiaowu0162/LongMemEval.

§ B

Methodology

All numbers are produced by the harness in benchmarks/longmemeval/ in the open source repo. The pipeline is deliberately simple so the scores are easy to reproduce and audit.

Dataset. The official LongMemEval corpus (500 questions per variant), pulled on demand; not redistributed in the repo.

Ingest. Each conversation session is deposited as a Relay context package with session metadata, then embedded. We use the Xenova/all-MiniLM-L6-v2 model (384-dim, ONNX, runs locally) so retrieval is zero-API-cost and reproducible offline.
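
As a rough illustration of what the semantic half of retrieval does with those 384-dim vectors, here is a minimal sketch that ranks stored embeddings against a query embedding. It assumes cosine similarity as the comparison; the function names are hypothetical and are not taken from the harness.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Assumption: the semantic stage compares embeddings by cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return document ids ordered by similarity to the query, best first.
function rankBySimilarity(
  query: number[],
  docs: { id: string; embedding: number[] }[]
): string[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .map((d) => d.id);
}
```

In practice the vectors would come from the embedding model rather than being written by hand; the toy 2-dim vectors in a quick check behave the same way as 384-dim ones.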

Retrieval. Hybrid BM25 + semantic similarity via reciprocal rank fusion, followed by cross-encoder reranking. A gradient time-window gives recency weight without hard-clipping older packages. Top-5 results are fed to the QA stage.

QA generation. Claude Opus 4.6 reads the retrieved context and answers the question.

Judging. End-to-end QA accuracy is graded by an independent GPT-4o judge using the standard LongMemEval judge prompt. We do not self-grade. Retrieval recall is computed mechanically against the gold-session labels.

Cost. Retrieval is free (local embeddings). QA costs a single Claude Opus call per question. Judging costs a single GPT-4o call per question. A full eval run costs about $8 per variant at current API prices.

§ C

Oracle variant — full results

Oracle isolates retrieval quality. Relay hits ceiling on recall_any@5 — every question in the set has at least one gold session in the top five.

Metric	Score	Detail
recall_any@5	100.0%	500 / 500
recall_all@5	99.4%	497 / 500
End-to-end QA accuracy	92.2%	461 / 500

recall_any@k: at least one gold session in top-k. recall_all@k: all gold sessions in top-k.
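
The two definitions above can be sketched as follows. This is a minimal illustration, assuming each question comes with a ranked list of retrieved session ids and a non-empty list of gold session ids; the function names are hypothetical and not the harness's own.

```typescript
// recall_any@k: 1 if at least one gold session appears in the top-k, else 0.
function recallAnyAtK(retrieved: string[], gold: string[], k: number): number {
  const topK = retrieved.slice(0, k);
  return gold.some((g) => topK.indexOf(g) !== -1) ? 1 : 0;
}

// recall_all@k: 1 only if every gold session appears in the top-k, else 0.
// Assumes gold is non-empty (an empty gold list would trivially score 1).
function recallAllAtK(retrieved: string[], gold: string[], k: number): number {
  const topK = retrieved.slice(0, k);
  return gold.every((g) => topK.indexOf(g) !== -1) ? 1 : 0;
}

// Average per-question hits into a percentage, as reported in the tables.
function meanPercent(hits: number[]): number {
  const sum = hits.reduce((acc, h) => acc + h, 0);
  return (100 * sum) / hits.length;
}
```

A question with two gold sessions where only one lands in the top five counts toward recall_any@5 but not recall_all@5, which is why the second metric is never higher than the first.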

§ D

S-variant — full results

S-variant mixes gold sessions into a larger noise pool. Retrieval has to distinguish plausible-but-wrong context, which tests whether the ranker is actually discriminating or just riding the semantic-similarity floor.

Metric	Score	Detail
recall_any@5	97.0%	485 / 500
recall_all@5	84.6%	423 / 500
End-to-end QA accuracy	84.8%	424 / 500

§ E

How Relay compares

Published LongMemEval results from other systems, side-by-side. Where a system publishes a number for a metric, we cite it; where it doesn't, the cell shows a dash (not reported). Relay's R@5 figures are recall_any@5.

System	Oracle R@5	S-variant R@5	Oracle QA	S-variant QA
Relay	100.0%	97.0%	92.2%	84.8%
MemPalace (claimed)	96.6%*	—	—	—
AgentMemory	—	—	96.2%	—
OMEGA	—	—	—	95.4%†

Notes on comparisons

* MemPalace's 96.6% is disputed; see Issue #29 for the audit details.
† OMEGA's S-variant QA is self-graded and runs single-machine local only.
AgentMemory's Oracle QA uses a similar Claude Opus pipeline to ours.

Positioning

Relay is not trying to be a better memory library. It's a coordination protocol that happens to hit best-in-class retrieval as a side effect of treating context packages as first-class, queryable artifacts. It is multi-agent, multi-vendor, and cloud-coordinated, not a single-user, single-machine memory blob.

§ F

How Relay gets these scores

Four components, each one load-bearing:

Hybrid Retrieval

BM25 + semantic embeddings fused by reciprocal rank. BM25 catches exact-keyword queries (names, codenames, IDs) that pure semantic similarity misses; semantic catches paraphrased intent that BM25 misses. Neither alone hits ceiling.
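
A minimal sketch of reciprocal rank fusion, to show how two ranked lists combine into one. It assumes the commonly used smoothing constant k = 60; the constant Relay actually uses is not stated here, and the function name is hypothetical.

```typescript
// Reciprocal rank fusion: each ranking contributes 1 / (k + rank) to a
// document's score, with rank starting at 1. Documents are returned by
// descending fused score.
// Assumption: k = 60 is the conventional default, not a documented Relay value.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores: Record<string, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores[id] = (scores[id] || 0) + 1 / (k + index + 1);
    });
  }
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale. A document that places second in both lists outranks one that places first in only one of them.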

Cross-Encoder Reranker

Top-K candidates from the hybrid stage go through a cross-encoder that scores query–document pairs jointly. This is where the S-variant jumps from mid-80s to 97% — the reranker can throw out plausible-but-wrong context the initial retriever accepted.
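
The shape of a rerank stage can be sketched as below. The pair scorer is passed in as a function; the token-overlap scorer included here is only a stand-in so the example runs on its own, and is not a cross-encoder. All names are hypothetical.

```typescript
// A pair scorer sees the query and the document together and returns a
// relevance score. A real cross-encoder model would fill this role.
type PairScorer = (query: string, document: string) => number;

interface Candidate {
  id: string;
  text: string;
}

// Rescore every candidate against the query and keep the best topK.
function rerank(
  query: string,
  candidates: Candidate[],
  scorePair: PairScorer,
  topK: number
): Candidate[] {
  return candidates
    .map((c) => ({ candidate: c, score: scorePair(query, c.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => s.candidate);
}

// Stand-in scorer for demonstration only: the fraction of query tokens
// that also occur in the document.
function tokenOverlapScorer(query: string, document: string): number {
  const tokenize = (s: string): string[] =>
    s.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
  const docTokens = tokenize(document);
  const queryTokens = tokenize(query);
  if (queryTokens.length === 0) return 0;
  const hits = queryTokens.filter((t) => docTokens.indexOf(t) !== -1).length;
  return hits / queryTokens.length;
}
```

The point of the design is that the first stage only has to get the right package into the candidate list; the more expensive joint scoring then runs on a handful of candidates rather than the whole corpus.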

Gradient Time-Windowing

Instead of hard-clipping context to the last N packages, Relay applies a soft gradient — recent packages rank higher, older packages fade in influence but don't disappear. Long-horizon memory without forgetting everything older than a window.
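
One way to picture a soft gradient is a decay curve with a floor. The sketch below assumes an exponential half-life and a minimum weight; the curve shape, the 30-day half-life, the 0.2 floor, and the function names are all illustrative assumptions, not Relay's actual parameters.

```typescript
// Recency weight in [floor, 1]: 1 for a brand-new package, halving toward
// the floor every halfLifeDays, and never reaching zero.
// Assumption: exponential decay with a floor; values are illustrative.
function recencyWeight(
  ageDays: number,
  halfLifeDays: number = 30,
  floor: number = 0.2
): number {
  const age = Math.max(0, ageDays); // clamp future timestamps to "now"
  return floor + (1 - floor) * Math.pow(0.5, age / halfLifeDays);
}

// Scale a relevance score by the recency weight of its package.
function applyRecency(relevance: number, ageDays: number): number {
  return relevance * recencyWeight(ageDays);
}
```

With a floor above zero, a strongly relevant old package can still outrank a weakly relevant recent one, which a hard window of the last N packages cannot do.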

Structured Metadata

Every package carries decisions, open questions, handoff notes, topic, and artifact type as queryable dimensions. These get indexed alongside the raw text and contribute their own signal to the retriever — which is how high-significance packages win against shallow keyword matches.
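
To make the idea concrete, here is a hypothetical sketch of a package with those metadata dimensions, flattened into named fields that can be indexed and weighted separately. The interface, field names, and weighting scheme are assumptions for illustration, not Relay's actual schema or scoring.

```typescript
// Hypothetical package shape, using the metadata dimensions listed above.
interface ContextPackage {
  id: string;
  text: string;
  topic: string;
  artifactType: string;
  decisions: string[];
  openQuestions: string[];
  handoffNotes: string[];
}

// Flatten a package into named text fields for per-field indexing.
function toIndexFields(pkg: ContextPackage): Record<string, string> {
  return {
    text: pkg.text,
    topic: pkg.topic,
    artifactType: pkg.artifactType,
    decisions: pkg.decisions.join("\n"),
    openQuestions: pkg.openQuestions.join("\n"),
    handoffNotes: pkg.handoffNotes.join("\n"),
  };
}

// Sum the weights of all fields that contain the term (case-insensitive).
// Fields without an explicit weight count as 1.
function fieldMatchScore(
  fields: Record<string, string>,
  term: string,
  weights: Record<string, number>
): number {
  const needle = term.toLowerCase();
  let score = 0;
  for (const name of Object.keys(fields)) {
    if (fields[name].toLowerCase().indexOf(needle) !== -1) {
      score += weights[name] !== undefined ? weights[name] : 1;
    }
  }
  return score;
}
```

Weighting a match in a decisions field above a match in raw text is one simple way a package that recorded a decision could outrank one that merely mentions the same keyword in passing.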

§ G

Reproduce these numbers

The benchmark harness is in benchmarks/longmemeval/ in the open source repo. It pulls the official LongMemEval corpus, ingests it into a local Relay, runs retrieval and QA, writes raw .jsonl result files, and prints summary metrics. Retrieval results are in results-<variant>-k5-<run_id>.json; QA results in qa-results-<variant>-<run_id>.jsonl.

# After cloning and building the repo
$ cd benchmarks/longmemeval
$ pnpm install
$ pnpm run fetch:data       # download the public corpus
$ pnpm run bench -- --dataset oracle --topK 5
$ pnpm run qa    -- --dataset oracle

Swap --dataset oracle for --dataset s to run the harder variant. Judging uses GPT-4o via the OpenAI API; set OPENAI_API_KEY before running the QA step.

Raw result files from our runs are checked in under benchmarks/longmemeval/ so you can compare. If you re-run and get different numbers, open an issue — we'd want to know.
