Relay leads retrieval on LongMemEval, the broadest public benchmark for long-context conversational memory. Below are the scores, the methodology, how Relay achieves them, and how to reproduce them.
LongMemEval is an open benchmark from Di Wu et al. (UCLA and Tencent AI Lab), designed to probe whether AI memory systems can actually recall information from long multi-session conversations. It ships 500 hand-crafted questions grounded in synthesized conversation histories that span many sessions and tens of thousands of tokens of context.
Two variants matter for comparing memory systems:
Oracle — the gold-standard sessions relevant to each question are included in the corpus. This isolates how well a system retrieves from a known-good pool. A system that can't hit high recall on Oracle isn't retrieving well at all.
S-variant (Short) — the harder variant. Relevant sessions are mixed into a much larger noise pool of irrelevant chat. A system has to distinguish signal from plausible but wrong context. This is closer to real-world conditions.
Each variant is evaluated two ways: retrieval recall@k (did the right session(s) surface in the top-k results?) and end-to-end QA accuracy (given the retrieved context, did the answering model produce the right answer?). Retrieval is where memory infrastructure competes; QA is where retrieval + generation + context structure all matter.
Benchmark source: github.com/xiaowu0162/LongMemEval.
All numbers are produced by the harness in benchmarks/longmemeval/
in the open source repo. The pipeline is deliberately simple so the scores are easy to
reproduce and audit.
Dataset. The official LongMemEval corpus (500 questions per variant), pulled on demand; not redistributed in the repo.
Ingest. Each conversation session is deposited as a Relay context package
with session metadata, then embedded. We use the Xenova/all-MiniLM-L6-v2 model
(384-dim, ONNX, runs locally) so retrieval is zero-API-cost and reproducible offline.
Retrieval. Hybrid BM25 + semantic similarity via reciprocal rank fusion, followed by cross-encoder reranking. A gradient time-window gives recency weight without hard-clipping older packages. Top-5 results are fed to the QA stage.
QA generation. Claude Opus 4.6 reads the retrieved context and answers the question.
Judging. End-to-end QA accuracy is graded by an independent GPT-4o judge using the standard LongMemEval judge prompt. We do not self-grade. Retrieval recall is computed mechanically against the gold-session labels.
Cost. Retrieval is free (local embeddings). QA costs a single Claude Opus call per question. Judging costs a single GPT-4o call per question. Full eval runs for ~$8 per variant at current API prices.
Oracle isolates retrieval quality. Relay hits ceiling on recall_any@5
— every question in the set has at least one gold session in the top five.
| Metric | Score | Detail |
|---|---|---|
| recall_any@5 | 100.0% | 500 / 500 |
| recall_all@5 | 99.4% | 497 / 500 |
| End-to-end QA accuracy | 92.2% | 461 / 500 |
recall_any@k: at least one gold session in top-k. recall_all@k: all gold sessions in top-k.
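The two retrieval metrics can be sketched in a few lines. This is a minimal illustration of the definitions above, not the harness's actual code; the `QuestionResult` shape and the function names are assumptions made for the example.

```typescript
// Hypothetical per-question record: ranked retrieval output plus gold labels.
interface QuestionResult {
  retrieved: string[]; // ranked session IDs, best first
  gold: string[];      // gold session IDs for this question
}

// True if at least one gold session appears in the top-k results.
function recallAnyAtK(retrieved: string[], gold: string[], k: number): boolean {
  const topK = new Set(retrieved.slice(0, k));
  return gold.some((id) => topK.has(id));
}

// True only if every gold session appears in the top-k results.
function recallAllAtK(retrieved: string[], gold: string[], k: number): boolean {
  const topK = new Set(retrieved.slice(0, k));
  return gold.length > 0 && gold.every((id) => topK.has(id));
}

// Fraction of questions for which the given per-question metric holds.
function hitRate(
  results: QuestionResult[],
  metric: (retrieved: string[], gold: string[], k: number) => boolean,
  k: number,
): number {
  if (results.length === 0) return 0;
  const hits = results.filter((r) => metric(r.retrieved, r.gold, k)).length;
  return hits / results.length;
}
```

Under these definitions recall_all@k can never exceed recall_any@k, which matches the tables: a question with several gold sessions counts toward recall_any as soon as one of them surfaces, but toward recall_all only when all of them do.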
S-variant mixes gold sessions into a larger noise pool. Retrieval has to distinguish plausible-but-wrong context, which tests whether the ranker is actually discriminating or just riding the semantic-similarity floor.
| Metric | Score | Detail |
|---|---|---|
| recall_any@5 | 97.0% | 485 / 500 |
| recall_all@5 | 84.6% | 423 / 500 |
| End-to-end QA accuracy | 84.8% | 424 / 500 |
Published LongMemEval results from other systems, side by side. Where a system publishes a number for a metric, we cite it; where it doesn't, the cell is marked as not reported.
| System | Oracle R@5 | S-variant R@5 | Oracle QA | S-variant QA |
|---|---|---|---|---|
| Relay | 100.0% | 97.0% | 92.2% | 84.8% |
| MemPalace (claimed) | 96.6%* | — | — | — |
| AgentMemory | — | — | 96.2% | — |
| OMEGA | — | — | — | 95.4%† |
Four components, each one load-bearing:
BM25 + semantic embeddings fused by reciprocal rank. BM25 catches exact-keyword queries (names, codenames, IDs) that pure semantic similarity misses; semantic catches paraphrased intent that BM25 misses. Neither alone hits ceiling.
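Reciprocal rank fusion itself is small enough to sketch. The version below is a generic illustration, not Relay's implementation: the function name is made up for the example, and the constant `k = 60` is the value commonly used for RRF, assumed here rather than taken from the harness.

```typescript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank) to a
// document's fused score, where rank is 1-based. Documents that rank well in
// several lists accumulate the highest totals.
// ASSUMPTION: k = 60 is a conventional default, not a value stated for Relay.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(docId, (scores.get(docId) ?? 0) + contribution);
    });
  }
  // Sort by fused score, highest first, and return only the IDs.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}
```

For example, fusing a BM25 ranking `["s1", "s2", "s3"]` with a semantic ranking `["s3", "s1", "s4"]` puts `s1` and `s3` ahead of `s2` and `s4`, because the first two appear in both lists. RRF uses only ranks, so the BM25 and cosine scores never need to be put on a common scale.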
Top-K candidates from the hybrid stage go through a cross-encoder that scores query–document pairs jointly. This is where the S-variant jumps from mid-80s to 97% — the reranker can throw out plausible-but-wrong context the initial retriever accepted.
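The reranking step has a simple shape regardless of which model does the scoring. In the sketch below, `PairScorer` stands in for the cross-encoder, and `termOverlapScorer` is a toy lexical placeholder so the example runs without a model. All names are hypothetical, and a real cross-encoder is a neural model, not a word-overlap count.

```typescript
// A pair scorer sees the query and one candidate together and returns a
// relevance score (higher is better). A cross-encoder fits this signature.
type PairScorer = (query: string, document: string) => number;

interface Candidate {
  id: string;
  text: string;
}

// Re-score every candidate against the query and keep the best topK.
function rerank(
  query: string,
  candidates: Candidate[],
  scorePair: PairScorer,
  topK: number,
): Candidate[] {
  return candidates
    .map((candidate) => ({ candidate, score: scorePair(query, candidate.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((entry) => entry.candidate);
}

// TOY PLACEHOLDER, not a cross-encoder: fraction of query terms that appear
// in the document. It exists only to make the sketch executable.
function termOverlapScorer(query: string, document: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  if (terms.length === 0) return 0;
  const docTerms = new Set(document.toLowerCase().split(/\s+/));
  return terms.filter((t) => docTerms.has(t)).length / terms.length;
}
```

Joint scoring is more expensive than the first-stage lookup, so it is applied only to the candidates that survive the hybrid stage.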
Instead of hard-clipping context to the last N packages, Relay applies a soft gradient — recent packages rank higher, older packages fade in influence but don't disappear. Long-horizon memory without forgetting everything older than a window.
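One way to picture a soft gradient is an exponential decay with a floor. The exact curve Relay uses is not specified here, so treat the function below, its name, the 30-day half-life, and the 0.2 floor as illustrative assumptions.

```typescript
// Soft recency weight: 1 for a brand-new package, halving its decaying part
// every `halfLifeDays`, and never dropping below `floor`.
// ASSUMPTION: the curve shape and both default parameters are hypothetical.
function recencyWeight(
  ageDays: number,
  halfLifeDays: number = 30,
  floor: number = 0.2,
): number {
  const age = Math.max(0, ageDays); // treat future timestamps as "now"
  const decay = Math.pow(0.5, age / halfLifeDays);
  return floor + (1 - floor) * decay;
}

// Scale a relevance score by the recency weight of its package.
function applyRecency(relevance: number, ageDays: number): number {
  return relevance * recencyWeight(ageDays);
}
```

The floor is what separates this from a hard window: a very old package keeps a fraction of its relevance score, so a strong match can still outrank a weak recent one.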
Every package carries decisions, open questions, handoff notes, topic, and artifact type as queryable dimensions. These get indexed alongside the raw text and contribute their own signal to the retriever — which is how high-significance packages win against shallow keyword matches.
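A rough sketch of how structured fields could add signal beyond raw text follows. The `ContextPackage` shape mirrors the dimensions listed above, but the field names, the weights, and the term-matching scorer are all hypothetical and do not describe Relay's actual schema or ranking function.

```typescript
// Hypothetical package shape based on the dimensions named in the text.
interface ContextPackage {
  id: string;
  text: string;
  topic: string;
  artifactType: string;
  decisions: string[];
  openQuestions: string[];
  handoffNotes: string[];
}

// ASSUMPTION: illustrative per-field weights; real values are not published here.
const FIELD_WEIGHTS: Record<string, number> = {
  text: 1.0,
  decisions: 2.0,
  openQuestions: 1.5,
  handoffNotes: 1.5,
  topic: 1.5,
  artifactType: 0.5,
};

// Flatten a package into named text fields that can be indexed separately.
function toIndexFields(pkg: ContextPackage): Record<string, string> {
  return {
    text: pkg.text,
    topic: pkg.topic,
    artifactType: pkg.artifactType,
    decisions: pkg.decisions.join("\n"),
    openQuestions: pkg.openQuestions.join("\n"),
    handoffNotes: pkg.handoffNotes.join("\n"),
  };
}

// Toy lexical signal: weighted count of query terms matched in each field.
function fieldWeightedScore(query: string, pkg: ContextPackage): number {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const fields = toIndexFields(pkg);
  let score = 0;
  for (const name of Object.keys(fields)) {
    const value = fields[name] ?? "";
    const words = new Set(value.toLowerCase().split(/\W+/));
    const matched = terms.filter((t) => words.has(t)).length;
    score += (FIELD_WEIGHTS[name] ?? 1) * matched;
  }
  return score;
}
```

With weights like these, a package that records "migrate to postgres" as a decision scores higher for the query "postgres" than one that only mentions the word in passing chat text.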
The benchmark harness is in benchmarks/longmemeval/ in the open source repo.
It pulls the official LongMemEval corpus, ingests it into a local Relay, runs retrieval and
QA, writes raw .jsonl result files, and prints summary metrics. Retrieval results
are in results-<variant>-k5-<run_id>.json; QA results in
qa-results-<variant>-<run_id>.jsonl.

    # After cloning and building the repo
    $ cd benchmarks/longmemeval
    $ pnpm install
    $ pnpm run fetch:data   # download the public corpus
    $ pnpm run bench -- --dataset oracle --topK 5
    $ pnpm run qa -- --dataset oracle

Swap --dataset oracle for --dataset s to run the harder variant.
Judging uses GPT-4o via the OpenAI API; set OPENAI_API_KEY before running the QA
step.
Raw result files from our runs are checked in under benchmarks/longmemeval/
so you can compare. If you re-run and get different numbers, open an issue — we'd want
to know.