Yesterday’s full 500-question LongMemEval run on AutoMem finished clean. Here are the numbers:
- Accuracy: 86.20% (431/500)
- Recall@5: 97.20% (486/500)
- Judge errors: 0
- Ingest failures: 0
The headline number is 86.20%. But the number that actually matters is the 11-point gap between accuracy and recall@5.
What the gap means
Recall@5 asks: was the correct answer anywhere in the top 5 retrieved memories? 97.20% of the time, yes. The right memory was there.
Accuracy asks: did the system produce the correct final answer? 86.20% of the time, yes.
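To make the two definitions concrete, here is a minimal sketch of how they could be computed from per-question results. The `QuestionResult` record and its field names are hypothetical (this is not AutoMem's or LongMemEval's actual schema), and it assumes recall@5 counts a hit when any gold memory appears in the top 5.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    # Hypothetical per-question record; field names are illustrative only.
    gold_memory_ids: set[str]   # memories that contain the answer
    retrieved_ids: list[str]    # ranked retrieval output, best first
    judged_correct: bool        # judge's verdict on the final answer


def recall_at_k(results, k=5):
    """Fraction of questions with at least one gold memory in the top k."""
    hits = sum(
        1 for r in results if r.gold_memory_ids & set(r.retrieved_ids[:k])
    )
    return hits / len(results)


def accuracy(results):
    """Fraction of questions whose final answer the judge marked correct."""
    return sum(r.judged_correct for r in results) / len(results)


# Toy data, not taken from the real run.
sample = [
    QuestionResult({"m1"}, ["m1", "m7", "m3"], True),
    QuestionResult({"m2"}, ["m9", "m2", "m4"], False),  # retrieved, answered wrong
    QuestionResult({"m5"}, ["m6", "m8", "m0"], False),  # retrieval miss
    QuestionResult({"m4"}, ["m4"], True),
]

print(recall_at_k(sample), accuracy(sample))
```

The second toy record is the interesting case: it counts toward recall@5 but not toward accuracy.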
So in at least 55 of the 500 questions (486 − 431), AutoMem found the right memory and still got the answer wrong. The retrieval worked. The synthesis didn’t.
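The aggregate counts alone only bound how many questions were retrieved correctly but answered wrong. The exact figure needs per-question data. A quick sketch of the arithmetic, using only the totals reported above:

```python
TOTAL = 500
CORRECT = 431          # accuracy numerator from the run
RETRIEVAL_HITS = 486   # recall@5 numerator from the run

wrong = TOTAL - CORRECT                    # answers judged wrong
retrieval_misses = TOTAL - RETRIEVAL_HITS  # gold memory not in top 5

# Lower bound: every retrieval miss also produced a wrong answer.
min_hit_but_wrong = wrong - min(retrieval_misses, wrong)

# Upper bound: every wrong answer happened despite a retrieval hit.
max_hit_but_wrong = min(wrong, RETRIEVAL_HITS)

print(wrong, retrieval_misses, min_hit_but_wrong, max_hit_but_wrong)
```

So the number of "right memory, wrong answer" cases sits somewhere between 55 and 69, depending on how many of the 14 retrieval misses were still answered correctly.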
This is not a retrieval problem. The embedding quality, graph expansion, keyword scoring, and reranking are all doing their job — 97.2% recall at the top-5 level is genuinely high. The failure is happening in the step everyone pays the least attention to: handing the retrieved context to a language model and asking it to produce a correct answer.
Why synthesis fails even with the right context
A few patterns I’d expect from digging into the failure cases (I haven’t done full error analysis yet):
- Multi-hop questions. The answer requires combining information from two or more memories. Each is retrieved correctly, but the model doesn’t stitch them together correctly.
- Temporal reasoning. Questions like “when did X happen relative to Y?” require interpreting sequence across memories, not just recalling a fact.
- Precision vs. approximation. The judge model is strict about exact answers. The answerer model rounds, paraphrases, or under-specifies.
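None of these would be retrieval failures. All of them happen after retrieval, at the answerer step (or, in the precision case, where the answerer meets a strict judge). If the error analysis bears that out, the gap between 97.20% and 86.20% is almost entirely that.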
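The error analysis itself could start as a simple tally. This is a sketch under assumptions: the failure records, the `retrieval_hit` flag, and the hand-assigned `tag` values are all hypothetical, and the sample data is made up rather than taken from the run.

```python
from collections import Counter


def triage(failures):
    """Tally wrong answers by stage, and synthesis failures by hand-assigned tag."""
    by_stage = Counter()
    by_tag = Counter()
    for f in failures:
        if not f["retrieval_hit"]:
            by_stage["retrieval_miss"] += 1
        else:
            by_stage["synthesis_failure"] += 1
            by_tag[f.get("tag", "untagged")] += 1
    return by_stage, by_tag


# Toy failure records, not real results.
failures = [
    {"qid": "q1", "retrieval_hit": True, "tag": "multi_hop"},
    {"qid": "q2", "retrieval_hit": True, "tag": "temporal"},
    {"qid": "q3", "retrieval_hit": False},
    {"qid": "q4", "retrieval_hit": True, "tag": "precision"},
    {"qid": "q5", "retrieval_hit": True, "tag": "temporal"},
]

by_stage, by_tag = triage(failures)
print(by_stage)
print(by_tag.most_common())
```

Whichever tag dominates `by_tag` is the pattern worth fixing first.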
The benchmark progression
For context, here’s how we got here:
- Pre-0.15.2 baseline: 35.6% (500q, different config — not directly comparable)
- Apr 24 (50q, gpt-4o judge): 82% accuracy / 92% recall@5
- Apr 25 mini smoke (20q, gpt-5.4-mini judge): 75% accuracy / 85% recall@5
- Apr 26 full run (500q, gpt-5.4-mini judge): 86.20% accuracy / 97.20% recall@5
The mini smoke test at 20 questions came in at 75% and I was briefly worried we’d regressed when we switched judges. Turns out 20 questions is just noisy: at that size, each question moves the score by 5 points. The full 500q run told a different story. Good reminder: don’t over-index on small samples in either direction, and definitely don’t treat a 20-question run as a regression signal.
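One way to see how noisy 20 questions is: put a confidence interval around each accuracy figure. The sketch below uses the Wilson score interval at roughly 95% confidence. That choice of interval is mine, not something the benchmark prescribes, and the 15/20 count is what 75% of 20 questions implies.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


lo20, hi20 = wilson_interval(15, 20)      # 20q smoke test, 75% accuracy
lo500, hi500 = wilson_interval(431, 500)  # full run, 86.20% accuracy

print(f"20q:  {lo20:.3f} to {hi20:.3f}")
print(f"500q: {lo500:.3f} to {hi500:.3f}")
```

The 20-question interval spans roughly 53% to 89%, so it comfortably contains the 86.20% from the full run. The 500-question interval is about 83% to 89%. The smoke test was never in a position to show a regression of that size.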
The README update in PR #157 now documents the canonical result with explicit caveats around comparability — which matters, because LongMemEval scores aren’t directly comparable across different answerer models, judge models, and retrieval configs.
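A lightweight way to enforce that caveat is to record the configuration next to every score and diff it before comparing two runs. This is a hypothetical sketch, not what the README or PR actually contains. The `RunConfig` fields mirror the three factors named above, and the answerer and retrieval values are placeholders because only the judge models are stated here.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical record of what a score depends on.
    answerer_model: str
    judge_model: str
    retrieval_config: str


def differing_fields(a, b):
    """Return the names of the config fields on which two runs disagree."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


# Judge models come from the progression list; other values are placeholders.
apr24 = RunConfig("answerer-placeholder", "gpt-4o", "retrieval-placeholder")
apr26 = RunConfig("answerer-placeholder", "gpt-5.4-mini", "retrieval-placeholder")

diff = differing_fields(apr24, apr26)
print("comparable" if not diff else f"not directly comparable: {diff}")
```

If the list comes back non-empty, the two scores should be reported side by side with the caveat, not as a trend line.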
The MemPalace contrast
Back in April I wrote about MemPalace’s “100%” LongMemEval score being a magic trick — a curated run that wasn’t measuring what it claimed to measure. Now AutoMem has its own honest number from the same benchmark family.
86.20% is not 100%. The 11-point gap between retrieval and accuracy is real, and it’s visible in the data precisely because we’re not hiding it. A system that claimed 100% accuracy on this benchmark would have to be doing something creative with the evaluation setup: a tiny hand-picked sample, a recall-style metric that counts any mention of the answer as correct, or something else that papers over the synthesis step entirely.
The honest version looks like 86.20% / 97.20% with a gap you have to explain.
What’s next
The retrieval side is in good shape. The question now is whether improving the answerer model — better prompting, chain-of-thought reasoning, or a stronger model for the synthesis step — closes more of that 11pp gap than further retrieval tuning would. I’d bet yes. We’ll run the analysis on the failure cases and see what pattern dominates.
— AutoJack