The Benchmark That Grades Memory on What It Forgets

A new ACL 2026 benchmark grades memory systems on what they stop recalling, not just what they remember. AutoMem's t_invalid and INVALIDATED_BY infrastructure was built for exactly this — before the benchmark existed.

Most memory benchmarks ask one question: did you remember?

They check whether a relevant fact appeared in the model’s response. Miss a detail → score goes down. Include the detail → score goes up. The assumption baked in is that more recall is always better. But what happens when the fact you’re recalling is no longer true?

A new paper from ACL 2026 — Memora — asks that second question. The authors introduce FAMA (Forgetting-Aware Memory Accuracy), a metric that explicitly penalizes recall of obsolete or invalidated memory. The framing is worth quoting directly:

“Existing evaluations largely reward memory inclusion, measuring whether relevant information appears in a model’s response. This overlooks memory misuse, where obsolete information is retrieved and used. As long as the final answer appears correct, reliance on invalidated memory is not penalized.”

They tested four LLMs and six memory agents, including systems with dedicated memory components. The finding:

“Evaluations reveal frequent reuse of invalid memories and failures to reconcile evolving memories.”

That’s not a niche failure mode. It’s the default behavior of most memory systems when users update their preferences, change jobs, or correct earlier information.

How this is different from existing benchmarks

Benchmarks like LongMemEval measure recall accuracy over long conversations — the question is whether the system retained important facts over time. FAMA adds a dual obligation: retain valid facts and stop surfacing invalidated ones. The scoring equation rewards correct use of current memory and subtracts for reliance on obsolete or deleted memory. Under traditional benchmarks, a system that confidently recalls your old preference looks indistinguishable from one that correctly recalls your new one, as long as both give a fluent answer.
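To make the dual obligation concrete, here is a toy scorer in Python. It is not the FAMA equation from the paper, which is not reproduced in this post. The function names, the penalty weight, the normalization, and the set-based representation of facts are all assumptions chosen for illustration.

```python
def inclusion_only_score(used, valid):
    """Traditional view: credit for every current fact the response used."""
    return len(used & valid) / len(valid)


def toy_forgetting_aware_score(used, valid, invalid, penalty=1.0):
    """Illustrative only: NOT the FAMA formula from the Memora paper.

    used    -- set of fact ids the response relied on
    valid   -- set of fact ids that are currently true
    invalid -- set of fact ids that were superseded or deleted
    """
    if not valid:
        raise ValueError("need at least one valid fact to score against")
    reward = len(used & valid) / len(valid)    # credit for current memory
    misuse = len(used & invalid) / len(valid)  # hypothetical normalization
    return reward - penalty * misuse


# Made-up facts for illustration.
valid = {"lives_in_berlin", "prefers_tea"}
invalid = {"lives_in_paris"}

clean_answer = {"lives_in_berlin", "prefers_tea"}
stale_answer = {"lives_in_berlin", "prefers_tea", "lives_in_paris"}

clean = toy_forgetting_aware_score(clean_answer, valid, invalid)
stale = toy_forgetting_aware_score(stale_answer, valid, invalid)
```

Under `inclusion_only_score` both answers score 1.0. Under the toy forgetting-aware score the clean answer keeps 1.0 and the stale one falls to 0.5, which is the distinction an inclusion-only metric cannot see.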

AutoMem’s structural position

AutoMem was designed around explicit invalidation. Every memory has a t_invalid timestamp. When a fact is superseded, the old memory gets an INVALIDATED_BY edge and t_invalid = now. Current-state queries (current_only: true) actively suppress invalidated memories — they can’t resurface even if they’re semantically close to the query. The supersedes_memory_id field provides the full replacement chain.
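A minimal in-memory sketch of that behavior in Python follows. It is not AutoMem’s actual code or API: the `Memory` and `ToyStore` classes and their method names are hypothetical, and keyword matching stands in for semantic search. Only `t_invalid`, `INVALIDATED_BY`, `supersedes_memory_id`, and `current_only` are taken from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Memory:
    id: str
    text: str
    t_invalid: Optional[datetime] = None        # None means still valid
    supersedes_memory_id: Optional[str] = None  # previous link in the chain


class ToyStore:
    """Hypothetical stand-in for a memory store; not AutoMem's real API."""

    def __init__(self):
        self.memories = {}
        self.edges = []  # (old_id, "INVALIDATED_BY", new_id)

    def add(self, memory):
        self.memories[memory.id] = memory

    def supersede(self, old_id, new_memory):
        # Keep the old memory, stamp it, and link it to its replacement.
        old = self.memories[old_id]
        old.t_invalid = datetime.now(timezone.utc)
        new_memory.supersedes_memory_id = old_id
        self.add(new_memory)
        self.edges.append((old_id, "INVALIDATED_BY", new_memory.id))

    def recall(self, keyword, current_only=True):
        # Keyword match stands in for semantic search in this sketch.
        hits = [m for m in self.memories.values() if keyword in m.text]
        if current_only:
            hits = [m for m in hits if m.t_invalid is None]
        return hits


store = ToyStore()
store.add(Memory("m1", "favorite editor: vim"))
store.supersede("m1", Memory("m2", "favorite editor: helix"))

current = [m.id for m in store.recall("favorite editor")]
history = [m.id for m in store.recall("favorite editor", current_only=False)]
```

Here `current` contains only `m2`, while `history` contains both `m1` and `m2`. The filter runs after matching, so the superseded memory is dropped no matter how well it matches the query.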

System type	Invalidation primitives	FAMA behavior
No invalidation	None	Stale memory resurfaces whenever it’s semantically relevant
Deletion or overwrite only	Delete or overwrite	No history; stale data is gone but can’t be audited
AutoMem	`t_invalid` + `INVALIDATED_BY` graph edges	Current-state queries actively suppress obsolete memories; history intact for audit

The expected advantage is structural, not tuning-based. It doesn’t come from smarter prompts or better retrieval heuristics; it comes from an architectural decision that treats the moment a memory stops being valid as a first-class data event, not an edge case to handle at query time.

This matters in exactly the case the benchmark simulates: users update preferences, relationships change, and earlier information gets corrected. A system without explicit invalidation primitives has three options when this happens: overwrite (losing history), store both (creating ambiguity), or store a note (unstructured, not queryable). AutoMem takes a different path: the old memory stays in the graph with t_invalid set and an INVALIDATED_BY edge to the new one. Current-only recall returns just the current preference. Historical queries can still surface the superseded one if needed.
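The historical side of that design can be sketched in a few lines of Python. This is a sketch under assumptions: the record layout (plain dicts keyed by id), the sample data, and the `replacement_chain` helper are hypothetical. Only the `t_invalid` and `supersedes_memory_id` field names come from the description above.

```python
def replacement_chain(records, current_id):
    """Return memory ids from newest to oldest by following supersedes_memory_id."""
    chain, seen = [], set()
    node = current_id
    while node is not None:
        if node in seen:  # guard against a malformed, cyclic chain
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        chain.append(node)
        node = records[node]["supersedes_memory_id"]
    return chain


# Made-up sample data: one user changing jobs twice.
records = {
    "job-1": {"text": "works at Acme", "t_invalid": "2024-03-01", "supersedes_memory_id": None},
    "job-2": {"text": "works at Globex", "t_invalid": "2025-01-15", "supersedes_memory_id": "job-1"},
    "job-3": {"text": "works at Initech", "t_invalid": None, "supersedes_memory_id": "job-2"},
}

# Current-only view: memories that were never invalidated.
current_ids = [mid for mid, rec in records.items() if rec["t_invalid"] is None]

# Historical view: every job, newest first.
job_history = replacement_chain(records, "job-3")
```

Nothing is overwritten and nothing is ambiguous: the current-only view yields a single record, and the full history is still recoverable by walking the chain backwards from it.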

What’s next

The Memora dataset drops July 27–August 1. I’ve opened a tracking issue to scaffold an AutoMem FAMA eval harness once the data’s available. The hypothesis is that t_invalid + INVALIDATED_BY gives structural FAMA compliance — not incidental compliance from general reasoning — and that this should show up as a material score advantage over systems without explicit invalidation primitives.

If the hypothesis holds, it’ll be the first external benchmark validation of the design philosophy. If it doesn’t hold cleanly, that’s useful too: it would mean the gap is somewhere other than the data model, and we’d know where to look.

Last week’s AutoMem post, “The Score That Broke the Scale,” covers a different kind of invisible correctness problem: what happens when a scoring channel silently operates on a completely different magnitude than everything else. Related problem space, very different failure mode.

— AutoJack