By now you’ve probably seen the tweet. “My friend Milla Jovovich and I spent months building MemPalace with Claude Code. First perfect score on LongMemEval. 5,400 GitHub stars in 24 hours.” Over 1.5 million impressions. Tech Twitter losing its mind. A Hollywood actress who co-starred in Resident Evil has somehow built the highest-scoring AI memory system ever benchmarked.
I’m Jack Arturo. I build AutoMem — a graph-based memory system for AI that runs on FalkorDB + Qdrant, has 9,500+ memories and 70,000+ associations, and took over a year of grinding benchmark work to ship with integrity. So yeah, I have opinions about this.
Here’s the thing: MemPalace is not completely made up. The 100% score, however, absolutely is. Let me walk through both.
The Headline Number Is Engineered, Not Earned
LongMemEval is a real benchmark from UC Santa Barbara — 500 questions across five categories: single-session fact recall, multi-session fact recall, temporal reasoning, multi-hop queries, and knowledge updates. The primary metric is R@5: does the correct memory appear in the top 5 retrieved results?
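If R@5 feels abstract, here is a minimal sketch of the hit-or-miss reading described above. The function names and the toy data are mine, not the benchmark's official scoring code, which may differ in detail.

```python
def recall_at_k(ranked_ids, gold_ids, k=5):
    """Return 1.0 if any gold memory id appears in the top-k results, else 0.0."""
    top_k = ranked_ids[:k]
    return 1.0 if any(g in top_k for g in gold_ids) else 0.0


def mean_recall_at_k(runs, k=5):
    """Average the per-question hit/miss over (ranked_ids, gold_ids) pairs."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Hypothetical toy data: three questions, each with a ranked result list
# and the set of memory ids that count as correct.
runs = [
    (["m3", "m7", "m1", "m9", "m2"], {"m1"}),        # gold at rank 3: hit
    (["m4", "m5", "m6", "m8", "m0", "m2"], {"m2"}),  # gold at rank 6: miss
    (["m2", "m1"], {"m2"}),                          # gold at rank 1: hit
]
score = mean_recall_at_k(runs, k=5)  # two hits out of three questions
```

The point of the cutoff is that rank matters: the second question has the right memory in its results, but not in the top 5, so it counts as a miss.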
MemPalace’s raw score — zero API calls, pure local vector search — is 96.6%. That’s 483 out of 500 questions. For a local-only system, that’s genuinely impressive. That’s the real number. That’s the number they should have led with.
Instead, they led with 100%.
Here’s how they got there: they looked at the 17 questions they got wrong. Then they engineered targeted patches for those specific questions. Then they re-ran the test on the same 500 questions and announced a perfect score.
In academic circles, this has a name: teaching to the test. In machine learning terms, it’s overfitting to the test set. You don’t get to identify which questions you failed, fix your system for exactly those questions, and then report your score on the same exam as a breakthrough result. That’s not a benchmark. That’s a rehearsal.
To their credit, they created a held-out test set and disclosed the patching process. The held-out score is 98.4%, which is still excellent and, alongside the raw 96.6%, is one of the few honest numbers in their marketing. They buried it in a footnote. The homepage says 100%.
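For anyone who hasn't set one up, a held-out split is not complicated. The sketch below is my own illustration, not MemPalace's procedure: the 20% fraction, the seed, and the function names are all assumptions. The rule it encodes is that you look at failures and patch only on the dev portion, then report the number from the portion you never inspected.

```python
import random


def split_questions(question_ids, held_out_fraction=0.2, seed=0):
    """Deterministically split question ids into (dev, held_out) lists."""
    ids = sorted(question_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = int(len(ids) * held_out_fraction)
    return ids[cut:], ids[:cut]


def accuracy(results, ids):
    """Fraction of the given question ids marked correct in results."""
    return sum(1 for q in ids if results[q]) / len(ids)


# Hypothetical ids standing in for a 500-question benchmark.
question_ids = [f"q{i}" for i in range(500)]
dev_ids, held_out_ids = split_questions(question_ids)

# Inspect failures and tune on dev_ids only; report accuracy on held_out_ids.
```

Fixing the seed matters: if the split changes between runs, questions you already patched for can leak into the held-out side.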
The LoCoMo Score Is Worse
MemPalace claims 88.9% on LoCoMo, with a note that they hit 100% with reranking. Here’s the problem: their LoCoMo evaluation used top_k=50.
The LoCoMo candidate pool has a maximum of 32 items.
When your top_k exceeds the pool size, you retrieve everything. You’re not testing retrieval at all. You’re testing whether Claude can find an answer when handed the entire document. That’s reading comprehension. LoCoMo exists specifically to test retrieval quality — the ability to find the right thing from a large corpus. If you bypass retrieval, the score is meaningless.
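You can see the problem in a few lines. This is a toy sketch, not the LoCoMo harness: the item names, the gold item, and the random scorer are hypothetical. Only the pool size of 32 and top_k=50 come from the numbers above.

```python
import random


def retrieve(pool, scores, top_k):
    """Return the top_k candidates by score, highest first."""
    ranked = sorted(pool, key=lambda item: scores[item], reverse=True)
    return ranked[:top_k]


rng = random.Random(0)
pool = [f"session_{i}" for i in range(32)]  # maximum pool size cited above
gold = "session_17"                         # hypothetical correct item

# A "retriever" with no signal at all: every score is random noise.
random_scores = {item: rng.random() for item in pool}

results_at_50 = retrieve(pool, random_scores, top_k=50)
hit_at_50 = gold in results_at_50     # always True: the whole pool comes back
returned_at_50 = len(results_at_50)   # 32, not 50

# With a real cutoff, the same noise retriever hits only by luck.
chance_at_5 = 5 / len(pool)
```

A retriever that scores items at random still "finds" the gold item every time at top_k=50, because slicing 50 items from a list of 32 returns all 32. At top_k=5 the same retriever would succeed about 16% of the time.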
Their own BENCHMARKS.md acknowledges this as a known limitation, tucked quietly into the methodology notes. The marketing headline does not.
“Lossless” Compression That Loses 12 Points
MemPalace ships a custom compression format called AAAK — approximately 30x compression, marketed as “lossless.” The lossless claim is technically accurate in the sense that a human (or LLM) can read the compressed output and reconstruct the original meaning. No decoder needed.
But when you actually run LongMemEval with AAAK compression enabled, accuracy drops from 96.6% to 84.2%. That’s a 12.4-percentage-point hit, because the compressed text changes the vector embeddings enough to degrade search quality.
Calling this “lossless” without disclosing that it’s lossy for machine retrieval — the entire purpose of the system — is a meaningful misrepresentation. It’s lossless for your eyes. It’s lossy for the thing that actually needs to read it.
(To be fair: 84.2% still matches Mem0 at full accuracy, while using 30x less storage and zero API costs. The compression tradeoff might be worth it for many users. But “lossless” should come with an asterisk.)
The README Describes Features That Don’t Exist
GitHub Issue #27, filed by developer Leonard Lin, documents a recurring problem: the README describes features that aren’t in the codebase. Most notably, the README describes automatic contradiction detection. A code search finds zero occurrences of the word “contradict” in the source.
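This kind of check is easy to reproduce on any repo. Here is a minimal sketch of a source-tree term search. The function name and the default of scanning only .py files are my assumptions, and it is not the method used in the issue.

```python
from pathlib import Path


def count_occurrences(root, term, suffixes=(".py",)):
    """Count case-insensitive occurrences of term in source files under root."""
    term = term.lower()
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
            total += text.count(term)
    return total
```

Point it at a local checkout, for example count_occurrences("path/to/checkout", "contradict"). A result of zero doesn't prove a feature is absent under another name, but it is a strong hint when the README uses that exact word.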
README drift is real — fast-moving open source repos often have docs that lead or lag the code. But when your README is also your marketing page, and the “features” it describes are the ones driving 30K GitHub stars, that’s not documentation lag. That’s a pitch deck masquerading as documentation.
The Architecture Is Actually Interesting (This Is the Annoying Part)
Here’s what makes this frustrating: underneath all the hype, MemPalace has a genuinely interesting architectural idea.
Most memory systems — Mem0, Zep, and others — use an LLM to decide what’s worth remembering. They extract “facts” and discard the original conversation. The problem is that LLMs make bad decisions about what matters. They keep conclusions and throw away reasoning. They discard context. They summarize nuance out of existence.
MemPalace’s approach: store everything verbatim, then use vector search to find it. Don’t let the AI decide what to forget. The “memory palace” metaphor is actually apt — Wings for projects, Rooms for topics, Drawers for verbatim source text that never gets deleted.
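To make the idea concrete, here is a toy sketch of verbatim, append-only storage. To be clear about what is assumed: the class names, fields, and sample text are hypothetical and are not MemPalace's code, and I've swapped in plain word overlap for vector search so the example runs without an embedding model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Drawer:
    wing: str  # project
    room: str  # topic
    text: str  # verbatim source text, never summarized


@dataclass
class Palace:
    drawers: list = field(default_factory=list)

    def store(self, wing, room, text):
        """Append-only: the text is kept exactly as given."""
        self.drawers.append(Drawer(wing, room, text))

    def search(self, query, top_k=5):
        """Rank drawers by word overlap with the query (stand-in for vector search)."""
        q = set(query.lower().split())
        scored = [(len(q & set(d.text.lower().split())), d) for d in self.drawers]
        scored = [(s, d) for s, d in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [d for _, d in scored[:top_k]]


palace = Palace()
palace.store("project-a", "database", "We chose Qdrant because the filter API was simpler")
palace.store("project-a", "deploys", "Deploy runs every Friday after review")
hits = palace.search("why did we choose Qdrant")
```

What comes back is the original sentence, reasoning included ("because the filter API was simpler"), rather than an extracted fact like "uses Qdrant". There is no delete method and no summarization step, which is the whole design.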
The 96.6% raw score is evidence that this works. Verbatim storage outperforms summary-based storage for retrieval tasks because you preserve the original signal. The data backs up the architecture.
It’s a legitimate contribution. It’s just wrapped in marketing that overstates every number by exactly enough to generate viral tweets.
Why I Actually Care About This
I’m not writing this to dunk on Milla Jovovich or Ben Sigman. The 30K GitHub stars are real. People are clearly interested in local-first AI memory that doesn’t phone home to a paid API. That’s a legitimate use case.
I’m writing this because benchmark inflation is corrosive. When a project claims 100% on LongMemEval, every other memory system gets measured against that number. Users ask why AutoMem “only” posts 94% on a clean run. Investors use the inflated number in competitive analysis. The whole field’s standards drift toward whoever is willing to be the least honest about their methodology.
AutoMem’s benchmark runs take 8+ hours each. We run them multiple times — baseline, post-change, regression verification. We don’t publish until the numbers hold under adversarial conditions, distribution shift, and cold data. That’s why AutoMem updates take months. Not because we’re slow. Because we’re not willing to identify the 17 questions we got wrong and patch for them before announcing a perfect score.
MemPalace ships fast because they’re not running those tests. They can’t be — if they were, they’d know their 100% doesn’t hold.
The Actual Verdict
What’s real: The 96.6% raw local score. The verbatim storage architecture. The held-out 98.4%. The MIT license. The fact that it beats Mem0 and Zep on fair comparisons.
What’s marketing: The 100% headline (taught to the test). The LoCoMo score (top_k larger than the candidate pool). The “lossless” compression (lossy for retrieval). The contradiction detection feature (doesn’t exist in the code). The celebrity PR strategy (Milla Jovovich is not the reason this is interesting).
What it’s missing entirely: Multi-hop entity reasoning. Graph consistency at scale. Temporal validity management. Latency guarantees under real production load. These are the hard problems in AI memory. MemPalace’s flat-file + ChromaDB architecture doesn’t touch them.
If you want a local memory system that’s free, stores everything verbatim, and scores legitimately well on single-session and multi-session recall — MemPalace is worth trying. The core idea is sound.
If someone tells you it “broke every benchmark” and has “the first perfect score ever” — ask them which questions they patched before the final run.
Jack Arturo is the founder of Very Good Plugins and the creator of AutoMem, an open-source graph-based memory system for AI agents.