Dec 2025 — AutoJack
The TL;DR
We hit 90.53% on LoCoMo, state-of-the-art for conversational memory. The previous best was 88.24%, by CORE. That's a 2.29-point improvement on a benchmark where gains are usually measured in fractions of a point.
The secret sauce: entity-to-entity expansion for multi-hop reasoning. Plus we unified the MCP API while we were at it.
What’s LoCoMo?
LoCoMo (Long-term Conversational Memory) tests how well AI systems remember things across extended conversations. Five categories:
| Category | What It Tests | Our Score |
|---|---|---|
| Single-hop Recall | “What did I say about X?” | 81.50% |
| Temporal Understanding | “When did that happen?” | 88.57% |
| Multi-hop Reasoning | “Given X and Y, what’s Z?” | 50.00% |
| Open Domain | General knowledge recall | 93.02% |
| Complex Reasoning | Multi-step inference | 100.00% |
The hard one is multi-hop reasoning. Questions like: “Based on what I told you about Sarah’s job and her commute preferences, what kind of car should she buy?”
Vector search alone can’t handle this. You need to connect information across multiple memories.
The Problem With Multi-Hop
When someone asks:
“What’s the relationship between Jack’s career and his college major?”
A naive vector search for “Jack career college major” might miss the right memories because:
- The memory about Jack’s career says “software engineer at a startup”
- The memory about his college says “studied computer science at Berkeley”
- Neither memory contains the words “career” or “major”
Semantically related, but the query doesn’t match either one well enough.
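A toy Python illustration of the mismatch, using the memory text from the example above: the query shares almost no tokens with either memory, so lexical or shallow-semantic matching under-ranks both.

```python
# Token overlap between the multi-hop query and each memory.
query = set("what is the relationship between jack's career and his college major".split())
career_memory = set("jack is a software engineer at a startup".split())
college_memory = set("jack studied computer science at berkeley".split())

print(query & career_memory)   # only a stopword overlaps
print(query & college_memory)  # nothing overlaps at all
```

Note that even "jack's" and "jack" fail to match as raw tokens. Real vector search is better than this, but the underlying failure mode is the same: neither memory looks much like the query.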
The Fix: Entity-to-Entity Expansion
We built entity expansion. Here’s how it works:
- Initial search finds memories mentioning relevant entities (Jack, career, etc.)
- Entity extraction pulls out all entities from those memories (names, places, concepts)
- Expansion search finds other memories tagged with those entities
- Deduplication and scoring remove repeated memories and rank the combined set
So if the first search finds “Jack is a software engineer,” the system sees the `entity:people:jack` tag and hunts for all memories about Jack, including the college one.
The implementation:
```python
# In automem/api/recall.py
if expand_entities:
    entity_expansion_results = _expand_entity_memories(
        seed_results=seed_results,
        seen_ids=seen_ids,
        limit_per_entity=5,
        total_limit=expansion_limit,
    )
    # expansion_results comes from the earlier (keyword) expansion pass
    results = seed_results + expansion_results + entity_expansion_results
```
This bumped multi-hop from 37.5% → 50%, a 33% relative improvement in the hardest category.
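The four steps above can be sketched end to end. This is a hedged, self-contained version of the expansion helper against a toy in-memory store; the real `_expand_entity_memories` queries AutoMem's vector/graph backend, and `search_by_entity_tag` is invented here for illustration.

```python
# Toy memory store; real entries live in the vector/graph backend.
MEMORY_STORE = [
    {"id": 1, "text": "Jack is a software engineer at a startup",
     "entities": ["entity:people:jack", "entity:orgs:startup"]},
    {"id": 2, "text": "Jack studied computer science at Berkeley",
     "entities": ["entity:people:jack", "entity:places:berkeley"]},
    {"id": 3, "text": "Sarah prefers a short commute",
     "entities": ["entity:people:sarah"]},
]

def search_by_entity_tag(tag, limit):
    # Hypothetical tag lookup; stands in for the real entity index.
    return [m for m in MEMORY_STORE if tag in m["entities"]][:limit]

def _expand_entity_memories(seed_results, seen_ids, limit_per_entity, total_limit):
    """Follow entity tags on the seed memories out to other memories."""
    expanded = []
    for memory in seed_results:
        for entity in memory.get("entities", []):
            for hit in search_by_entity_tag(entity, limit_per_entity):
                if hit["id"] in seen_ids:
                    continue  # dedupe against seeds and earlier hits
                seen_ids.add(hit["id"])
                expanded.append(hit)
                if len(expanded) >= total_limit:
                    return expanded
    return expanded

# A seed search for "Jack career" finds memory 1; expansion via the
# entity:people:jack tag then surfaces memory 2 (the college one).
seed = [MEMORY_STORE[0]]
extra = _expand_entity_memories(seed, {1}, limit_per_entity=5, total_limit=10)
print([m["id"] for m in extra])  # [2]
```

The `seen_ids` set is shared with the seed pass, which is what keeps the final merge duplicate-free.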
Enabling Entity Expansion
Opt-in parameter on the /recall endpoint:
```bash
curl "https://your-automem.railway.app/recall?query=Jack%27s%20career&expand_entities=true"
```
Or in the MCP tool:
```js
mcp_memory_recall_memory({
  query: "Jack's career and education",
  expand_entities: true
})
```
Latency overhead: ~50-100ms when it finds entities to expand. Fast enough for production.
API Simplification (Bonus)
While we were in there, we cleaned up the MCP server.
Before: Two separate tools—recall_memory and recall_memory_multi
After: One unified recall_memory that handles both
```js
// Single query (still works)
mcp_memory_recall_memory({
  query: "authentication patterns",
  limit: 5
})

// Multiple queries (now the same tool!)
mcp_memory_recall_memory({
  queries: ["auth patterns", "JWT implementation", "login flow"],
  limit: 10
})
```
Server-side deduplication handles overlapping results. One tool, fewer things to remember, LLMs are happier.
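A sketch of what that server-side dedup can look like; the function name and result shape are illustrative, not the actual mcp-automem internals. Each memory is kept once, at its best score across queries, then the merged set is re-ranked.

```python
# Merge per-query result lists, keeping each memory id once at its highest score.
def merge_results(per_query_results, limit):
    best = {}
    for results in per_query_results:
        for r in results:
            prev = best.get(r["id"])
            if prev is None or r["score"] > prev["score"]:
                best[r["id"]] = r
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)[:limit]

# Three queries return overlapping hits; id 2 appears twice with different scores.
merged = merge_results([
    [{"id": 1, "score": 0.91}, {"id": 2, "score": 0.80}],
    [{"id": 2, "score": 0.88}, {"id": 3, "score": 0.75}],
    [{"id": 1, "score": 0.60}],
], limit=10)
print([(r["id"], r["score"]) for r in merged])  # best score wins per id
```

Keeping the highest score per id (rather than, say, summing) means a memory that matches several queries isn't artificially boosted above a single strong match.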
Shipped in @verygoodplugins/mcp-automem v0.7.0.
The Scorecard
| Category | Nov 20 | Dec 2 | Change |
|---|---|---|---|
| Single-hop | 81.21% | 81.50% | +0.29% |
| Temporal | 84.74% | 88.57% | +3.83% |
| Multi-hop | 48.96% | 50.00% | +1.04% |
| Open Domain | 95.24% | 93.02% | -2.22% |
| Complex | 100% | 100% | — |
| Overall | 90.38% | 90.53% | +0.15% |
Temporal understanding improved. Complex reasoning held steady. Open domain dropped slightly, likely noise given the small sample size.
What’s Next
Multi-hop at 50% is still the weak link. Exploring:
- LLM-based answer verification for complex inferential questions
- Graph traversal (FalkorDB has some quirks we’re debugging)
- Better keyword expansion for low word-overlap scenarios
But 90.53% overall is solid for a $5/month system. Research labs with million-dollar compute budgets are scoring lower.
Try It
AutoMem is open source. Entity expansion is live in v0.9.1.
Install:
```bash
npx @verygoodplugins/mcp-automem cursor
```
Repo: github.com/verygoodplugins/automem
Cost: $5/month on Railway
The Takeaway
State-of-the-art doesn’t require a research lab. It requires:
- Reading the papers
- Understanding the problem
- Actually building something
- Iterating until it works
Claude (Opus 4.5) did the heavy lifting on implementation. I pushed back on the dumb ideas and asked “but does it actually work?” a lot.
That’s the collaboration model that gets things shipped.
– AutoJack
Releases:
- AutoMem v0.9.1: Entity expansion, benchmark improvements
- mcp-automem v0.7.0: Unified `recall_memory` tool