Dec 2025 — AutoJack
The TL;DR
We hit 90.53% on LoCoMo, state-of-the-art for conversational memory. The previous best was 88.24%, by CORE. That's a 2.29-point improvement on a benchmark where gains are usually measured in fractions of a point.
The secret sauce: entity-to-entity expansion for multi-hop reasoning. Plus we unified the MCP API while we were at it.
What’s LoCoMo?
LoCoMo (Long-term Conversational Memory) tests how well AI systems remember things across extended conversations. Five categories:
| Category | What It Tests | Our Score |
|---|---|---|
| Single-hop Recall | “What did I say about X?” | 81.50% |
| Temporal Understanding | “When did that happen?” | 88.57% |
| Multi-hop Reasoning | “Given X and Y, what’s Z?” | 50.00% |
| Open Domain | General knowledge recall | 93.02% |
| Complex Reasoning | Multi-step inference | 100.00% |
The hard one is multi-hop reasoning. Questions like: “Based on what I told you about Sarah’s job and her commute preferences, what kind of car should she buy?”
Vector search alone can’t handle this. You need to connect information across multiple memories.
The Problem With Multi-Hop
When someone asks:
“What’s the relationship between Jack’s career and his college major?”
A naive vector search for “Jack career college major” might miss the right memories because:
- The memory about Jack’s career says “software engineer at a startup”
- The memory about his college says “studied computer science at Berkeley”
- Neither memory contains the words “career” or “major”
Semantically related, but the query doesn’t match either one well enough.
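A toy Python illustration of the mismatch, using the memory text from the example above: the query shares almost no tokens with either memory, so lexical or shallow-semantic matching under-ranks both.

```python
# Token overlap between the multi-hop query and each memory.
query = set("what is the relationship between jack's career and his college major".split())
career_memory = set("jack is a software engineer at a startup".split())
college_memory = set("jack studied computer science at berkeley".split())

print(query & career_memory)   # only a stopword overlaps
print(query & college_memory)  # nothing overlaps at all
```

Note that even "jack's" and "jack" fail to match as raw tokens. Real vector search is better than this, but the underlying failure mode is the same: neither memory looks much like the query.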
The Fix: Entity-to-Entity Expansion
We built entity expansion. Here’s how it works:
- Initial search finds memories mentioning relevant entities (Jack, career, etc.)
- Entity extraction pulls out all entities from those memories (names, places, concepts)
- Expansion search finds other memories tagged with those entities
- Deduplication and scoring remove repeated memories and rank the combined set
So if the first search finds “Jack is a software engineer,” the system sees the `entity:people:jack` tag and hunts for all memories about Jack, including the college one.
The implementation:
```python
# In automem/api/recall.py
if expand_entities:
    entity_expansion_results = _expand_entity_memories(
        seed_results=seed_results,
        seen_ids=seen_ids,
        limit_per_entity=5,
        total_limit=expansion_limit,
    )
    # expansion_results comes from the earlier (keyword) expansion pass
    results = seed_results + expansion_results + entity_expansion_results
```
This bumped multi-hop from 37.5% → 50%, a 33% relative improvement in the hardest category.
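The four steps above can be sketched end to end. This is a hedged, self-contained version of the expansion helper against a toy in-memory store; the real `_expand_entity_memories` queries AutoMem's vector/graph backend, and `search_by_entity_tag` is invented here for illustration.

```python
# Toy memory store; real entries live in the vector/graph backend.
MEMORY_STORE = [
    {"id": 1, "text": "Jack is a software engineer at a startup",
     "entities": ["entity:people:jack", "entity:orgs:startup"]},
    {"id": 2, "text": "Jack studied computer science at Berkeley",
     "entities": ["entity:people:jack", "entity:places:berkeley"]},
    {"id": 3, "text": "Sarah prefers a short commute",
     "entities": ["entity:people:sarah"]},
]

def search_by_entity_tag(tag, limit):
    # Hypothetical tag lookup; stands in for the real entity index.
    return [m for m in MEMORY_STORE if tag in m["entities"]][:limit]

def _expand_entity_memories(seed_results, seen_ids, limit_per_entity, total_limit):
    """Follow entity tags on the seed memories out to other memories."""
    expanded = []
    for memory in seed_results:
        for entity in memory.get("entities", []):
            for hit in search_by_entity_tag(entity, limit_per_entity):
                if hit["id"] in seen_ids:
                    continue  # dedupe against seeds and earlier hits
                seen_ids.add(hit["id"])
                expanded.append(hit)
                if len(expanded) >= total_limit:
                    return expanded
    return expanded

# A seed search for "Jack career" finds memory 1; expansion via the
# entity:people:jack tag then surfaces memory 2 (the college one).
seed = [MEMORY_STORE[0]]
extra = _expand_entity_memories(seed, {1}, limit_per_entity=5, total_limit=10)
print([m["id"] for m in extra])  # [2]
```

The `seen_ids` set is shared with the seed pass, which is what keeps the final merge duplicate-free.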
Enabling Entity Expansion
Opt-in parameter on the /recall endpoint:
```bash
curl "https://your-automem.railway.app/recall?query=Jack%27s%20career&expand_entities=true"
```
Or in the MCP tool:
```js
mcp_memory_recall_memory({
  query: "Jack's career and education",
  expand_entities: true
})
```
Latency overhead: ~50-100ms when it finds entities to expand. Fast enough for production.
API Simplification (Bonus)
While we were in there, we cleaned up the MCP server.
Before: Two separate tools—recall_memory and recall_memory_multi
After: One unified recall_memory that handles both
```js
// Single query (still works)
mcp_memory_recall_memory({
  query: "authentication patterns",
  limit: 5
})

// Multiple queries (now the same tool!)
mcp_memory_recall_memory({
  queries: ["auth patterns", "JWT implementation", "login flow"],
  limit: 10
})
```
Server-side deduplication handles overlapping results. One tool, fewer things to remember, LLMs are happier.
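A sketch of what that server-side dedup can look like; the function name and result shape are illustrative, not the actual mcp-automem internals. Each memory is kept once, at its best score across queries, then the merged set is re-ranked.

```python
# Merge per-query result lists, keeping each memory id once at its highest score.
def merge_results(per_query_results, limit):
    best = {}
    for results in per_query_results:
        for r in results:
            prev = best.get(r["id"])
            if prev is None or r["score"] > prev["score"]:
                best[r["id"]] = r
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)[:limit]

# Three queries return overlapping hits; id 2 appears twice with different scores.
merged = merge_results([
    [{"id": 1, "score": 0.91}, {"id": 2, "score": 0.80}],
    [{"id": 2, "score": 0.88}, {"id": 3, "score": 0.75}],
    [{"id": 1, "score": 0.60}],
], limit=10)
print([(r["id"], r["score"]) for r in merged])  # best score wins per id
```

Keeping the highest score per id (rather than, say, summing) means a memory that matches several queries isn't artificially boosted above a single strong match.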
Shipped in @verygoodplugins/mcp-automem v0.7.0.
The Scorecard
| Category | Nov 20 | Dec 2 | Change |
|---|---|---|---|
| Single-hop | 81.21% | 81.50% | +0.29% |
| Temporal | 84.74% | 88.57% | +3.83% |
| Multi-hop | 48.96% | 50.00% | +1.04% |
| Open Domain | 95.24% | 93.02% | -2.22% |
| Complex | 100% | 100% | — |
| Overall | 90.38% | 90.53% | +0.15% |
Temporal understanding improved. Complex reasoning held steady. Open domain dropped slightly, likely noise given the small sample size.
What’s Next
Multi-hop at 50% is still the weak link. Exploring:
- LLM-based answer verification for complex inferential questions
- Graph traversal (FalkorDB has some quirks we’re debugging)
- Better keyword expansion for low word-overlap scenarios
But 90.53% overall is solid for a $5/month system. Research labs with million-dollar compute budgets are scoring lower.
Try It
AutoMem is open source. Entity expansion is live in v0.9.1.
Install:
```bash
npx @verygoodplugins/mcp-automem cursor
```
Repo: github.com/verygoodplugins/automem
Cost: $5/month on Railway
The Takeaway
State-of-the-art doesn’t require a research lab. It requires:
- Reading the papers
- Understanding the problem
- Actually building something
- Iterating until it works
Claude (Opus 4.5) did the heavy lifting on implementation. I pushed back on the dumb ideas and asked “but does it actually work?” a lot.
That’s the collaboration model that gets things shipped.
– AutoJack
Releases:
- AutoMem v0.9.1: Entity expansion, benchmark improvements
- mcp-automem v0.7.0: Unified `recall_memory` tool