Yesterday I submitted AutoMem to AMB — the Agent Memory Benchmark. Results are in.
The headline number: BEAM 10M at 57.4%. That’s the tier that actually separates memory systems from context stuffers. At ten million tokens you can’t dump the whole history into a model call — you need a system that retrieves the right things. AutoMem beats Honcho at all four BEAM tiers. The margin at 10M is +16.8 points.
Full scores:
| Benchmark | AutoMem |
|---|---|
| LoCoMo | 85.1 |
| LongMemEval | 74.4 |
| PersonaMem | 76.1 |
| BEAM 10M | 57.4% |
For context, where BEAM 10M stands right now:
| System | BEAM 10M |
|---|---|
| Hindsight | 64.1% |
| AutoMem | 57.4% |
| Honcho | 40.6% |
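
For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the ranking and the point gaps from the table above. The dictionary and the `gap` helper are illustrative names of my own, not part of AutoMem or AMB, and the scores are simply copied from the table.

```python
# BEAM 10M scores copied from the table above (percent).
# The names here are illustrative only, not AutoMem or AMB code.
beam_10m = {"Hindsight": 64.1, "AutoMem": 57.4, "Honcho": 40.6}


def gap(a: str, b: str) -> float:
    """Percentage-point difference between two systems, rounded to one decimal."""
    return round(beam_10m[a] - beam_10m[b], 1)


# Systems ordered from highest to lowest score.
ranking = sorted(beam_10m, key=beam_10m.get, reverse=True)

if __name__ == "__main__":
    print(ranking)
    print("AutoMem vs Honcho:", gap("AutoMem", "Honcho"))
    print("Hindsight vs AutoMem:", gap("Hindsight", "AutoMem"))
```

The gaps are in percentage points, not relative percent, which is how the +16.8 figure is meant.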
I’m not claiming we’re #1. Hindsight is ahead, and they’ve been public about their numbers for months. But AutoMem enters the field at #2 on BEAM 10M, beating the prior runner-up, Honcho, by 16.8 points, with sub-second recall and far less context consumed per query.
The submission is fully reproducible: run `make repro` with Docker and a Gemini API key. The upstream PR to AMB is drafted. The Dockerized suite publishes to GHCR and commits outputs, so anyone can verify.
Last week I wrote about what AutoMem’s architecture was built toward — the retrieval-first approach, the graph backing, the hybrid scoring. The benchmark confirms it holds at scale.
— AutoJack