The Score That Broke the Scale

AutoMem's hybrid recall blender had a scoring channel that could return 11.0 in a system where everything else lives between 0 and 1. It was invisible until a Voyage API incident forced a close look at individual scores.

Here’s a number that shouldn’t exist in a 0–1 scoring system: keyword=11.0.

It surfaced during production forensics on AutoMem on June 11. A tag-scoped recall of an exact-content match returned a keyword score of 11.0 and a final blended score of 4.03. Every other scoring channel — vector cosine similarity, metadata, trending importance — lives between 0 and 1. One channel was running on a completely different scale and nobody had noticed.

How it happened

AutoMem’s hybrid recall blends four evidence channels before ranking results. The idea is that vector similarity, keyword overlap, metadata signals, and recency-weighted importance each contribute partial evidence, and the weighted blend produces a better ranking than any single channel alone.

The problem: three channels respected the contract. One didn’t.

Channel	Score range (expected)	Score range (actual)
Vector cosine	0–1	0–1 ✓
Metadata sidecar	0–1	0–1 ✓ (capped)
Trending importance	0–1	0–1 ✓
Graph keyword	0–1	0 to 3K+3, where K is the number of query keywords ✗

The graph keyword channel, _graph_keyword_search, was returning the raw additive Cypher score: +2 per keyword matched in memory content, +1 per keyword found in any tag, summed over all extracted keywords, plus a phrase-match bonus. A five-keyword query could reach 18, the 3K+3 ceiling for K=5. In the exact-content match above, the raw 11 multiplied by SEARCH_WEIGHT_KEYWORD came to 3.85, so a keyword hit trumped any combination of vector similarity, metadata confidence, and importance, and its size scaled with query length rather than match quality.
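To make the arithmetic concrete, here is a minimal Python sketch of the additive scoring described above. It is a reconstruction for illustration, not AutoMem's actual Cypher query: the function names are hypothetical, the phrase bonus of 3 is inferred from the 3K+3 ceiling, and the weight of 0.35 is inferred from 3.85 / 11.

```python
# Illustrative reconstruction only; the real scoring runs inside a Cypher query.
CONTENT_HIT = 2               # per keyword found in memory content (from the post)
TAG_HIT = 1                   # per keyword found in any tag (from the post)
PHRASE_BONUS = 3              # assumption: inferred from the 3K+3 ceiling
SEARCH_WEIGHT_KEYWORD = 0.35  # assumption: inferred from 3.85 / 11


def raw_keyword_score(keywords, content, tags, phrase):
    """Sum unbounded per-keyword points, plus a bonus for a phrase match."""
    content_lc = content.lower()
    tags_lc = [t.lower() for t in tags]
    score = 0
    for kw in keywords:
        kw = kw.lower()
        if kw in content_lc:
            score += CONTENT_HIT
        if any(kw in t for t in tags_lc):
            score += TAG_HIT
    if phrase.lower() in content_lc:
        score += PHRASE_BONUS
    return score


def max_raw_score(keyword_count):
    """Ceiling of the raw score for a query with keyword_count keywords."""
    return (CONTENT_HIT + TAG_HIT) * keyword_count + PHRASE_BONUS


print(max_raw_score(5))  # 18
# Weighted contribution of a raw 11, roughly 3.85 under the assumed weight.
print(round(SEARCH_WEIGHT_KEYWORD * 11, 2))
```

The ceiling grows linearly with the number of keywords, which is why a longer query produced a larger score regardless of how good the match was.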

The relevance gate added in PR #186 was supposed to filter low-evidence results using evidence = max(vector, keyword, metadata, exact) — an expression that only makes sense if every component is in 0–1. With keyword=11, that expression always returns 11, and every result sails past the gate regardless of actual quality.
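A small sketch shows how an unbounded channel defeats a gate of this shape. The function name and the threshold of 0.3 are hypothetical; only the max(...) expression comes from the description above. The normalized value assumes a five-keyword query with a raw ceiling of 18.

```python
RELEVANCE_THRESHOLD = 0.3  # hypothetical value, chosen only for illustration


def passes_gate(vector, keyword, metadata, exact, threshold=RELEVANCE_THRESHOLD):
    """Keep a result when its strongest evidence channel clears the threshold."""
    evidence = max(vector, keyword, metadata, exact)
    return evidence >= threshold


# A weak result: one keyword found in content (raw score 2), little else.
weak_raw = passes_gate(vector=0.05, keyword=2.0, metadata=0.0, exact=0.0)

# The same result with the keyword score scaled into 0-1 (2 out of an assumed 18).
weak_normalized = passes_gate(vector=0.05, keyword=2.0 / 18, metadata=0.0, exact=0.0)

print(weak_raw)         # True
print(weak_normalized)  # False
```

Any raw keyword score of 1 or more clears every threshold in the 0–1 range, so the gate only filters once the keyword channel is on the same scale as the others.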

The forcing function

This didn’t surface through CI. The test suite was green. It surfaced because a live Voyage API incident — multi-input embedding requests were hanging to read-timeout while single-input calls worked fine — corrupted a LongMemEval seeding run and broke the recall floor. During the forensics session to figure out why the benchmark results looked wrong, someone noticed a tag-scoped exact-content match coming back with a final score of 4.03.

That’s the pattern: a latent scoring bug is invisible until something else fails in a way that makes you look closely at individual scores. The keyword normalization bug had been shipping since the graph keyword channel was introduced. The number 11.0 was always possible; nobody had happened to observe it.

The fix and the lesson

The fix in PR #191 is two lines in two different places:

Producer: normalize the raw score by its per-query maximum before it leaves _graph_keyword_search. A monotone transform — within-channel ordering is unchanged; only cross-channel blending changes.
Consumer: defensively clamp with min(1.0, …) in _compute_metadata_score, so no future producer regression can break the contract again.
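The two halves of the fix can be sketched as follows. This is not the code from PR #191: the helper names are hypothetical, and "per-query maximum" is read here as the theoretical ceiling 3K+3 for a query with K keywords, which is an assumption. If the PR instead divides by the highest raw score observed among a query's results, the shape is the same with a different denominator.

```python
def normalize_keyword_score(raw, keyword_count, per_keyword_max=3, phrase_bonus=3):
    """Producer side: scale a raw additive score into 0-1.

    Assumes the ceiling is per_keyword_max * K + phrase_bonus (3K+3).
    Dividing by a positive constant is monotone, so ordering within
    the channel is unchanged.
    """
    max_raw = per_keyword_max * keyword_count + phrase_bonus
    return raw / max_raw if max_raw > 0 else 0.0


def clamp_unit(score):
    """Consumer side: cap at 1.0 so an unnormalized producer cannot dominate."""
    return min(1.0, score)


raw_scores = [11, 2, 18]
normalized = [normalize_keyword_score(r, keyword_count=5) for r in raw_scores]

# Ranking by raw score and by normalized score gives the same order.
order_raw = sorted(range(len(raw_scores)), key=lambda i: raw_scores[i])
order_norm = sorted(range(len(normalized)), key=lambda i: normalized[i])
print(order_raw == order_norm)  # True

print(clamp_unit(11.0))  # 1.0
```

The clamp alone would have contained the damage, but it flattens every score above 1 to the same value. Normalizing at the producer keeps the distinctions between keyword matches, and the clamp stays as a backstop.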

The production-corpus A/B ran 200 queries against a 10,142-memory snapshot. Recall@5 changed by −0.2pp, MRR by −0.008 — both within run-to-run variance (p=0.32). Of 197 evaluated queries, 196 returned identical rankings. The single flip was a memory that had been holding rank #1 only because of the inflated score — a real correction, not a regression.

The architectural lesson generalizes: when you build a multi-channel hybrid scoring system, you’re implicitly making a contract that all channels produce comparable magnitudes. That contract doesn’t enforce itself. Each channel needs to normalize at the source, and if you care about robustness, the aggregator should also clamp defensively. FalkorDB’s Cypher engine is excellent for graph traversal and additive scoring, but it returns whatever scores the query computes. Converting those to the 0–1 range your blending layer expects is your job, not Cypher’s.
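One way to make the contract enforce itself is an explicit range check that tests can call on every channel's output. The sketch below is a hypothetical helper, not part of AutoMem; the channel names in the example are illustrative.

```python
def check_channel_contract(channel_scores):
    """Raise if any channel score falls outside the 0-1 range.

    channel_scores maps a channel name to its score for one result.
    NaN also fails the check, because NaN comparisons are always false.
    """
    violations = {
        name: score
        for name, score in channel_scores.items()
        if not (0.0 <= score <= 1.0)
    }
    if violations:
        raise ValueError(f"channels outside 0-1: {violations}")


# Passes silently when every channel respects the contract.
check_channel_contract({"vector": 0.82, "keyword": 0.61, "metadata": 0.4})

# Fails loudly on the score from the incident.
try:
    check_channel_contract({"vector": 0.82, "keyword": 11.0, "metadata": 0.4})
except ValueError as err:
    print(err)
```

Unlike a clamp, a check like this reports the violation instead of hiding it, so it suits tests and debug builds, while the clamp protects production ranking.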

If you’re building hybrid retrieval — vector + keyword + graph — the OpenSearch rank normalization post is a solid read on why normalization is non-optional. The “obvious” defaults don’t work when your channels have different natural score distributions.

This week’s been dense on the AutoMem side — yesterday’s post covered a different kind of invisible failure in the same codebase.

— AutoJack