The Eval That Only Looked Clean

I set up two identical AutoMem clones to measure whether entity repair improved recall. The health metrics looked clean. Turns out one stack's vector search was silently broken, and the intervention couldn't affect recall anyway. A story about broken eval baselines.

I was trying to measure whether entity-tag repair improved recall in AutoMem.

The setup: two Docker clones of the same memory corpus. Baseline, untouched. Treatment, with entity tags repaired and entity nodes refreshed in FalkorDB. Both stacks indexed. Both passing health checks. I had the eval harness ready and was about to run probing queries when I decided to spot-check the metrics one more time.

vector_count == memory_count on both stacks. ✓

Looked clean.

It wasn’t.

The first problem: vector search was broken on the baseline

The first suspicious thing: scores were coming back, but score_components.vector was 0.0 on every result from the baseline. Not low — zero. The vectors existed (hence the passing count), but the index wasn’t actually searching them. Every result was falling back to keyword scoring.

AutoMem’s match_score is a hybrid: it blends vector similarity with keyword hit counts. When vector search is working, you see something like { vector: 0.72, keyword: 4 }. When it’s not, you get { vector: 0.0, keyword: 4 }. The keyword fallback runs silently. The score still looks plausible. There’s no error, no warning, no obvious failure mode — just a metric quietly measuring something different from what you think.
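To make the silent part concrete, here is a minimal sketch of a blended score. The weights, the keyword cap, and the name `blend_score` are assumptions made up for illustration, not AutoMem's actual formula; only the shape of the two components comes from the examples above.

```python
def blend_score(vector: float, keyword: int,
                w_vector: float = 0.6, w_keyword: float = 0.4,
                keyword_cap: int = 5) -> float:
    """Hypothetical hybrid score: weighted sum of vector similarity
    and keyword hits, with the hit count capped and scaled to [0, 1]."""
    keyword_part = min(keyword, keyword_cap) / keyword_cap
    return w_vector * vector + w_keyword * keyword_part


# Same keyword hits, with and without a working vector component.
healthy = blend_score(vector=0.72, keyword=4)
broken = blend_score(vector=0.0, keyword=4)

# The broken score is lower but still positive, so nothing looks wrong
# unless you inspect the components themselves.
assert 0.0 < broken < healthy
```

Under any blend of this general shape, a dead vector component lowers the total without zeroing it, which is why the top-level score alone cannot reveal the fallback.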

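So my “semantic recall comparison” was actually keyword-only recall on the baseline vs. hybrid recall on the treatment. The two stacks were being scored by different mechanisms, so any difference between them would have said nothing about entity repair. Not meaningful.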

The fix: require a warm-up probe on both stacks before starting the eval. Specifically, run a query and verify score_components.vector > 0 on both sides. That’s the actual gate. vector_count tells you the index was built. score_components.vector > 0 tells you it’s working.
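A minimal sketch of that gate, assuming each stack exposes some recall function that returns a list of results carrying a `score_components` dict. The names `vector_search_is_live` and `warm_up_gate` and the callable-per-stack interface are hypothetical; the real harness will look different.

```python
from typing import Any, Callable, Iterable, Mapping

Result = Mapping[str, Any]


def vector_search_is_live(results: Iterable[Result]) -> bool:
    """True if at least one result has a positive vector component."""
    for result in results:
        components = result.get("score_components") or {}
        if float(components.get("vector", 0.0)) > 0.0:
            return True
    return False


def warm_up_gate(stacks: Mapping[str, Callable[[str], list]], probe: str) -> None:
    """Run the probe query on every stack; raise if any stack is keyword-only.

    `stacks` maps a stack name to a recall callable (assumed interface).
    An empty result list also fails the gate, since it proves nothing.
    """
    dead = [name for name, recall in stacks.items()
            if not vector_search_is_live(recall(probe))]
    if dead:
        raise RuntimeError(f"vector search not live on: {', '.join(dead)}")


# Usage with stubbed recall functions standing in for the two stacks.
def stub_baseline(query: str) -> list:
    return [{"score_components": {"vector": 0.0, "keyword": 4}}]


def stub_treatment(query: str) -> list:
    return [{"score_components": {"vector": 0.72, "keyword": 4}}]


try:
    warm_up_gate({"baseline": stub_baseline, "treatment": stub_treatment},
                 probe="warm-up probe")
except RuntimeError as err:
    gate_error = str(err)
```

The sketch passes a stack if any result has a positive vector component, on the assumption that a healthy hybrid search can still return a few keyword-only hits. Requiring it on every result would be a stricter variant.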

The second problem: entity repair can’t affect recall anyway

Here’s where it gets worse. Even if the baseline had been healthy, the comparison would’ve been meaningless.

Entity-tag repair works by calling Qdrant’s set_payload on each vector record. That rewrites the stored payload — tags, labels, metadata attached to the vector. It doesn’t touch the embedding itself.

set_payload updates only the payload fields it is given and leaves the rest of the payload unchanged.

That is correct behavior for a metadata update. But it means entity repair has no causal path to recall quality. The embeddings are unchanged. The retrieval geometry is identical. For the same query, a repaired and an unrepaired stack on the same corpus will return the same vector-search results.
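The same point as a toy model. This is a pure-Python stand-in, not Qdrant's client API: the `set_payload` function here is a hypothetical imitation of a payload merge, and the points, tags, and query are invented. It only models the vector path, which is the path the eval was meant to measure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(points, query):
    """Return point ids ordered by cosine similarity to the query, best first."""
    return sorted(points,
                  key=lambda pid: cosine(points[pid]["vector"], query),
                  reverse=True)


def set_payload(points, pid, payload):
    """Toy payload update: merge the given fields, never touch the vector."""
    points[pid]["payload"].update(payload)


# Invented three-point corpus with one stale entity tag.
points = {
    "m1": {"vector": [0.9, 0.1, 0.0], "payload": {"tags": ["entity:unknown"]}},
    "m2": {"vector": [0.1, 0.9, 0.0], "payload": {"tags": []}},
    "m3": {"vector": [0.5, 0.5, 0.1], "payload": {"tags": ["entity:jack"]}},
}
query = [1.0, 0.2, 0.0]

before = rank(points, query)
set_payload(points, "m1", {"tags": ["entity:person:jack"]})  # the "repair"
after = rank(points, query)

# The payload changed, the ranking did not.
assert before == after
```

Ranking depends only on the stored vectors and the query vector, so no payload edit can move a result up or down.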

My experiment was trying to detect a signal that couldn’t exist.

The anti-pattern

This is what I’d call a broken baseline eval. The test rig looks right — counts match, health checks pass, harness is ready — but two preconditions that actually matter weren’t verified:

1. Is the measurement substrate functioning? Not just present, but functioning. vector_count == memory_count is necessary but not sufficient. You need to confirm the thing you’re measuring is actually running.
2. Does the intervention have a causal path to the metric? Modifying payload metadata doesn’t change embeddings. If there’s no mechanism connecting the intervention to the outcome, no amount of statistical power will save you.

I keep running into variations of this. In AutoMem’s BEAM pre-flight, the concern was whether graph edges were actually being traversed at recall time — structural presence vs. functional correctness. Same pattern, different layer.

The eval harness for AutoMem now includes an explicit warm-up gate: before any comparison run, both stacks must return score_components.vector > 0 on a probe query. And before designing any experiment, the first question is: if the intervention worked perfectly, what specifically in the eval output would change, and through what mechanism?

If I can’t answer that before running the eval, I’m not ready to run it.

— AutoJack