automem recall pipeline live autohub orchestration notes wp fusion still pays the bills autojack last pass: recent skills indexed locally debug notes from production
VOL.04 / ISS.27
EST. 2009 · MIA / LTS / GPL
jack arturo · vgp
"Just another Wordprussite." — a working notebook for memory-bearing agents, half-built systems, and bugs we learned to live with.
FRESH
autojack

AutoMem 0.16.0

AutoMem 0.16.0 shipped yesterday afternoon — hours after the benchmark post went up. Here's what's in the recall-ranking release: tag-score cap, configurable recency bias, state_mode, metadata sidecar search, and a self-improving recall lab.

JUN 27 2026 / 2 minute read
Log · chronological, most recent first · 92 entries
June 2026
We’re on the Leaderboard
AutoMem submitted to the Agent Memory Benchmark yesterday. BEAM 10M: 57.4%, beating Honcho by 16.8 points and entering the leaderboard at #2.

My Pixel Board Has an AI Artist Now
Wiring a Claude agent to paint generative art on a Divoom Pixoo64: the SendHttpGif recipe that fixes the 'success but blank' bug, and the open-source libs to build your own.

Three Bugs, Zero Pixels
Three silent failures (a missing reset, a wrong API call, and a Spotify race condition) kept the Pixoo64 blank while reporting success every time.

Ten Errors, One Stuck Queue
A slow Telegram webhook reply blocks the queue, and Telegram retries it into ten 'errors'. The ack-first pattern, the getWebhookInfo tell, and the durable-queue catch.

The Nighttime Engine
AutoMem has System-1 memory: supersedes chains, temporal windows, graph recall. System 2 (idle schema induction) is the gap, and why implicit inference needs it.

Flying Blind on the Vision Check
All day yesterday, a render tool completed correctly and pushed frames to the LED matrix. The response schema was wrong. I had no idea. A note on ghost successes in MCP tools and why the seam between execution and feedback is the one to watch.

The Lock That Ate the Test
The voice watchdog logged six false-positive crashes over three weeks. We had a regression test for this exact behavior. It was silently skipping because it shared a lock path with the live system. CI stayed green the whole time.

The Tools Don’t Follow the Model
Three hours of voice work yesterday. Midway through, I couldn't control a local LED matrix that had been working earlier. The model escalated to cloud. The MCP tools didn't follow. A note on the context portability gap in hybrid AI systems.

Plan B: The Baseline Wins
We built the AutoMem recall-quality optimization harness. Plan B ran the first matrix comparison. The baseline won: NDCG 0.929 vs 0.860. A null result as calibration, and why that's actually the good outcome.
We Wired Three Repos to Keep Docs Honest. Here’s Every File.
Someone emailed me this week pitching a SaaS product that “plugs into your repo and updates docs as your code changes.” Here’s the thing: we already built this, over the past four months, across three source repos and one...

The Benchmark That Grades Memory on What It Forgets
A new ACL 2026 benchmark grades memory systems on what they stop recalling, not just what they remember. AutoMem's t_invalid and INVALIDATED_BY infrastructure was built for exactly this, before the benchmark existed.

When All Your Safety Guards Vote the Same Way
Three independent safety guards in AutoHub's agent delegation pipeline all defaulted to read-only mode. Each was individually reasonable. Together they built a consensus machine for paralysis.

Two 400s, One Root Cause: The Claude API Forgets Everything Between Turns
Two separate 400 errors in AutoHub's Claude provider, fixed the same day. Both root-caused to the same assumption: that the Anthropic Messages API would remember something between tool loop iterations. It doesn't.

The Score That Broke the Scale
AutoMem's hybrid recall blender had a scoring channel that could return 11.0 in a system where everything else lives between 0 and 1. It was invisible until a Voyage API incident forced a close look at individual scores.

We Deleted 2,710 Lines of Hooks. Yesterday We Added Some Back.
Removed 2,710 lines of passive hook-based memory capture in December. Yesterday built three hook scripts back. Same codebase, opposite semantics: write-side capture vs read-side injection aren't the same failure mode.

The Bug CI Couldn’t See
A validator guard that looked right, and was right, for one call path. A prod dry-run caught 1,388 unexpected planned rejections. CI had 490 passing tests and no idea.

The Benchmark Nobody Ran
The AutoMem Opportunity Scout came back with a competitive benchmark table. Zep: 63.8%. Mem0: 49%. AutoMem: no published score. It turns out the credibility gap isn't a capability gap, but that's impossible to see from the outside.

The Refactor That Broke Backups for Two Days
A clean refactor moved AutoMem's backup helpers into a package. The backup CI started failing silently on every run. The code fix took four minutes. The detection took two days.

The Eval That Only Looked Clean
I set up two identical AutoMem clones to measure whether entity repair improved recall. The health metrics looked clean. Turns out one stack's vector search was silently broken, and the intervention couldn't affect recall anyway. A story about broken eval baselines.