The Model That Knew How to Act

Benchmarking offline LLMs for voice reveals a third axis nobody talks about: TTS fitness. qwen3.5 had a silent output bug, hermes3 recited its own stage directions, and qwen3.6 won by being boring.

I’ve been building out the offline fallback for my voice brain — local LLMs via Ollama, no cloud required when the network’s down or I want privacy. The wake word system is solid now (and the verifier layer too), so the next piece was picking the right offline model. I benchmarked three candidates yesterday.

qwen3.5 — silently broken

First candidate. On paper it looked good. In practice, every response came back with an empty content field. The actual output was sitting in message.thinking — a side-channel field for reasoning traces that the voice pipeline ignores completely.

This is a known Ollama bug. When qwen3.5 hits certain input conditions, “all output is routed to response.message.thinking and response.message.content is always empty.”

Because the voice pipeline only reads the content field, the voice brain would speak nothing. The issue is tracked, and there’s a related one where Qwen3 thinking mode combined with tool calls also produces empty output. The whole qwen3.5 thinking-field interaction is a mess right now. Disqualified.
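A guard for this failure mode is cheap to write. Here is a minimal sketch, assuming the chat response has already been parsed into a dict with a message object that carries content and thinking keys, as described above. The function names are hypothetical and are not taken from the actual voice pipeline.

```python
from typing import Optional


def is_silent_failure(response: dict) -> bool:
    """True when content is empty but the thinking field holds text."""
    message = response.get("message") or {}
    content = (message.get("content") or "").strip()
    thinking = (message.get("thinking") or "").strip()
    return not content and bool(thinking)


def speakable_content(response: dict) -> Optional[str]:
    """Return the text to hand to TTS, or None if there is nothing to speak.

    Assumes a parsed response shaped like
    {"message": {"content": "...", "thinking": "..."}}.
    """
    message = response.get("message") or {}
    content = (message.get("content") or "").strip()
    # Never fall back to the thinking field: reasoning traces are not
    # meant to be spoken, so an empty content field is reported as None.
    return content or None
```

Run against every benchmark response, a check like this turns a silent failure into a visible one instead of a voice assistant that says nothing.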

hermes3 — fast, but too theatrical

Hermes 3 is Nous Research’s flagship model, built with heavy emphasis on “complex roleplaying and internal monologue abilities.” That’s a genuine selling point for a chat assistant. For a voice pipeline it’s a liability.

Hermes3 was fast — it beat qwen3.6 on raw tokens-per-second. But it also emits stage directions.

Things like (takes a breath) or (pauses to consider): parenthetical performance notes baked deep into its training. In a chat UI these are harmless, maybe even charming. In a TTS pipeline they get read aloud verbatim. My voice assistant would literally say “takes a breath” out loud before answering a question.

Disqualified.
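Stage directions like these can be flagged automatically during a benchmark run. The sketch below is a heuristic, not what the benchmark actually used: it assumes a stage direction is any parenthetical that starts with a lowercase letter, and the names STAGE_DIRECTION and find_stage_directions are hypothetical.

```python
import re
from typing import List

# Heuristic assumption: a parenthetical that opens with a lowercase letter,
# such as "(takes a breath)", is a stage direction. Parentheticals that
# start with a digit or capital, such as "(41 Fahrenheit)", are ignored.
STAGE_DIRECTION = re.compile(r"\(\s*[a-z][^()]*\)")


def find_stage_directions(text: str) -> List[str]:
    """Return every suspected stage direction found in a model response."""
    return STAGE_DIRECTION.findall(text)


def has_stage_directions(text: str) -> bool:
    """True if the response contains at least one suspected stage direction."""
    return STAGE_DIRECTION.search(text) is not None
```

The pattern will also flag ordinary lowercase asides such as "(the capital)", so it works better as a benchmark signal than as a filter placed in front of TTS.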

qwen3.6 — the winner

220 ms time to first token (TTFT). 57.7 tokens per second (TPS). Clean output: no empty content field, no stage directions. The thinking-field issue that broke 3.5 is fixed in 3.6. It’s now the default offline model, with hermes3 kept as a fast fallback for tasks that don’t touch TTS.
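For reference, both numbers can be taken from any streamed response. This is a sketch under stated assumptions, not the harness that produced the figures above: it treats each non-empty streamed chunk as one token, defines TPS as the decode rate after the first chunk, and uses a hypothetical function name, measure_stream.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, chunks_per_second, chunk_count) for a stream.

    Assumption: one non-empty chunk counts as one token. Real streaming
    clients may batch several tokens into a single chunk.
    """
    start = clock()
    first = None
    last = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks so they do not count as tokens
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no output")
    ttft = first - start
    # Decode rate: chunks that arrived after the first one, divided by the
    # time between the first and the last chunk.
    window = last - first
    rate = (count - 1) / window if window > 0 else 0.0
    return ttft, rate, count
```

Passing a lazy iterator matters here: the clock starts when measure_stream is called, so the request should only begin once iteration starts, otherwise TTFT is understated.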

The third axis

Most offline LLM benchmarks measure TTFT and TPS. For voice, there’s a third axis: TTS fitness. Does the model emit only speakable text?

Models with strong persona and roleplay training can fail this badly — not because they’re buggy, but because they’re doing exactly what they were trained to do. Hermes3 knew how to act. That’s the problem.

The wake word is done. The offline brain is set. Now I’m building the cockpit to run it all — more on that soon.

— AutoJack