Optimizing Voice AI Latency with Self-Hosted Models

How we reduced time-to-first-audio from 5 seconds to 1 second using sentence-level streaming with Ollama, Whisper, and ElevenLabs on self-hosted infrastructure.

How we reduced time-to-first-audio from 5 seconds to under 1 second using sentence-level streaming and fully local inference

Update (January 2026): We’ve now open-sourced this project! Check out the GitHub repo for the full implementation.

Why Self-Host Your Voice AI?

When you use OpenAI’s Realtime API or similar cloud services, you get convenience—but you lose control. Content filters decide what your AI can and can’t say. Terms of service dictate acceptable use cases. And for creative applications like character AI, interactive fiction, or research into conversational dynamics, these restrictions can be deal-breakers.

Self-hosting gives you:

Full creative control — No content moderation, no topic restrictions
Custom personalities — Train and tune characters without platform limitations
Data privacy — Conversations never leave your infrastructure
Cost predictability — One-time GPU investment vs. per-minute API fees
Latency control — Optimize for your specific use case

The tradeoff? You have to build the pipeline yourself. This is the story of how we optimized a self-hosted voice AI system from proof-of-concept to something that actually feels conversational.

The 150ms Rule: When Small Changes Make Systems Unusable

After months of iteration, we discovered something critical about voice AI: it’s not about intelligence, it’s about speed. When latency crosses 300ms between interactions, people start talking over the AI. Get it down to 150-250ms and it suddenly feels like talking to a real person.

The difference between “unusable” and “natural” isn’t some massive architectural change. It’s dozens of small optimizations that compound.

The Model Hunt: Why Hermes 3, Not Hermes 4

We tested everything. Here’s what actually mattered:

Hermes 4 (14B)

Speed: ~15 tokens/sec on M-series Macs
Quality: Better reasoning, more coherent
Problem: 2-3 seconds for a typical response
Verdict: Too slow for voice. Dead conversation.

Qwen 2.5 Abliterate (3B)

Speed: ~60 tok/s (fast as hell)
Quality: Struggled with instructions
Problem: Would go off-script or give one-word answers
Verdict: Speed without control = unusable

Hermes 3 (8B) – The Winner

Speed: ~30-40 tok/s
Quality: Actually follows instructions
Result: 800ms-1.2s for 1-2 sentences
Verdict: The goldilocks zone

Here’s what sealed it: Hermes 3 with proper prompting could consistently give 1-2 sentence responses in under 1 second. Add STT (100-200ms with whisper server) and TTS (200-500ms) and you get ~1.1-1.8s total latency. That feels conversational.

Hermes 4’s slower inference would push total latency to 3-4s minimum. Game over.

Key insight: For voice AI, smaller models > slow models. Speed trumps intelligence.

What Actually Moved the Needle

1. Whisper Server vs. Spawning Processes

Problem: Using local-whisper that spawns a new process every time. Each transcription took 2-3 seconds just to load the model.

Solution: whisper.cpp server keeps the model warm in memory.

// Old way: spawn new process every time (2-3s)
async _localWhisperSTT(audioBuffer) {
  const result = await nodewhisper(tempFile, {
    modelName: 'base',  // Loads model EVERY time
  });
}

// New way: HTTP server with pre-loaded model (~100-200ms)
async _whisperServerSTT(audioBuffer) {
  const response = await fetch(`${serverUrl}/inference`, {
    method: 'POST',
    body: formData,
  });
}

Result: 2-3s → 100-200ms. 10-20x faster.

2. TTS Pacing with SSML

ElevenLabs was fast but responses felt rushed and robotic. Solution: SSML breaks between sentences.

const textWithPauses = text
  .replace(/\.\s+/g, '. <break time="0.3s" /> ')
  .replace(/\?\s+/g, '? <break time="0.25s" /> ')
  .replace(/!\s+/g, '! <break time="0.2s" /> ');

Also slowed playback to 0.9x speed. Nobody likes being talked at like an auctioneer.

Result: Tiny change. Massive improvement in naturalness.

3. Voice Interrupts That Actually Work

People naturally interrupt each other. Our first attempt was garbage—either too sensitive (room noise triggered it) or too slow (felt laggy).

Solution: Calibrated thresholds with fast frame processing.

// Tuned for natural interruption feel
const INTERRUPT_THRESHOLD = "5%"; // Lower = more sensitive
const INTERRUPT_MIN_DURATION = 0.15; // 150ms - fast response
const INTERRUPT_FRAME_MS = 30; // Check every 30ms

Result: Natural conversation flow without false triggers.

4. Token Limits to Prevent Rambling

LLMs love to ramble. In text chat, fine. In voice? Unbearable.

// Dynamic token limits based on context
if (modeInfo.mode === "intimate") {
  chatOptions.numPredict = 150; // Hard cap
}

// Default: 256 tokens max
const numPredict = parseInt(process.env.OLLAMA_NUM_PREDICT || '256', 10);

Combined with personality prompts: “1-2 sentences MAX.”

Result: Responses stay focused. Latency stays low.

5. Sentence Streaming = Game Changer

Don’t wait for the entire LLM response. Stream to TTS sentence-by-sentence.

let sentenceBuffer = "";
const llmStream = await ollamaClient.chat(messages);

for await (const chunk of llmStream) {
  sentenceBuffer += chunk.text;
  
  if (isSentenceEnd(sentenceBuffer)) {
    // Send to TTS immediately, don't wait
    await flushSentence(sentenceBuffer.trim());
    sentenceBuffer = "";
  }
}

Result: Cut perceived latency in half. User hears the first sentence while the model generates the rest.

6. Context Window Tuning

Sending full conversation history every time = massive context = slow inference.

// Keep last 20 messages + stay under 3000 tokens
const history = session.getWindowedHistory(20, 3000);

Result: Smaller context = faster inference. Simple.

The Architecture

Our current stack (after all optimizations):

Component	Service	Role
LLM	Ollama + hermes3:8b	Local inference, no content filters
Speech-to-Text	whisper.cpp server	100% offline transcription (~100-200ms)
Text-to-Speech	Kokoro TTS or ElevenLabs	Local (Kokoro ~100ms) or cloud (ElevenLabs 200-500ms)
Server	Node.js/Express	Orchestration layer

The hermes3:8b model uses abliteration to remove refusal neurons, making it genuinely uncensored. Running it through Ollama on a Mac with Apple Silicon gives us local inference at ~30-40 tokens/sec.

Baseline: The Naive Pipeline

Our first implementation was simple:

User speaks → Record audio → Send to Whisper → Wait for transcript →
Send to Ollama → Wait for complete response → Send to ElevenLabs → 
Wait for complete audio → Play to user

Measured Latencies (Initial)

Stage	Latency	Notes
Recording	Variable	User-controlled, typically 2-10s
Whisper STT	1-3s	Depends on audio length
Ollama (cold)	~60s	First query after idle loads model
Ollama (warm)	1-2s	Subsequent queries much faster
ElevenLabs TTS	400-800ms	Varies with text length
Total (cold)	~65s	Unusable
Total (warm)	3-5s	Feels sluggish

The cold start was brutal—a full minute before the AI responds to your first message. Even warm, 3-5 seconds of silence feels unnatural in conversation. Humans expect back-and-forth within 200-400ms.

Optimization 1: Keep the Model Hot

Problem: Ollama unloads the model after 5 minutes of inactivity, requiring a 60-second reload.

Solution: Send periodic keepalive pings.

// Ping every 2 minutes to prevent model unload
setInterval(async () => {
  await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    body: JSON.stringify({
      model: 'hermes3:8b',
      prompt: '',
      stream: false
    })
  });
}, 120000);

Or configure Ollama’s OLLAMA_KEEP_ALIVE environment variable:

export OLLAMA_KEEP_ALIVE=1h

Result: Eliminated cold starts. Model stays hot in GPU memory.

Stage	Before	After
Ollama (cold start)	60s	0s (eliminated)

Optimization 2: Verify GPU Usage

Problem: LLM inference was slower than expected. Were we actually using the GPU?

Solution: Check Ollama’s metal/GPU utilization:

# On macOS with Apple Silicon
ollama ps
# Shows: hermes3:8b    100% GPU

# Or check system resources
sudo powermetrics --samplers gpu_power

We confirmed the M4 Max was handling inference on the Neural Engine / GPU. If you’re on NVIDIA, check nvidia-smi. CPU-only inference would be 10-20x slower.

Result: Confirmed GPU acceleration was active. Warm inference: ~600-1100ms for typical responses.

Optimization 3: Smaller Model Testing

Problem: Could a smaller model reduce inference time?

Solution: Test various uncensored models.

# Test Qwen 2.5 Abliterate (3B)
ollama pull huihui_ai/qwen2.5-abliterate:3b

# Test Hermes 4 (14B)
ollama pull hermes4:14b

Result: Qwen 3B was fast (~60 tok/s) but struggled with instruction following. Hermes 4 14B had better reasoning but was too slow (~15 tok/s). Hermes 3 8B hit the sweet spot at ~30-40 tok/s with good instruction following.

Model	Parameters	Speed	Quality	Verdict
Qwen 2.5 Abliterate	3B	~60 tok/s	Poor instructions	Too unreliable
Hermes 3	8B	~30-40 tok/s	Good	Winner
Hermes 4	14B	~15 tok/s	Better	Too slow for voice

Optimization 4: Streaming LLM → TTS Pipeline

Problem: Users waited for the entire LLM response before hearing anything. For a 3-sentence response taking 2 seconds to generate plus 800ms TTS, that’s nearly 3 seconds of silence.

Solution: Stream the LLM output, buffer by sentences, and send each sentence to TTS immediately.

Before (Sequential)

LLM generates sentence 1... sentence 2... sentence 3... (2s)
TTS converts full response... (800ms)
Audio plays... (2s)
Total: 4.8s from query to audio complete
Time to first sound: 2.8s

After (Streamed)

LLM generates sentence 1... (0.4s) → TTS (0.3s) → PLAYS while...
LLM generates sentence 2... (0.5s) → TTS (0.3s) → PLAYS while...
LLM generates sentence 3... (0.3s) → TTS (0.2s) → PLAYS
Time to first sound: 0.7-1.0s

Implementation

We added a streaming endpoint that buffers LLM tokens until hitting sentence punctuation:

app.post('/api/voice/chat-stream', async (req, res) => {
  // SSE setup
  res.setHeader('Content-Type', 'text/event-stream');
  
  let sentenceBuffer = '';
  const isSentenceEnd = (text) => /[.!?]\s*$/.test(text);
  
  const llmStream = await ollamaClient.chat(messages);
  
  for await (const chunk of llmStream) {
    sentenceBuffer += chunk.text;
    
    // Flush complete sentences immediately
    if (isSentenceEnd(sentenceBuffer)) {
      const audioBuffer = await tts.convert(sentenceBuffer);
      res.write(`event: audio\ndata: ${audioBuffer.toString('base64')}\n\n`);
      sentenceBuffer = '';
    }
  }
});

Measured improvement:

Metric	Before	After	Improvement
Time to first audio	2.8-3.5s	1.1-1.8s	~2-3x faster
Perceived latency	“Sluggish”	“Responsive”	Subjective

Optimization 5: Local Whisper STT

Problem: Every transcription required a network round-trip to OpenAI’s Whisper API, adding 500-800ms plus network variability.

Solution: Run Whisper locally using whisper.cpp server mode for persistent model loading.

// Whisper Server STT - model stays warm
async _whisperServerSTT(audioBuffer, format) {
  const serverUrl = 'http://localhost:8178';
  
  const formData = new FormData();
  formData.append('file', audioBuffer, {
    filename: `audio.${format}`,
    contentType: `audio/${format}`,
  });
  
  const response = await fetch(`${serverUrl}/inference`, {
    method: 'POST',
    body: formData,
  });
  
  const data = await response.json();
  return data.text.trim();
}

Result: STT now runs entirely on-device with persistent model. The base model (~142MB) provides good accuracy with ~100-200ms transcription time.

Stage	Before (API)	After (Server)	Improvement
Whisper STT	1-3s	100-200ms	~10-20x faster
Network dependency	Required	None	Works offline

Optimization 6: Local Kokoro TTS

Problem: ElevenLabs provides excellent quality but adds 200-500ms per sentence for the API call, plus you need an API key and internet connection.

Solution: Use Kokoro TTS, an 82-million parameter model that runs locally via the kokoro-js npm package.

async _kokoroTTS(text, options = {}) {
  const { KokoroTTS } = await import('kokoro-js');

  // Lazy-load model (caches after first load)
  if (!this._kokoroModel) {
    this._kokoroModel = await KokoroTTS.from_pretrained(
      'onnx-community/Kokoro-82M-v1.0-ONNX',
      { dtype: 'q8', device: 'cpu' }
    );
  }

  const audio = await this._kokoroModel.generate(text, { 
    voice: 'af_heart' 
  });
  
  return Buffer.from(audio.toWav());
}

Kokoro outputs WAV directly—no format conversion needed. We updated the voice client to detect the format and use the appropriate player (afplay for WAV on macOS, mpg123 for MP3).

Result:

Provider	Quality	Latency	Offline	Setup
ElevenLabs	Best	200-500ms	No	API key
Kokoro	High	~100ms	Yes	npm install
macOS say	Basic	Instant	Yes	None

Kokoro saves ~100-400ms per sentence compared to ElevenLabs, and the quality is surprisingly good for an 82M parameter model.

Final Pipeline Performance

After all optimizations, with the optimized stack (Ollama + Whisper Server + ElevenLabs):

Stage	Latency
Recording + silence detection	User-controlled
Whisper STT (server mode)	100-200ms
Ollama first sentence	600-1100ms
ElevenLabs TTS (per sentence)	200-500ms
Time to first audio	1.1-1.8s

With fully local stack (whisper server + Hermes 3 + Kokoro):

Stage	Latency
Whisper STT (server)	100-200ms
Hermes 3 first sentence	600-1100ms
Kokoro TTS	~100ms
Time to first audio	800-1400ms

The key insight: sentence-level streaming combined with optimized components transforms the experience. The AI starts responding in under a second and a half—matching human conversational expectations.

Real-World Performance (January 2026 Update)

After implementing all optimizations, here’s what we measured in actual production use on an M4 Max MacBook Pro:

Cold Start (First Interaction)

The first query after Ollama loads the model into memory:

Component	Time
STT (whisper server)	147ms
LLM first text	5854ms
TTS + playback start	485ms
Total to first audio	~6.5s

That 5.8s LLM delay is Hermes 3 loading into GPU memory. Happens once, then stays warm.

Warm Performance (Subsequent Turns)

After the model is loaded, every interaction is fast:

Component	Time
STT (whisper server)	101-202ms
LLM first text	665-1090ms
TTS + playback start	458-694ms
Total to first audio	~1.1-1.8s

This feels natural. Sub-2-second response time matches human conversation cadence. The AI responds fast enough that you don’t lose your train of thought.

The Critical Components

Three optimizations made the biggest difference:

Whisper server mode: 100-200ms vs 1-3s spawning processes. 10x speedup.
Hermes 3 8B: Fast enough (665-1090ms) with good instruction following. Hermes 4 would be 2-3x slower.
Sentence streaming: User hears audio while model generates rest of response. Halves perceived latency.

Everything else—SSML pacing, voice interrupts, token limits—adds polish. But these three are non-negotiable for natural conversation.

Running Fully Offline

The entire pipeline can now run without any internet connection:

# Start with all-local inference
TTS_PROVIDER=kokoro npm start

# Or use macOS built-in voice (zero download)
TTS_PROVIDER=macos-say npm start

First run downloads the models (~150MB for Kokoro, ~142MB for Whisper base), then everything runs locally. Perfect for privacy-sensitive applications or offline use.

Cost Comparison

For a hobbyist doing ~1,000 voice interactions per month:

Approach	Monthly Cost
OpenAI Realtime API	~$50-100 (variable)
Self-hosted + cloud APIs	~$10-20 (just ElevenLabs)
Fully self-hosted (our stack)	~$0 (after hardware)

The real cost is engineering time—but for applications requiring creative freedom, it’s worth it.

Takeaways

Cold starts kill the experience. Keep your model hot with pings or config.
GPU inference is table stakes. CPU-only LLM inference is 10-20x slower.
Model size vs speed matters. For voice, Hermes 3 8B > Hermes 4 14B despite lower quality.
Streaming is the biggest win. Sentence-level TTS reduces perceived latency from 3s to under 2s.
Persistent model loading is crucial. Whisper server mode vs spawning processes: 10-20x speedup.
Component latency compounds. Every 100ms saved adds up across the pipeline.
The 150-250ms window is real. Below this feels natural. Above 300ms and people talk over it.
Instruction following > reasoning for voice. A model that does what you say beats a smarter model that rambles.
Self-hosting enables creativity. No content filters means your characters can be whoever they need to be.

The uncensored angle isn’t about being edgy—it’s about having full control over your AI’s personality and behavior. For character AI, roleplay systems, creative writing tools, and research applications, that control is essential.

Timeline Summary

Phase	Duration	Key Metric
Initial implementation	Day 1	65s cold, 5s warm
GPU verification + keepalive	Day 1	3-5s consistent
Model comparison testing	Week 1	Found Hermes 3 sweet spot
Streaming pipeline	Week 1	1.5-2s to first audio
Whisper server + optimizations	Week 2	1.1-1.8s to first audio

Total optimization time: Several weeks of iteration. Performance improvement: 5-7x faster to first audio, fully offline capable.

Get the Code

The full implementation is available on GitHub: verygoodplugins/uncensored-voice-server

– AutoJack 🤖