Whisper Local Implementation Guide: High-Accuracy Speech Recognition on CPU Only¶

Fully Offline × No GPU Required × Real-Time — A practical guide to voice input with zero cloud transmission

Key Takeaways¶

Complete private processing — Audio data never leaves your machine. Works in environments where cloud SaaS is prohibited by policy
Fast without a GPU — whisper.cpp + quantized models deliver 4x+ the speed of the original Whisper on CPU alone
Real-time speech recognition — Combined with VAD (Voice Activity Detection), transcription starts while you're still speaking
Windows integration — Insert text into any app with a single hotkey, delivering a low-friction UX similar to Win+H

📖 Overview¶

Whisper in 2025¶

OpenAI's Whisper is widely used as a high-accuracy, multilingual speech recognition model. As of 2025, the Whisper ecosystem has evolved significantly, and there are now several options that run at practical speeds on CPU only.

Runtime	Language	Characteristics	Recommended Use
whisper.cpp	C/C++	GGML quantization, lightest footprint	Embedded in desktop apps
faster-whisper	Python (CTranslate2)	int8 quantization, strong at batch processing	Server-side / scripts
Sherpa-ONNX	C++/Python/C#	Supports SenseVoice & Moonshine, multilingual	When multi-model switching is needed

This article covers best practices for local Whisper implementation based on real-world experience building a production-quality Windows voice input app.

Model Selection Guide¶

Model	Parameters (approx. file size)	Japanese Accuracy	CPU Inference (3s audio)	Recommended For
tiny	39M (~75 MB)	△	~0.5s	Prototyping
base	74M (~142 MB)	△–○	~0.8s	Low-spec devices
small	244M (~466 MB)	○	~1.5s	Balanced
medium (q5 quantized)	769M (~500 MB)	◎	~2.0s	Recommended (CPU sweet spot)
large-v3-turbo	809M (~1.6 GB)	◎◎	~3.0s	Accuracy-first
large-v3 (q5 quantized)	1,550M (~1.1 GB)	◎◎	~3.5s	Maximum accuracy

Field knowledge

The q5-quantized medium model offers the best cost-performance ratio on CPU. In our experience, Japanese recognition accuracy stays above 90%, and a 3-second clip finishes inference in about 2 seconds on a 4-core CPU.
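
One way to compare the rows in the table is the real-time factor (RTF): inference time divided by audio duration, where values below 1.0 mean the engine keeps up with speech. The sketch below computes it from the approximate timings listed in the table. The function name and the dictionary are illustrative, and the numbers are the table's figures rather than new measurements.

```python
def real_time_factor(inference_s: float, audio_s: float) -> float:
    """Inference time divided by audio duration; below 1.0 keeps up with speech."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return inference_s / audio_s


# Approximate CPU timings for a 3-second clip, copied from the table
TABLE_TIMINGS_S = {
    "tiny": 0.5,
    "base": 0.8,
    "small": 1.5,
    "medium-q5": 2.0,
    "large-v3-turbo": 3.0,
    "large-v3-q5": 3.5,
}

for name, seconds in TABLE_TIMINGS_S.items():
    rtf = real_time_factor(seconds, 3.0)
    verdict = "keeps up" if rtf < 1.0 else "no headroom"
    print(f"{name:15s} RTF={rtf:.2f} ({verdict})")
```

Read this way, medium-q5 leaves roughly a third of each second free for capture and VAD, while the large models have no margin on the same hardware.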

🔧 Implementation¶

Step 1: Environment Setup¶

Python Environment (for faster-whisper)¶

# Python 3.10+ recommended
python -m venv whisper-env
source whisper-env/bin/activate  # Windows: whisper-env\Scripts\activate

# Core packages
pip install faster-whisper numpy sounddevice

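# FFmpeg (for converting audio files yourself, e.g. to 16kHz mono WAV;
# faster-whisper decodes files through PyAV and does not need a system FFmpeg)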
# Windows: winget install Gyan.FFmpeg
# macOS:   brew install ffmpeg
# Linux:   sudo apt install ffmpeg

.NET Environment (for whisper.cpp / Whisper.net)¶

For embedding in a desktop app, the C# binding Whisper.net is the leading option.

<!-- NuGet packages -->
<PackageReference Include="Whisper.net" Version="1.9.0" />
<PackageReference Include="Whisper.net.Runtime" Version="1.9.0" />
<!-- Only if GPU support is needed -->
<PackageReference Include="Whisper.net.Runtime.Cuda.Windows" Version="1.9.0" />

When to use which

faster-whisper is suited for Python scripts and server-side use. Whisper.net is the better choice for embedding in Windows desktop apps.

Step 2: Basic Speech Recognition¶

Python (faster-whisper)¶

from faster_whisper import WhisperModel

# Load model (downloads on first run)
model = WhisperModel(
    "medium",               # Model size
    device="cpu",           # CPU only
    compute_type="int8",    # Quantization for speed
    cpu_threads=4,          # Thread count (match physical cores)
)

# Transcribe an audio file
segments, info = model.transcribe(
    "audio.wav",
    language="ja",
    beam_size=5,
    vad_filter=True,          # Skip silent segments automatically
    vad_parameters=dict(
        min_silence_duration_ms=600,  # Split on 600ms+ silence
    ),
)

print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")

C# (Whisper.net)¶

using Whisper.net;

// Specify path to a pre-downloaded model file
var modelPath = @"C:\models\ggml-medium-q5_0.bin";

using var factory = WhisperFactory.FromPath(modelPath,
    new WhisperFactoryOptions { UseGpu = false });

using var processor = factory.CreateBuilder()
    .WithLanguage("ja")
    .WithThreads(4)
    .WithSegmentEventHandler(e =>
    {
        Console.WriteLine($"[{e.Start} - {e.End}] {e.Text}");
    })
    .Build();

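// Process 16kHz mono float32 PCM data.
// LoadAudioAsFloat32 is a placeholder for your own loader (it is not part of Whisper.net);
// it must return float[] samples in the range [-1, 1] at 16kHz mono.
var audioData = LoadAudioAsFloat32("audio.wav");
processor.Process(audioData);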

Audio format requirements

whisper.cpp expects 16kHz, mono, float32 PCM data. When converting from WAV files, pay close attention to the sample rate and channel count.
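
As a rough illustration of that format requirement, the sketch below converts int16 PCM (mono or interleaved multi-channel) at an arbitrary sample rate into 16 kHz mono float32 using only NumPy. The function name `to_whisper_input` is made up for this example, and the linear-interpolation resampling is a simplification: it does not low-pass filter, so a proper resampler is preferable for production audio.

```python
import numpy as np

TARGET_RATE = 16_000  # Sample rate whisper.cpp expects


def to_whisper_input(pcm: np.ndarray, source_rate: int) -> np.ndarray:
    """Convert int16 PCM shaped (frames,) or (frames, channels) to 16 kHz mono float32."""
    audio = pcm.astype(np.float32) / 32768.0  # int16 range -> [-1.0, 1.0)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # Downmix by averaging the channels
    if source_rate != TARGET_RATE:
        n_out = int(round(len(audio) * TARGET_RATE / source_rate))
        src_t = np.arange(len(audio)) / source_rate
        dst_t = np.arange(n_out) / TARGET_RATE
        # Linear interpolation only; no anti-aliasing filter is applied
        audio = np.interp(dst_t, src_t, audio)
    return np.ascontiguousarray(audio, dtype=np.float32)


# One second of 44.1 kHz stereo silence becomes 16,000 mono samples
stereo = np.zeros((44_100, 2), dtype=np.int16)
converted = to_whisper_input(stereo, 44_100)
print(converted.shape, converted.dtype)
```

If FFmpeg is installed, `ffmpeg -i input.wav -ar 16000 -ac 1 output.wav` performs the same conversion with proper filtering.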

Step 3: Real-Time Speech Recognition¶

Production-quality real-time recognition requires three components:

Audio capture — Continuous mic input
VAD (Voice Activity Detection) — Distinguish speech from silence
Streaming inference — Incrementally transcribe accumulated audio

Architecture Overview¶

Mic input (16kHz)
  │
  ▼
Audio capture ──── Chunk splitting (200-320ms)
  │
  ▼
VAD ───────────── Silence → speech-end trigger
  │
  ▼
ASR engine ─────── Partial result (real-time preview)
  │                 Final result (confirmed at speech end)
  ▼
Text output ────── Insert into app or display
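
The flow in the diagram can be sketched as a small generator that emits partial results while speech continues and one final result once enough silent chunks have accumulated. Everything here is illustrative: `run_pipeline`, `is_speech`, and `transcribe` are hypothetical names, and the stand-in callables below replace a real VAD and ASR engine. Re-transcribing the whole buffer on every chunk is the simplest way to produce a preview, though a real app would likely throttle it.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

Chunk = TypeVar("Chunk")


def run_pipeline(
    chunks: Iterable[Chunk],
    is_speech: Callable[[Chunk], bool],
    transcribe: Callable[[List[Chunk]], str],
    silence_chunks: int = 3,
) -> Iterator[Tuple[str, str]]:
    """Yield ("partial", text) during speech and ("final", text) at speech end."""
    buffer: List[Chunk] = []
    silent = 0
    for chunk in chunks:
        if is_speech(chunk):
            buffer.append(chunk)
            silent = 0
            yield ("partial", transcribe(buffer))  # Real-time preview
        elif buffer:
            buffer.append(chunk)  # Keep trailing silence with the utterance
            silent += 1
            if silent >= silence_chunks:
                yield ("final", transcribe(buffer))  # Confirmed at speech end
                buffer = []
                silent = 0


# Stand-ins: strings play the role of audio chunks, "" is silence
demo_chunks = ["he", "llo", "", "", ""]
for kind, text in run_pipeline(demo_chunks, lambda c: c != "", "".join, silence_chunks=2):
    print(kind, text)
```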

Python: Real-Time Recognition¶

import math
import queue

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

class RealtimeRecognizer:
    """Real-time speech recognition with VAD"""

    def __init__(self, model_size="medium", language="ja"):
        self.model = WhisperModel(model_size, device="cpu", compute_type="int8")
        self.language = language
        self.sample_rate = 16000
        self.chunk_duration = 0.3      # 300ms chunks
        self.silence_threshold = 0.015  # RMS energy threshold
        self.silence_duration = 0.8     # Finalize after at least 800ms silence
        self._audio_buffer = []
        self._silence_frames = 0
        self._is_speaking = False
        self._utterances = queue.Queue()  # Finished utterances awaiting inference

    def _calculate_rms(self, audio: np.ndarray) -> float:
        return float(np.sqrt(np.mean(audio ** 2)))

    def _is_speech(self, audio: np.ndarray) -> bool:
        return self._calculate_rms(audio) >= self.silence_threshold

    def _process_audio(self, audio_data: np.ndarray) -> str | None:
        if len(audio_data) < self.sample_rate * 0.5:  # Ignore < 0.5s
            return None
        segments, _ = self.model.transcribe(
            audio_data, language=self.language, beam_size=5, vad_filter=False,
        )
        texts = [s.text.strip() for s in segments if s.text.strip()]
        return "".join(texts) if texts else None

    def start(self, callback):
        chunk_samples = int(self.sample_rate * self.chunk_duration)
        # Round up so the pause is never shorter than silence_duration (0.8s -> 3 chunks)
        silence_chunks = math.ceil(self.silence_duration / self.chunk_duration)

        def audio_callback(indata, frames, time_info, status):
            # Runs on the audio thread: only buffer here, never run inference,
            # otherwise capture stalls and input frames are dropped
            audio = indata[:, 0].copy()  # Mono
            if self._is_speech(audio):
                self._audio_buffer.append(audio)
                self._silence_frames = 0
                self._is_speaking = True
            elif self._is_speaking:
                self._silence_frames += 1
                self._audio_buffer.append(audio)
                if self._silence_frames >= silence_chunks:
                    self._utterances.put(np.concatenate(self._audio_buffer))
                    self._audio_buffer = []
                    self._silence_frames = 0
                    self._is_speaking = False

        with sd.InputStream(
            samplerate=self.sample_rate, channels=1, dtype="float32",
            blocksize=chunk_samples, callback=audio_callback,
        ):
            print("🎙 Recording... press Ctrl+C to stop")
            try:
                while True:
                    try:
                        # The timeout keeps Ctrl+C responsive while waiting
                        audio = self._utterances.get(timeout=0.5)
                    except queue.Empty:
                        continue
                    result = self._process_audio(audio)  # Inference on the main thread
                    if result:
                        callback(result)
            except KeyboardInterrupt:
                print("Stopped")

# Usage
recognizer = RealtimeRecognizer()
recognizer.start(lambda text: print(f"Result: {text}"))

C#: Thread-Safe ASR Engine¶

using System.Buffers;
using Whisper.net;

public sealed class WhisperAsrEngine : IDisposable
{
    private readonly object _gate = new();
    private WhisperFactory? _factory;
    private WhisperProcessor? _processor;
    private float[]? _buffer;
    private int _bufferPos;
    private string? _lastFinal;
    private bool _disposed;

    private const int SampleRate = 16000;
    private static readonly int MaxSamples = SampleRate * 120;

    public void Start(string modelPath, string language = "ja", int threads = 4)
    {
        lock (_gate)
        {
            _factory = WhisperFactory.FromPath(modelPath,
                new WhisperFactoryOptions { UseGpu = false });
            _processor = _factory.CreateBuilder()
                .WithLanguage(language).WithThreads(threads)
                .WithSegmentEventHandler(OnSegment).Build();
            _buffer = ArrayPool<float>.Shared.Rent(MaxSamples);
            _bufferPos = 0;
        }
    }

    public void PushAudio(ReadOnlySpan<float> samples)
    {
        lock (_gate)
        {
            if (_buffer is null) return;
            var count = Math.Min(samples.Length, MaxSamples - _bufferPos);
            if (count <= 0) return;
            samples[..count].CopyTo(_buffer.AsSpan(_bufferPos));
            _bufferPos += count;
        }
    }

    public string? GetFinalAndReset()
    {
        lock (_gate)
        {
            if (_processor is null || _buffer is null || _bufferPos == 0) return null;
            _lastFinal = null;
            var audio = new float[_bufferPos];
            Array.Copy(_buffer, audio, _bufferPos);
            _processor.Process(audio);
            _bufferPos = 0;
            return _lastFinal;
        }
    }

    private void OnSegment(SegmentData e)
    {
        var text = e.Text?.Trim();
        if (string.IsNullOrEmpty(text)) return;
        _lastFinal = (_lastFinal is null) ? text : _lastFinal + text;
    }

    public void Dispose()
    {
        lock (_gate)
        {
            if (_disposed) return;
            _disposed = true;
            _processor?.Dispose();
            _factory?.Dispose();
            if (_buffer is not null)
            {
                ArrayPool<float>.Shared.Return(_buffer);
                _buffer = null;
            }
        }
    }
}

Design points

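ArrayPool<float>.Shared minimizes GC pressure while holding up to 120 seconds of audio efficiently. lock prevents race conditions between the recording and inference threads. PushAudio is kept to a lightweight copy operation; heavy inference is centralized in GetFinalAndReset. Two consequences follow from this design: because GetFinalAndReset holds the lock while inference runs, PushAudio calls made during that time block until it returns, and samples pushed after the 120-second buffer fills are silently dropped.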

Step 4: VAD (Voice Activity Detection)¶

Energy-Based VAD (Lightweight and Practical)¶

import numpy as np

class EnergyVAD:
    """Speech detection via RMS energy and an envelope follower.
    The envelope follower smooths out instantaneous noise spikes."""

    def __init__(self, threshold=0.015, attack=0.2, release=0.05):
        self.threshold = threshold
        self.attack = attack
        self.release = release
        self.envelope = 0.0

    def is_speech(self, frame: np.ndarray) -> bool:
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms > self.envelope:
            self.envelope += self.attack * (rms - self.envelope)
        else:
            self.envelope += self.release * (rms - self.envelope)
        return self.envelope >= self.threshold

public sealed class SimpleEnergyVad
{
    private readonly double _threshold;
    private readonly double _attack;
    private readonly double _release;
    private double _envelope;

    public SimpleEnergyVad(double threshold = 0.015, double attack = 0.2, double release = 0.05)
    {
        _threshold = threshold;
        _attack = Math.Clamp(attack, 0, 1);
        _release = Math.Clamp(release, 0, 1);
    }

    public bool IsSpeech(ReadOnlySpan<float> frame, int samples)
    {
        // Never read past the span, and average over the samples actually summed
        int count = Math.Min(samples, frame.Length);
        if (count <= 0) return false;
        double sum = 0;
        for (int i = 0; i < count; i++)
            sum += (double)frame[i] * frame[i];
        double rms = Math.Sqrt(sum / count);
        _envelope = rms > _envelope
            ? _envelope + _attack * (rms - _envelope)
            : _envelope + _release * (rms - _envelope);
        return _envelope >= _threshold;
    }
}

Silence Tracking and Speech-End Detection¶

import math

class SilenceTracker:
    def __init__(self, silence_threshold_ms=800, frame_duration_ms=300):
        # Whole frames only: 800ms at 300ms per frame rounds up to 3 frames (900ms)
        self.max_silent_frames = math.ceil(silence_threshold_ms / frame_duration_ms)
        self.silent_frame_count = 0

    def update(self, is_speech: bool) -> bool:
        """Returns True when speech has ended."""
        if is_speech:
            self.silent_frame_count = 0
            return False
        self.silent_frame_count += 1
        return self.silent_frame_count >= self.max_silent_frames

    def reset(self):
        self.silent_frame_count = 0

Tuning the silence threshold

600–900ms is practical. Too short and sentences get cut mid-way; too long and responsiveness suffers. Make this configurable in a YAML settings file for easy tuning.
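
When tuning, it helps to remember that silence is counted in whole frames, so the configured threshold is effectively rounded up to a multiple of the frame length. The helper below is a small sketch of that arithmetic; `effective_silence_ms` is a name chosen for this example, and the frame sizes tried are the 240 ms and 300 ms values that appear elsewhere in this guide.

```python
def effective_silence_ms(silence_ms: int, frame_ms: int) -> int:
    """Silence actually required before speech end fires, counting whole frames."""
    if silence_ms <= 0 or frame_ms <= 0:
        raise ValueError("silence_ms and frame_ms must be positive")
    frames = -(-silence_ms // frame_ms)  # Integer ceiling division
    return frames * frame_ms


for silence_ms in (600, 800, 900):
    for frame_ms in (240, 300):
        actual = effective_silence_ms(silence_ms, frame_ms)
        print(f"configured={silence_ms}ms frame={frame_ms}ms -> effective={actual}ms")
```

With 300 ms frames, for instance, settings of 800 ms and 900 ms behave identically, so finer tuning needs a shorter frame.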

Step 5: Windows Integration — Hotkeys & Text Insertion¶

Hotkey Registration¶

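import ctypes
from ctypes import wintypes

MOD_ALT, MOD_CONTROL = 0x0001, 0x0002
VK_V, WM_HOTKEY, HOTKEY_ID = 0x56, 0x0312, 1

user32 = ctypes.windll.user32
if not user32.RegisterHotKey(None, HOTKEY_ID, MOD_CONTROL | MOD_ALT, VK_V):
    raise ctypes.WinError()  # Fails if the combination is already registered

# WM_HOTKEY is posted to the registering thread, so that thread must pump messages
msg = wintypes.MSG()
while user32.GetMessageW(ctypes.byref(msg), None, 0, 0) > 0:
    if msg.message == WM_HOTKEY and msg.wParam == HOTKEY_ID:
        print("Hotkey pressed")  # Start or stop recording here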

Clipboard-Based Text Insertion¶

The most compatible way to insert recognized text into any app is via clipboard paste.

import pyperclip
import keyboard
import time

def commit_text(text: str, restore_delay: float = 1.5):
    """Insert recognized text into the active app via clipboard.
    Original clipboard contents are automatically restored."""
    original = pyperclip.paste()
    pyperclip.copy(text)
    keyboard.send("ctrl+v")
    time.sleep(restore_delay)
    pyperclip.copy(original)

Production caveats

If restoration happens too early, the clipboard gets overwritten before the paste completes
Some apps (e.g., Remote Desktop) require falling back to SendInput
Filter out self-generated Ctrl+V events in the hotkey hook using the injected flag
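
One possible refinement of the restore step is to put the original contents back only if the clipboard still holds the text that was just inserted, so that something the user copied during the delay is not overwritten. The sketch below takes the clipboard and key-sending functions as parameters so it can be exercised without Windows; `commit_text_safely` and its parameter names are hypothetical, not part of pyperclip or keyboard.

```python
import time
from typing import Callable


def commit_text_safely(
    text: str,
    copy: Callable[[str], None],
    paste: Callable[[], str],
    send_paste: Callable[[], None],
    restore_delay: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Paste text via the clipboard; return True if the original contents were restored."""
    original = paste()
    copy(text)
    send_paste()
    sleep(restore_delay)  # Give the target app time to read the clipboard
    if paste() != text:   # The clipboard changed while we waited
        return False      # Leave the newer contents alone
    copy(original)
    return True


# Demo with an in-memory stand-in for the clipboard
fake_clipboard = {"value": "previous contents"}
restored = commit_text_safely(
    "recognized text",
    copy=lambda s: fake_clipboard.update(value=s),
    paste=lambda: fake_clipboard["value"],
    send_paste=lambda: None,
    restore_delay=0.0,
)
print(restored, fake_clipboard["value"])
```

In the real app the arguments would be `pyperclip.copy`, `pyperclip.paste`, and a function that calls `keyboard.send("ctrl+v")`.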

💡 Best Practices¶

1. CPU Optimization¶

# localvoice.yaml (settings file example)
asr:
  engine: "whispercpp"
  model_path: "C:\\ProgramData\\LocalVoice\\models\\ggml-medium-q5_0.bin"
  threads: 4          # Use physical core count (not logical)
  use_gpu: false
  frame_ms: 240

CPU	Physical Cores	Recommended Threads	medium-q5 Inference (3s audio)
Core i5-1235U	2P+8E	4	~2.5s
Core i7-13700	8P+8E	8	~1.2s
Ryzen 5 5600	6	6	~1.8s
Ryzen 7 7800X3D	8	8	~1.0s

Keep threads at or below physical core count

Setting it to the logical core count (including HT/SMT) can actually make inference slower due to context-switching overhead.
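
A starting value for `threads` can be derived at startup rather than hard-coded. The sketch below prefers the physical core count and falls back to half the logical count, which assumes 2-way SMT. `recommended_threads` and the cap of 8 are assumptions for illustration (8 is simply the largest value in the table), and `psutil` is an optional third-party package. On hybrid CPUs the physical count includes efficiency cores, so the result is only a first guess to benchmark against.

```python
import os
from typing import Optional


def recommended_threads(physical: Optional[int], logical: Optional[int], cap: int = 8) -> int:
    """Prefer physical cores; otherwise assume 2-way SMT and halve the logical count."""
    if physical and physical > 0:
        cores = physical
    elif logical and logical > 0:
        cores = max(1, logical // 2)
    else:
        cores = 1
    return min(cores, cap)


try:
    import psutil  # Optional; os.cpu_count() alone only reports logical cores
    physical_cores = psutil.cpu_count(logical=False)
except ImportError:
    physical_cores = None

print(recommended_threads(physical_cores, os.cpu_count()))
```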

2. Improving Recognition Accuracy¶

Initial Prompt (Context Hint)¶

segments, _ = model.transcribe(
    audio,
    language="ja",
    # Domain terms steer the decoder; a prompt written in the spoken language usually works best
    initial_prompt="Technical discussion about Kubernetes, Docker, and CI/CD pipelines.",
)

VAD Filter¶

segments, _ = model.transcribe(
    audio, language="ja",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=600,
        speech_pad_ms=200,
        threshold=0.5,
    ),
)

3. Externalize Configuration¶

# localvoice.yaml
hotkey: "Ctrl+Alt+V"
mode: "hold_to_talk"
language: "ja"

asr:
  engine: "whispercpp"
  model_path: "models/ggml-medium-q5_0.bin"
  threads: 4
  use_gpu: false

vad:
  silence_ms: 800
  energy_threshold: 0.015

commit:
  mode: "clipboard"
  restore_clipboard_ms: 1500

privacy:
  keep_audio: false
  keep_text_after_commit: false
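
Once the file is parsed, it is worth validating values before they reach the audio loop. The sketch below handles only the `vad` section and works on an already-parsed mapping, so it has no YAML dependency. `VadSettings`, `parse_vad`, and the accepted ranges are assumptions made for this example; with PyYAML installed, the mapping would typically come from `yaml.safe_load`.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class VadSettings:
    silence_ms: int = 800
    energy_threshold: float = 0.015


def parse_vad(raw: Mapping[str, Any]) -> VadSettings:
    """Read the vad section of a parsed settings mapping, falling back to defaults."""
    section = raw.get("vad") or {}
    settings = VadSettings(
        silence_ms=int(section.get("silence_ms", VadSettings.silence_ms)),
        energy_threshold=float(section.get("energy_threshold", VadSettings.energy_threshold)),
    )
    # Illustrative sanity ranges, not limits imposed by any library
    if not 100 <= settings.silence_ms <= 5000:
        raise ValueError(f"vad.silence_ms out of range: {settings.silence_ms}")
    if not 0.0 < settings.energy_threshold < 1.0:
        raise ValueError(f"vad.energy_threshold out of range: {settings.energy_threshold}")
    return settings


print(parse_vad({"vad": {"silence_ms": 600}}))
```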

🚀 Advanced Usage¶

Multi-Engine Support: Sherpa-ONNX¶

using SherpaOnnx;

var config = new OfflineRecognizerConfig();
config.ModelConfig.Tokens = @"models\sensevoice\tokens.txt";
config.ModelConfig.NumThreads = 4;
config.ModelConfig.Provider = "cpu";
config.ModelConfig.SenseVoice.Model = @"models\sensevoice\model.int8.onnx";
config.ModelConfig.SenseVoice.Language = "ja";
config.ModelConfig.SenseVoice.UseInverseTextNormalization = 1;

using var recognizer = new OfflineRecognizer(config);
using var stream = recognizer.CreateStream();
stream.AcceptWaveform(16000, audioSamples);
recognizer.Decode(stream);
Console.WriteLine(stream.Result.Text);

SenseVoice vs Whisper

SenseVoice is a lightweight model covering Chinese, Cantonese, English, Japanese, and Korean. It can be faster than Whisper for short utterances and includes built-in inverse text normalization (natural formatting of numbers and dates).
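
If both engines are kept available, the choice can be driven by the `engine` setting through a small registry. The sketch below shows the idea with stub transcribers in place of real model loading; `register_engine`, `create_engine`, and the engine names are hypothetical and not part of whisper.cpp or Sherpa-ONNX.

```python
from typing import Callable, Dict, Sequence

Transcriber = Callable[[Sequence[float]], str]  # 16 kHz mono samples -> text
EngineFactory = Callable[[], Transcriber]

_ENGINES: Dict[str, EngineFactory] = {}


def register_engine(name: str, factory: EngineFactory) -> None:
    """Associate a config value such as "whispercpp" with a loader."""
    _ENGINES[name] = factory


def create_engine(name: str) -> Transcriber:
    """Build the engine named in the settings file, or fail with the known names."""
    factory = _ENGINES.get(name)
    if factory is None:
        known = ", ".join(sorted(_ENGINES)) or "none"
        raise ValueError(f"Unknown ASR engine {name!r}; registered: {known}")
    return factory()


# Stub factories stand in for loading real models
register_engine("whispercpp", lambda: lambda samples: f"[whisper.cpp stub: {len(samples)} samples]")
register_engine("sensevoice", lambda: lambda samples: f"[SenseVoice stub: {len(samples)} samples]")

engine = create_engine("whispercpp")
print(engine([0.0] * 16_000))
```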

Batch Processing: Bulk File Conversion¶

from pathlib import Path
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")

for audio_file in Path("./recordings").glob("*.wav"):
    segments, _ = model.transcribe(str(audio_file), language="ja", vad_filter=True)
    text = "".join(s.text for s in segments)
    audio_file.with_suffix(".txt").write_text(text, encoding="utf-8")
    print(f"✅ {audio_file.name} → {audio_file.stem}.txt")

⚠️ Troubleshooting¶

Problem	Cause	Solution
`ModuleNotFoundError: No module named 'faster_whisper'`	Not installed	`pip install faster-whisper`
`FileNotFoundError: ffmpeg`	FFmpeg not installed	`winget install Gyan.FFmpeg` → add to PATH
Empty result / hallucination	Silent input	Enable VAD filter, adjust `silence_threshold`
Out of memory (OOM)	Model too large	Switch to `small`, or use quantized model (q5/q8)
Slow inference	Wrong thread count	Check physical cores: `wmic cpu get NumberOfCores` (where `wmic` is unavailable, use PowerShell: `(Get-CimInstance Win32_Processor).NumberOfCores`)
Garbled language output	Wrong auto-detected language	Set `language="ja"` explicitly
Clipboard restore fails	Restore timing too fast	Increase `restore_clipboard_ms` to 2000+

NumPy Compatibility¶

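If an import fails with a message that a module compiled against NumPy 1.x cannot run on NumPy 2.x, pin NumPy below 2.0:

pip install "numpy<2.0"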

Hallucination Mitigation¶

Whisper sometimes outputs nonsense text in response to silence or ambient noise. Counter-measures:

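def filter_hallucination(text: str) -> str | None:
    stripped = text.strip() if text else ""
    if len(stripped) < 2:
        return None
    # Repeated-character pattern. The length guard (6 is an arbitrary cutoff)
    # keeps legitimate short replies such as "はい" from being discarded.
    if len(stripped) >= 6 and len(set(stripped)) <= 2:
        return None
    # Stock phrases Whisper tends to emit for silence or noise
    hallucination_patterns = [
        "Thank you for watching", "Subscribe", "Subtitles",
        "ご視聴ありがとうございました",
    ]
    if any(p in stripped for p in hallucination_patterns):
        return None
    return text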

Priority order:

1. Use VAD: don't feed silent segments to the ASR
2. Ignore audio shorter than 0.5 seconds
3. Skip chunks where RMS is below threshold
4. Discard results that are very short or match repetition patterns
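
The second and third items in that list can be combined into one cheap check that runs before any inference. The sketch below is one way to write it; `should_transcribe` is a name invented for this example, and the defaults mirror the 0.5-second and 0.015 RMS values used earlier in this guide.

```python
import numpy as np


def should_transcribe(
    audio: np.ndarray,
    sample_rate: int = 16_000,
    min_seconds: float = 0.5,
    rms_threshold: float = 0.015,
) -> bool:
    """Pre-inference gate: reject clips that are too short or too quiet."""
    if audio.size < sample_rate * min_seconds:
        return False
    # Accumulate in float64 so long clips do not lose precision
    rms = float(np.sqrt(np.mean(audio.astype(np.float64) ** 2)))
    return rms >= rms_threshold


silence = np.zeros(16_000, dtype=np.float32)
speech_like = np.full(16_000, 0.1, dtype=np.float32)
print(should_transcribe(silence), should_transcribe(speech_like))
```

Clips that pass this gate still go through the text-level filter afterwards, since loud non-speech noise can produce hallucinated output too.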

Summary¶

Component	Recommended Setup
Runtime	whisper.cpp (desktop) / faster-whisper (scripts)
Model	medium-q5 (balanced) / large-v3-turbo (accuracy-first)
VAD	Energy-based (lightweight) + silence tracker
Text insertion	Clipboard paste (maximum compatibility)
Configuration	External YAML file
Privacy	No audio/text persistence

For enterprise use

Fully offline voice input delivers the most value in security-sensitive environments. For MSI packaging, organization policy-based config locking, and Windows Event Log audit trails, see requirements.md in the project repository.
