
Whisper Local Implementation Guide: High-Accuracy Speech Recognition on CPU Only

Fully Offline × No GPU Required × Real-Time — A practical guide to voice input with zero cloud transmission

Key Takeaways

  • Complete private processing — Audio data never leaves your machine. Works in environments where cloud SaaS is prohibited by policy
  • Fast without a GPU — whisper.cpp + quantized models deliver 4x+ the speed of the original Whisper on CPU alone
  • Real-time speech recognition — Combined with VAD (Voice Activity Detection), transcription starts while you're still speaking
  • Windows integration — Insert text into any app with a single hotkey, delivering a low-friction UX similar to Win+H

📖 Overview

Whisper in 2025

OpenAI's Whisper is widely used as a high-accuracy, multilingual speech recognition model. As of 2025, the Whisper ecosystem has evolved significantly, and there are now several options that run at practical speeds on CPU only.

| Runtime | Language | Characteristics | Recommended Use |
| --- | --- | --- | --- |
| whisper.cpp | C/C++ | GGML quantized models (e.g., q5_0), lightest footprint | Embedded in desktop apps |
| faster-whisper | Python (CTranslate2) | int8 quantization, strong at batch processing | Server-side / scripts |
| Sherpa-ONNX | C++/Python/C# | Supports SenseVoice & Moonshine, multilingual | When multi-model switching is needed |

This article covers best practices for local Whisper implementation based on real-world experience building a production-quality Windows voice input app.

Model Selection Guide

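| Model | Size | Japanese Accuracy | CPU Inference (3s audio) | Recommended For |
| --- | --- | --- | --- | --- |
| tiny | 39M params | | ~0.5s | Prototyping |
| base | 74M params | △–○ | ~0.8s | Low-spec devices |
| small | 244M params | | ~1.5s | Balanced |
| medium (q5 quantized) | ~500 MB file | | ~2.0s | Recommended (CPU sweet spot) |
| large-v3-turbo | 809M params | ◎◎ | ~3.0s | Accuracy-first |
| large-v3 (q5 quantized) | ~1.1 GB file | ◎◎ | ~3.5s | Maximum accuracy |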

Field knowledge

The q5-quantized medium model offers the best cost-performance ratio on CPU. Japanese recognition accuracy stays above 90% while completing inference within 2 seconds on a 4-core CPU.
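
One way to compare the models is the real-time factor (RTF): inference time divided by audio duration. The sketch below only restates the approximate 3-second timings from the model table; the function name and the "keeps up" wording are illustrative, not part of any library.

```python
def real_time_factor(inference_s: float, audio_s: float) -> float:
    """RTF below 1.0 means transcription finishes faster than the audio plays."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return inference_s / audio_s


# Approximate CPU timings for 3 s of audio, copied from the model table
timings_s = {
    "tiny": 0.5,
    "base": 0.8,
    "small": 1.5,
    "medium-q5": 2.0,
    "large-v3-turbo": 3.0,
    "large-v3-q5": 3.5,
}

for name, seconds in timings_s.items():
    rtf = real_time_factor(seconds, 3.0)
    verdict = "keeps up" if rtf < 1.0 else "no headroom"
    print(f"{name:15s} RTF={rtf:.2f} ({verdict})")
```

An RTF near or above 1.0 means utterances queue up faster than they are transcribed, which is why medium-q5 is the practical ceiling for live dictation on these figures.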


🔧 Implementation

Step 1: Environment Setup

Python Environment (for faster-whisper)

# Python 3.10+ recommended
python -m venv whisper-env
source whisper-env/bin/activate  # Windows: whisper-env\Scripts\activate

# Core packages
pip install faster-whisper numpy sounddevice

# FFmpeg (required for audio file conversion)
# Windows: winget install Gyan.FFmpeg
# macOS:   brew install ffmpeg
# Linux:   sudo apt install ffmpeg

.NET Environment (for whisper.cpp / Whisper.net)

For embedding in a desktop app, the C# binding Whisper.net is the leading option.

<!-- NuGet packages -->
<PackageReference Include="Whisper.net" Version="1.9.0" />
<PackageReference Include="Whisper.net.Runtime" Version="1.9.0" />
<!-- Only if GPU support is needed -->
<PackageReference Include="Whisper.net.Runtime.Cuda.Windows" Version="1.9.0" />

When to use which

faster-whisper is suited for Python scripts and server-side use. Whisper.net is the better choice for embedding in Windows desktop apps.


Step 2: Basic Speech Recognition

Python (faster-whisper)

from faster_whisper import WhisperModel

# Load model (downloads on first run)
model = WhisperModel(
    "medium",               # Model size
    device="cpu",           # CPU only
    compute_type="int8",    # Quantization for speed
    cpu_threads=4,          # Thread count (match physical cores)
)

# Transcribe an audio file
segments, info = model.transcribe(
    "audio.wav",
    language="ja",
    beam_size=5,
    vad_filter=True,          # Skip silent segments automatically
    vad_parameters=dict(
        min_silence_duration_ms=600,  # Split on 600ms+ silence
    ),
)

print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")

C# (Whisper.net)

using Whisper.net;

// Specify path to a pre-downloaded model file
var modelPath = @"C:\models\ggml-medium-q5_0.bin";

using var factory = WhisperFactory.FromPath(modelPath,
    new WhisperFactoryOptions { UseGpu = false });

using var processor = factory.CreateBuilder()
    .WithLanguage("ja")
    .WithThreads(4)
    .WithSegmentEventHandler(e =>
    {
        Console.WriteLine($"[{e.Start} - {e.End}] {e.Text}");
    })
    .Build();

// Process 16kHz mono float32 PCM data
// (LoadAudioAsFloat32 is an app-side helper you supply; it is not part of Whisper.net)
var audioData = LoadAudioAsFloat32("audio.wav");
processor.Process(audioData);

Audio format requirements

whisper.cpp expects 16kHz, mono, float32 PCM data. When converting from WAV files, pay close attention to the sample rate and channel count.
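
The format check can be sketched with the Python standard library alone. This is a minimal example, assuming 16-bit PCM WAV input; the function name `load_wav_as_float32` is hypothetical, and resampling is deliberately left to FFmpeg rather than implemented here.

```python
import array
import sys
import wave


def load_wav_as_float32(path: str, expected_rate: int = 16000) -> array.array:
    """Read a 16-bit PCM WAV and return mono float32 samples in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        width = wf.getsampwidth()
        raw = wf.readframes(wf.getnframes())

    if width != 2:
        raise ValueError(f"expected 16-bit PCM, got {width * 8}-bit")
    if rate != expected_rate:
        # Convert beforehand, e.g.: ffmpeg -i in.wav -ar 16000 -ac 1 out.wav
        raise ValueError(f"expected {expected_rate} Hz, got {rate} Hz")

    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    # Average the interleaved channels down to mono, then scale int16 to float
    mono = array.array("f")
    for i in range(0, len(pcm) - channels + 1, channels):
        frame = pcm[i:i + channels]
        mono.append(sum(frame) / (channels * 32768.0))
    return mono
```

Failing loudly on an unexpected sample rate is intentional: feeding 44.1 kHz audio to the model as if it were 16 kHz produces garbage output rather than an error.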


Step 3: Real-Time Speech Recognition

Production-quality real-time recognition requires three components:

  1. Audio capture — Continuous mic input
  2. VAD (Voice Activity Detection) — Distinguish speech from silence
  3. Streaming inference — Incrementally transcribe accumulated audio

Architecture Overview

Mic input (16kHz)
  │
  ▼
Audio capture ──── Chunk splitting (200-320ms)
  │
  ▼
VAD ───────────── Silence → speech-end trigger
  │
  ▼
ASR engine ─────── Partial result (real-time preview)
  │                 Final result (confirmed at speech end)
  ▼
Text output ────── Insert into app or display
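
The chunk-splitting stage in the diagram comes down to simple sample arithmetic. A small sketch, assuming 16 kHz mono input; both function names are illustrative.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16000  # Hz, matching the mic input in the diagram


def chunk_samples(chunk_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples in one chunk of the given duration."""
    return sample_rate * chunk_ms // 1000


def split_into_chunks(samples: Sequence[float], chunk_ms: int) -> Iterator[Sequence[float]]:
    """Yield fixed-size chunks; a shorter trailing remainder is yielded last."""
    size = chunk_samples(chunk_ms)
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


print(chunk_samples(200))  # 3200
print(chunk_samples(320))  # 5120
```

At 16 kHz, the 200-320ms range therefore corresponds to 3,200-5,120 samples per chunk, which is the value to pass as the capture block size.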

Python: Real-Time Recognition

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
import queue

class RealtimeRecognizer:
    """Real-time speech recognition with VAD"""

    def __init__(self, model_size="medium", language="ja"):
        self.model = WhisperModel(model_size, device="cpu", compute_type="int8")
        self.language = language
        self.sample_rate = 16000
        self.chunk_duration = 0.3      # 300ms chunks
        self.silence_threshold = 0.015  # RMS energy threshold
        self.silence_duration = 0.8     # Finalize after ~800ms silence
        self._audio_buffer = []
        self._silence_frames = 0
        self._is_speaking = False
        self._utterances = queue.Queue()  # Finished utterances awaiting inference

    def _calculate_rms(self, audio: np.ndarray) -> float:
        return float(np.sqrt(np.mean(audio ** 2)))

    def _is_speech(self, audio: np.ndarray) -> bool:
        return self._calculate_rms(audio) >= self.silence_threshold

    def _process_audio(self, audio_data: np.ndarray) -> str | None:
        if len(audio_data) < self.sample_rate * 0.5:  # Ignore < 0.5s
            return None
        segments, _ = self.model.transcribe(
            audio_data, language=self.language, beam_size=5, vad_filter=False,
        )
        texts = [s.text.strip() for s in segments if s.text.strip()]
        return "".join(texts) if texts else None

    def start(self, callback):
        chunk_samples = int(self.sample_rate * self.chunk_duration)
        # Round up to whole chunks: 0.8s / 0.3s -> 3 chunks (900ms)
        silence_chunks = int(np.ceil(self.silence_duration / self.chunk_duration))

        def audio_callback(indata, frames, time_info, status):
            # Runs on the audio thread: only buffer here, never run inference,
            # otherwise the input stream overflows while the model is busy
            audio = indata[:, 0].copy()  # Mono
            if self._is_speech(audio):
                self._audio_buffer.append(audio)
                self._silence_frames = 0
                self._is_speaking = True
            elif self._is_speaking:
                self._silence_frames += 1
                self._audio_buffer.append(audio)
                if self._silence_frames >= silence_chunks:
                    self._utterances.put(np.concatenate(self._audio_buffer))
                    self._audio_buffer = []
                    self._silence_frames = 0
                    self._is_speaking = False

        with sd.InputStream(
            samplerate=self.sample_rate, channels=1, dtype="float32",
            blocksize=chunk_samples, callback=audio_callback,
        ):
            print("🎙 Recording... press Ctrl+C to stop")
            try:
                while True:
                    try:
                        # Short timeout keeps Ctrl+C responsive
                        full_audio = self._utterances.get(timeout=0.5)
                    except queue.Empty:
                        continue
                    result = self._process_audio(full_audio)
                    if result:
                        callback(result)
            except KeyboardInterrupt:
                print("Stopped")

# Usage
recognizer = RealtimeRecognizer()
recognizer.start(lambda text: print(f"Result: {text}"))

C#: Thread-Safe ASR Engine

using System.Buffers;
using Whisper.net;

public sealed class WhisperAsrEngine : IDisposable
{
    private readonly object _gate = new();
    private WhisperFactory? _factory;
    private WhisperProcessor? _processor;
    private float[]? _buffer;
    private int _bufferPos;
    private string? _lastFinal;
    private bool _disposed;

    private const int SampleRate = 16000;
    private static readonly int MaxSamples = SampleRate * 120;

    public void Start(string modelPath, string language = "ja", int threads = 4)
    {
        lock (_gate)
        {
            _factory = WhisperFactory.FromPath(modelPath,
                new WhisperFactoryOptions { UseGpu = false });
            _processor = _factory.CreateBuilder()
                .WithLanguage(language).WithThreads(threads)
                .WithSegmentEventHandler(OnSegment).Build();
            _buffer = ArrayPool<float>.Shared.Rent(MaxSamples);
            _bufferPos = 0;
        }
    }

    public void PushAudio(ReadOnlySpan<float> samples)
    {
        lock (_gate)
        {
            if (_buffer is null) return;
            var count = Math.Min(samples.Length, MaxSamples - _bufferPos);
            if (count <= 0) return;
            samples[..count].CopyTo(_buffer.AsSpan(_bufferPos));
            _bufferPos += count;
        }
    }

    public string? GetFinalAndReset()
    {
        lock (_gate)
        {
            if (_processor is null || _buffer is null || _bufferPos == 0) return null;
            _lastFinal = null;
            var audio = new float[_bufferPos];
            Array.Copy(_buffer, audio, _bufferPos);
            _processor.Process(audio);
            _bufferPos = 0;
            return _lastFinal;
        }
    }

    private void OnSegment(SegmentData e)
    {
        var text = e.Text?.Trim();
        if (string.IsNullOrEmpty(text)) return;
        _lastFinal = (_lastFinal is null) ? text : _lastFinal + text;
    }

    public void Dispose()
    {
        lock (_gate)
        {
            if (_disposed) return;
            _disposed = true;
            _processor?.Dispose();
            _factory?.Dispose();
            if (_buffer is not null)
            {
                ArrayPool<float>.Shared.Return(_buffer);
                _buffer = null;
            }
        }
    }
}

Design points

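ArrayPool<float>.Shared avoids a fresh multi-megabyte allocation on every Start while holding up to 120 seconds of audio. lock prevents race conditions between the recording and inference threads. PushAudio is kept to a lightweight copy operation; heavy inference is centralized in GetFinalAndReset. Two trade-offs follow from the code as written: GetFinalAndReset holds the lock while Process runs, so PushAudio calls made during inference wait until it finishes, and audio beyond the 120-second cap is silently dropped. Call GetFinalAndReset at speech end, when little audio is arriving.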
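
The same buffering idea can be sketched in Python. This is an illustration only, not a port of the class above: `BoundedAudioBuffer` is a hypothetical name, and unlike the C# engine it hands back a snapshot so that inference can run outside the lock.

```python
import threading
from array import array


class BoundedAudioBuffer:
    """Fixed-capacity sample buffer shared by a capture thread and an inference thread."""

    def __init__(self, max_samples: int):
        self._lock = threading.Lock()
        self._data = array("f", [0.0]) * max_samples  # preallocated float32 storage
        self._pos = 0

    def push(self, samples) -> int:
        """Copy in as many samples as fit; returns how many were accepted."""
        with self._lock:
            count = min(len(samples), len(self._data) - self._pos)
            if count > 0:
                self._data[self._pos:self._pos + count] = array("f", samples[:count])
                self._pos += count
            return count

    def take(self) -> array:
        """Snapshot the buffered audio and reset; only the copy runs under the lock."""
        with self._lock:
            snapshot = self._data[:self._pos]
            self._pos = 0
        return snapshot


buf = BoundedAudioBuffer(max_samples=16000 * 120)  # 120 s at 16 kHz
buf.push([0.1, 0.2, 0.3])
audio = buf.take()  # run transcription on `audio` after the lock is released
```

Returning the number of accepted samples from `push` lets the caller notice when the cap has been hit instead of losing audio silently.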


Step 4: VAD (Voice Activity Detection)

Energy-Based VAD (Lightweight and Practical)

import numpy as np

class EnergyVAD:
    """Speech detection via RMS energy and an envelope follower.
    The envelope follower smooths out instantaneous noise spikes."""

    def __init__(self, threshold=0.015, attack=0.2, release=0.05):
        self.threshold = threshold
        self.attack = attack
        self.release = release
        self.envelope = 0.0

    def is_speech(self, frame: np.ndarray) -> bool:
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms > self.envelope:
            self.envelope += self.attack * (rms - self.envelope)
        else:
            self.envelope += self.release * (rms - self.envelope)
        return self.envelope >= self.threshold

C#: Equivalent Energy VAD

public sealed class SimpleEnergyVad
{
    private readonly double _threshold;
    private readonly double _attack;
    private readonly double _release;
    private double _envelope;

    public SimpleEnergyVad(double threshold = 0.015, double attack = 0.2, double release = 0.05)
    {
        _threshold = threshold;
        _attack = Math.Clamp(attack, 0, 1);
        _release = Math.Clamp(release, 0, 1);
    }

    public bool IsSpeech(ReadOnlySpan<float> frame, int samples)
    {
        int n = Math.Min(samples, frame.Length);
        if (n <= 0) return false;
        double sum = 0;
        for (int i = 0; i < n; i++)
            sum += (double)frame[i] * frame[i];
        double rms = Math.Sqrt(sum / n); // Divide by the number of samples actually summed
        _envelope = rms > _envelope
            ? _envelope + _attack * (rms - _envelope)
            : _envelope + _release * (rms - _envelope);
        return _envelope >= _threshold;
    }
}

Silence Tracking and Speech-End Detection

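import math

class SilenceTracker:
    def __init__(self, silence_threshold_ms=800, frame_duration_ms=300):
        # Whole frames needed to cover the threshold (800 / 300 -> 3 frames)
        self.max_silent_frames = math.ceil(silence_threshold_ms / frame_duration_ms)
        self.silent_frame_count = 0
        self.in_speech = False

    def update(self, is_speech: bool) -> bool:
        """Returns True once, on the frame where speech is judged to have ended."""
        if is_speech:
            self.in_speech = True
            self.silent_frame_count = 0
            return False
        if not self.in_speech:
            return False  # Silence before any speech is not a speech end
        self.silent_frame_count += 1
        if self.silent_frame_count >= self.max_silent_frames:
            self.reset()
            return True
        return False

    def reset(self):
        self.silent_frame_count = 0
        self.in_speech = False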

Tuning the silence threshold

600–900ms is practical. Too short and sentences get cut mid-way; too long and responsiveness suffers. Make this configurable in a YAML settings file for easy tuning.
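
Because silence is counted in whole frames, the configured threshold is quantized by the frame size. A short sketch of that arithmetic; the function names are illustrative.

```python
import math


def silence_frames(silence_ms: int, frame_ms: int) -> int:
    """Frames of continuous silence needed before an utterance is finalized."""
    if silence_ms <= 0 or frame_ms <= 0:
        raise ValueError("durations must be positive")
    return math.ceil(silence_ms / frame_ms)


def effective_silence_ms(silence_ms: int, frame_ms: int) -> int:
    """Actual wait after rounding up to whole frames."""
    return silence_frames(silence_ms, frame_ms) * frame_ms


for frame_ms in (200, 240, 300, 320):
    print(frame_ms, silence_frames(800, frame_ms), effective_silence_ms(800, frame_ms))
```

With 300ms frames, an 800ms setting actually waits 900ms, and with 240ms frames it waits 960ms, so small changes to the configured value may have no audible effect until they cross a frame boundary.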


Step 5: Windows Integration — Hotkeys & Text Insertion

Hotkey Registration

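import ctypes
from ctypes import wintypes

MOD_CONTROL = 0x0002
MOD_ALT = 0x0001
VK_V = 0x56
WM_HOTKEY = 0x0312

ctypes.windll.user32.RegisterHotKey(None, 1, MOD_CONTROL | MOD_ALT, VK_V)
# With hWnd=None, WM_HOTKEY is posted to this thread's message queue, so pump it
msg = wintypes.MSG()
while ctypes.windll.user32.GetMessageW(ctypes.byref(msg), None, 0, 0) > 0:
    if msg.message == WM_HOTKEY and msg.wParam == 1:  # 1 = hotkey id used above
        print("Hotkey pressed")  # Start or stop recording here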

Clipboard-Based Text Insertion

The most compatible way to insert recognized text into any app is via clipboard paste.

import pyperclip
import keyboard
import time

def commit_text(text: str, restore_delay: float = 1.5):
    """Insert recognized text into the active app via clipboard.
    Original clipboard contents are automatically restored."""
    original = pyperclip.paste()
    pyperclip.copy(text)
    keyboard.send("ctrl+v")
    time.sleep(restore_delay)
    pyperclip.copy(original)

Production caveats

  • If restoration happens too early, the clipboard gets overwritten before the paste completes
  • Some apps (e.g., Remote Desktop) require falling back to SendInput
  • Filter out self-generated Ctrl+V events in the hotkey hook using the injected flag

💡 Best Practices

1. CPU Optimization

# localvoice.yaml (settings file example)
asr:
  engine: "whispercpp"
  model_path: "C:\\ProgramData\\LocalVoice\\models\\medium-q5.gguf"
  threads: 4          # Use physical core count (not logical)
  use_gpu: false
  frame_ms: 240

| CPU | Physical Cores | Recommended Threads | medium-q5 Inference (3s audio) |
| --- | --- | --- | --- |
| Core i5-1235U | 4P+8E | 4 | ~2.5s |
| Core i7-13700 | 8P+8E | 8 | ~1.2s |
| Ryzen 5 5600 | 6 | 6 | ~1.8s |
| Ryzen 7 7800X3D | 8 | 8 | ~1.0s |

Keep threads at or below physical core count

Setting it to the logical core count (including HT/SMT) can actually make inference slower due to context-switching overhead.
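
A thread count can be chosen at startup rather than hard-coded. The sketch below is one possible approach under stated assumptions: `psutil` is an optional third-party package, the fallback of two hardware threads per core is a guess that does not hold on every CPU, and the cap of 8 simply mirrors the largest value in the table above.

```python
import os


def detect_physical_cores() -> int:
    """Best-effort physical core count."""
    try:
        import psutil  # optional third-party dependency

        cores = psutil.cpu_count(logical=False)
        if cores:
            return cores
    except ImportError:
        pass
    # Fallback assumption: two hardware threads per core (typical with HT/SMT)
    logical = os.cpu_count() or 1
    return max(1, logical // 2)


def pick_thread_count(physical_cores: int, cap: int = 8) -> int:
    """Stay at or below the physical core count, with an upper cap."""
    return max(1, min(physical_cores, cap))


print(pick_thread_count(detect_physical_cores()))
```

On hybrid CPUs the detected count includes efficiency cores, so treat the result as a starting point and confirm it with a timing run.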

2. Improving Recognition Accuracy

Initial Prompt (Context Hint)

segments, _ = model.transcribe(
    audio,
    language="ja",
    initial_prompt="Technical discussion about Kubernetes, Docker, and CI/CD pipelines.",
)

VAD Filter

segments, _ = model.transcribe(
    audio, language="ja",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=600,
        speech_pad_ms=200,
        threshold=0.5,
    ),
)

3. Externalize Configuration

# localvoice.yaml
hotkey: "Ctrl+Alt+V"
mode: "hold_to_talk"
language: "ja"

asr:
  engine: "whispercpp"
  model_path: "models/medium-q5.gguf"
  threads: 4
  use_gpu: false

vad:
  silence_ms: 800
  energy_threshold: 0.015

commit:
  mode: "clipboard"
  restore_clipboard_ms: 1500

privacy:
  keep_audio: false
  keep_text_after_commit: false
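
Externalized settings are easier to trust when they are validated on load. A minimal sketch for the vad section, assuming the file has already been parsed into a dict (for example by a YAML loader); `VadConfig` and `load_vad_config` are hypothetical names, and the defaults mirror the values shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadConfig:
    silence_ms: int = 800
    energy_threshold: float = 0.015

    def __post_init__(self):
        if self.silence_ms <= 0:
            raise ValueError("vad.silence_ms must be positive")
        if not 0.0 < self.energy_threshold < 1.0:
            raise ValueError("vad.energy_threshold must be between 0 and 1")


def load_vad_config(settings: dict) -> VadConfig:
    """Build a validated VadConfig from already-parsed settings."""
    return VadConfig(**settings.get("vad", {}))


settings = {"vad": {"silence_ms": 700}}
print(load_vad_config(settings))
```

Unknown keys raise a TypeError from the dataclass constructor, which catches typos in the settings file early.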

🚀 Advanced Usage

Multi-Engine Support: Sherpa-ONNX

using SherpaOnnx;

var config = new OfflineRecognizerConfig();
config.ModelConfig.Tokens = @"models\sensevoice\tokens.txt";
config.ModelConfig.NumThreads = 4;
config.ModelConfig.Provider = "cpu";
config.ModelConfig.SenseVoice.Model = @"models\sensevoice\model.int8.onnx";
config.ModelConfig.SenseVoice.Language = "ja";
config.ModelConfig.SenseVoice.UseInverseTextNormalization = 1;

using var recognizer = new OfflineRecognizer(config);
using var stream = recognizer.CreateStream();
stream.AcceptWaveform(16000, audioSamples);
recognizer.Decode(stream);
Console.WriteLine(stream.Result.Text);

SenseVoice vs Whisper

SenseVoice is a lightweight model specialized for Japanese, English, and Chinese. It can be faster than Whisper for short utterances and includes built-in inverse text normalization (natural formatting of numbers and dates).

Batch Processing: Bulk File Conversion

from pathlib import Path
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")

for audio_file in Path("./recordings").glob("*.wav"):
    segments, _ = model.transcribe(str(audio_file), language="ja", vad_filter=True)
    text = "".join(s.text for s in segments)
    audio_file.with_suffix(".txt").write_text(text, encoding="utf-8")
    print(f"✅ {audio_file.name} → {audio_file.stem}.txt")

⚠️ Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| ModuleNotFoundError: No module named 'faster_whisper' | Not installed | pip install faster-whisper |
| FileNotFoundError: ffmpeg | FFmpeg not installed | winget install Gyan.FFmpeg → add to PATH |
| Empty result / hallucination | Silent input | Enable VAD filter, adjust silence_threshold |
| Out of memory (OOM) | Model too large | Switch to small, or use quantized model (q5/q8) |
| Slow inference | Wrong thread count | Check physical cores: wmic cpu get NumberOfCores |
| Garbled language output | Wrong auto-detected language | Set language="ja" explicitly |
| Clipboard restore fails | Restore timing too fast | Increase restore_clipboard_ms to 2000+ |

NumPy Compatibility

pip install "numpy<2.0"

Hallucination Mitigation

Whisper sometimes outputs nonsense text in response to silence or ambient noise. Counter-measures:

def filter_hallucination(text: str) -> str | None:
    if not text or len(text.strip()) < 2:
        return None
    if len(set(text.strip())) <= 2:  # Repeated character pattern
        return None
    hallucination_patterns = ["Thank you for watching", "Subscribe", "Subtitles"]
    if any(p in text for p in hallucination_patterns):
        return None
    return text

Priority order:

  1. Use VAD — don't feed silent segments to the ASR
  2. Ignore audio shorter than 0.5 seconds
  3. Skip chunks where RMS is below threshold
  4. Discard results that are very short or match repetition patterns
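
Steps 2 and 3 of the priority order can be combined into a single gate that runs before the ASR call. A minimal sketch using only the standard library; the function names are illustrative, and the defaults reuse the 0.5-second and 0.015 RMS values from earlier in this guide.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of a clip; 0.0 for an empty clip."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_transcribe(
    samples: Sequence[float],
    sample_rate: int = 16000,
    min_seconds: float = 0.5,
    rms_threshold: float = 0.015,
) -> bool:
    """Reject clips that are too short (step 2) or too quiet (step 3)."""
    if len(samples) < sample_rate * min_seconds:
        return False
    return rms(samples) >= rms_threshold


print(should_transcribe([0.1] * 16000))   # 1 s of audible signal
print(should_transcribe([0.1] * 4000))    # 0.25 s, too short
print(should_transcribe([0.001] * 16000)) # 1 s, too quiet
```

Clips that pass this gate still go through the text-level filter afterwards, since the gate cannot catch hallucinations triggered by steady background noise above the threshold.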

Summary

| Component | Recommended Setup |
| --- | --- |
| Runtime | whisper.cpp (desktop) / faster-whisper (scripts) |
| Model | medium-q5 (balanced) / large-v3-turbo (accuracy-first) |
| VAD | Energy-based (lightweight) + silence tracker |
| Text insertion | Clipboard paste (maximum compatibility) |
| Configuration | External YAML file |
| Privacy | No audio/text persistence |

For enterprise use

Fully offline voice input delivers the most value in security-sensitive environments. For MSI packaging, organization policy-based config locking, and Windows Event Log audit trails, see requirements.md in the project repository.