Whisper Local Implementation Guide: High-Accuracy Speech Recognition on CPU Only
Fully Offline × No GPU Required × Real-Time — A practical guide to voice input with zero cloud transmission
Key Takeaways
- Complete private processing — Audio data never leaves your machine. Works in environments where cloud SaaS is prohibited by policy
- Fast without a GPU — whisper.cpp + quantized models deliver 4x+ the speed of the original Whisper on CPU alone
- Real-time speech recognition — Combined with VAD (Voice Activity Detection), transcription starts while you're still speaking
- Windows integration — Insert text into any app with a single hotkey, delivering a low-friction UX similar to Win+H
📖 Overview
Whisper in 2025
OpenAI's Whisper is widely used as a high-accuracy, multilingual speech recognition model. As of 2025, the Whisper ecosystem has evolved significantly, and there are now several options that run at practical speeds on CPU only.
| Runtime | Language | Characteristics | Recommended Use |
|---|---|---|---|
| whisper.cpp | C/C++ | GGML-format quantized models (e.g., q5_0), lightest footprint | Embedded in desktop apps |
| faster-whisper | Python (CTranslate2) | int8 quantization, strong at batch processing | Server-side / scripts |
| Sherpa-ONNX | C++/Python/C# | Supports SenseVoice & Moonshine, multilingual | When multi-model switching is needed |
This article covers best practices for local Whisper implementation based on real-world experience building a production-quality Windows voice input app.
Model Selection Guide
| Model | Size (parameters, or file size if quantized) | Japanese Accuracy | CPU Inference (3s audio) | Recommended For |
|---|---|---|---|---|
| tiny | 39M params | △ | ~0.5s | Prototyping |
| base | 74M params | △–○ | ~0.8s | Low-spec devices |
| small | 244M params | ○ | ~1.5s | Balanced |
| medium (q5 quantized) | ~500 MB file | ◎ | ~2.0s | Recommended (CPU sweet spot) |
| large-v3-turbo | 809M params | ◎◎ | ~3.0s | Accuracy-first |
| large-v3 (q5 quantized) | ~1.1 GB file | ◎◎ | ~3.5s | Maximum accuracy |
Accuracy ratings, from lowest to highest: △, ○, ◎, ◎◎.
Field knowledge
The q5-quantized medium model offers the best cost-performance ratio on CPU: Japanese recognition accuracy stays above 90%, and a 3-second clip finishes inference in about 2 seconds on a 4-core CPU.
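One way to read the table is through the real-time factor (RTF): inference time divided by audio duration, where values below 1.0 mean the model keeps up with incoming speech. The sketch below uses the approximate figures from the table; the helper names `real_time_factor` and `keeps_up` are illustrative and not part of any library.

```python
# Approximate CPU inference times for a 3-second clip, copied from the table.
INFERENCE_SECONDS = {
    "tiny": 0.5,
    "base": 0.8,
    "small": 1.5,
    "medium-q5": 2.0,
    "large-v3-turbo": 3.0,
    "large-v3-q5": 3.5,
}
CLIP_SECONDS = 3.0


def real_time_factor(inference_s: float, audio_s: float = CLIP_SECONDS) -> float:
    """RTF below 1.0 means transcription finishes faster than the audio plays."""
    return inference_s / audio_s


def keeps_up(model: str) -> bool:
    """True if the model transcribes a clip in less time than the clip lasts."""
    return real_time_factor(INFERENCE_SECONDS[model]) < 1.0


for name, seconds in INFERENCE_SECONDS.items():
    print(f"{name:>15}: RTF {real_time_factor(seconds):.2f}")
```

By this measure the medium-q5 row leaves roughly a third of the clip duration as headroom, while the two large variants sit at or above 1.0 on the same hardware.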
🔧 Implementation
Step 1: Environment Setup
Python Environment (for faster-whisper)
# Python 3.10+ recommended
python -m venv whisper-env
source whisper-env/bin/activate # Windows: whisper-env\Scripts\activate
# Core packages
pip install faster-whisper numpy sounddevice
# FFmpeg (required for audio file conversion)
# Windows: winget install Gyan.FFmpeg
# macOS: brew install ffmpeg
# Linux: sudo apt install ffmpeg
.NET Environment (for whisper.cpp / Whisper.net)
For embedding in a desktop app, the C# binding Whisper.net is the leading option.
<!-- NuGet packages -->
<PackageReference Include="Whisper.net" Version="1.9.0" />
<PackageReference Include="Whisper.net.Runtime" Version="1.9.0" />
<!-- Only if GPU support is needed -->
<PackageReference Include="Whisper.net.Runtime.Cuda.Windows" Version="1.9.0" />
When to use which
faster-whisper is suited for Python scripts and server-side use. Whisper.net is the better choice for embedding in Windows desktop apps.
Step 2: Basic Speech Recognition
Python (faster-whisper)
from faster_whisper import WhisperModel
# Load model (downloads on first run)
model = WhisperModel(
    "medium",             # Model size
    device="cpu",         # CPU only
    compute_type="int8",  # Quantization for speed
    cpu_threads=4,        # Thread count (match physical cores)
)
# Transcribe an audio file
segments, info = model.transcribe(
    "audio.wav",
    language="ja",
    beam_size=5,
    vad_filter=True,  # Skip silent segments automatically
    vad_parameters=dict(
        min_silence_duration_ms=600,  # Split on 600ms+ silence
    ),
)
print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
C# (Whisper.net)
using Whisper.net;
// Specify path to a pre-downloaded model file
var modelPath = @"C:\models\ggml-medium-q5_0.bin";
using var factory = WhisperFactory.FromPath(modelPath,
    new WhisperFactoryOptions { UseGpu = false });
using var processor = factory.CreateBuilder()
    .WithLanguage("ja")
    .WithThreads(4)
    .WithSegmentEventHandler(e =>
    {
        Console.WriteLine($"[{e.Start} - {e.End}] {e.Text}");
    })
    .Build();
// Process 16kHz mono float32 PCM data
// LoadAudioAsFloat32 is an application-side helper (not shown) that returns float[] samples
var audioData = LoadAudioAsFloat32("audio.wav");
processor.Process(audioData);
Audio format requirements
whisper.cpp expects 16kHz, mono, float32 PCM data. When converting from WAV files, pay close attention to the sample rate and channel count.
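As an illustration of the conversion this note describes, the following standard-library sketch reads a 16-bit PCM WAV, averages the channels down to mono, scales to floats, and resamples to 16 kHz. The function names are hypothetical, and the linear-interpolation resampler is a simplification with no anti-aliasing filter; a production app would more likely rely on FFmpeg or a dedicated resampling library.

```python
import array
import sys
import wave

TARGET_RATE = 16000  # sample rate whisper.cpp expects


def read_wav_mono(source):
    """Return (sample_rate, samples) with samples as mono floats in [-1.0, 1.0).

    `source` is a file path or a binary file object. Only 16-bit PCM is handled.
    """
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        pcm = array.array("h")
        pcm.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    # Average the interleaved channels into mono and scale int16 to float.
    mono = [
        sum(pcm[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(pcm), channels)
    ]
    return rate, mono


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for j in range(out_len):
        pos = j * src_rate / dst_rate
        i = min(int(pos), last)
        frac = pos - i
        nxt = samples[min(i + 1, last)]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out
```

Checking the rate and channel count returned by `read_wav_mono` before inference is a cheap way to catch the mismatches mentioned above.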
Step 3: Real-Time Speech Recognition
Production-quality real-time recognition requires three components:
- Audio capture — Continuous mic input
- VAD (Voice Activity Detection) — Distinguish speech from silence
- Streaming inference — Incrementally transcribe accumulated audio
Architecture Overview
Mic input (16kHz)
│
▼
Audio capture ──── Chunk splitting (200-320ms)
│
▼
VAD ───────────── Silence → speech-end trigger
│
▼
ASR engine ─────── Partial result (real-time preview)
│ Final result (confirmed at speech end)
▼
Text output ────── Insert into app or display
Python: Real-Time Recognition
import math
import threading
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
class RealtimeRecognizer:
    """Real-time speech recognition with VAD"""
    def __init__(self, model_size="medium", language="ja"):
        self.model = WhisperModel(model_size, device="cpu", compute_type="int8")
        self.language = language
        self.sample_rate = 16000
        self.chunk_duration = 0.3       # 300ms chunks
        self.silence_threshold = 0.015  # RMS energy threshold
        self.silence_duration = 0.8     # Finalize after 800ms silence
        self._audio_buffer = []
        self._silence_frames = 0
        self._is_speaking = False
    def _calculate_rms(self, audio: np.ndarray) -> float:
        return float(np.sqrt(np.mean(audio ** 2)))
    def _is_speech(self, audio: np.ndarray) -> bool:
        return self._calculate_rms(audio) >= self.silence_threshold
    def _process_audio(self, audio_data: np.ndarray) -> str | None:
        if len(audio_data) < self.sample_rate * 0.5:  # Ignore < 0.5s
            return None
        segments, _ = self.model.transcribe(
            audio_data, language=self.language, beam_size=5, vad_filter=False,
        )
        texts = [s.text.strip() for s in segments if s.text.strip()]
        return "".join(texts) if texts else None
    def start(self, callback):
        chunk_samples = int(self.sample_rate * self.chunk_duration)
        # Round up so the wait is never shorter than silence_duration (0.8s / 0.3s -> 3 chunks)
        silence_chunks = max(1, math.ceil(round(self.silence_duration / self.chunk_duration, 6)))
        def audio_callback(indata, frames, time_info, status):
            # Inference runs inside the audio callback here for brevity; input can be dropped while it blocks
            audio = indata[:, 0].copy()  # Mono
            if self._is_speech(audio):
                self._audio_buffer.append(audio)
                self._silence_frames = 0
                self._is_speaking = True
            elif self._is_speaking:
                self._silence_frames += 1
                self._audio_buffer.append(audio)
                if self._silence_frames >= silence_chunks:
                    full_audio = np.concatenate(self._audio_buffer)
                    result = self._process_audio(full_audio)
                    if result:
                        callback(result)
                    self._audio_buffer = []
                    self._silence_frames = 0
                    self._is_speaking = False
        with sd.InputStream(
            samplerate=self.sample_rate, channels=1, dtype="float32",
            blocksize=chunk_samples, callback=audio_callback,
        ):
            print("🎙 Recording... press Ctrl+C to stop")
            threading.Event().wait()
# Usage
recognizer = RealtimeRecognizer()
recognizer.start(lambda text: print(f"Result: {text}"))
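The recognizer above calls the model from inside the audio callback, so the audio thread is blocked for as long as inference takes. One common alternative, sketched here as an assumption rather than as the app's actual design, is to hand each finished utterance to a worker thread through a bounded `queue.Queue`. `UtteranceWorker` and its method names are hypothetical, and `transcribe` stands in for any slow callable such as `_process_audio`.

```python
import queue
import threading


class UtteranceWorker:
    """Runs a slow transcribe function off the audio thread."""

    def __init__(self, transcribe, on_result, max_pending=8):
        self._transcribe = transcribe
        self._on_result = on_result
        self._queue = queue.Queue(maxsize=max_pending)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, audio) -> bool:
        """Called from the audio callback; never blocks. Returns False if dropped."""
        try:
            self._queue.put_nowait(audio)
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            audio = self._queue.get()
            if audio is None:  # sentinel pushed by close()
                break
            text = self._transcribe(audio)
            if text:
                self._on_result(text)

    def close(self):
        """Finish the queued work, then stop the worker thread."""
        self._queue.put(None)
        self._thread.join()
```

With this in place, the callback would replace its inline `_process_audio` call with `worker.submit(full_audio)` and return immediately. The bounded queue makes the overload behavior explicit: utterances are dropped instead of piling up without limit.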
C#: Thread-Safe ASR Engine
using System.Buffers;
using Whisper.net;
public sealed class WhisperAsrEngine : IDisposable
{
    private readonly object _gate = new();
    private WhisperFactory? _factory;
    private WhisperProcessor? _processor;
    private float[]? _buffer;
    private int _bufferPos;
    private string? _lastFinal;
    private bool _disposed;
    private const int SampleRate = 16000;
    private static readonly int MaxSamples = SampleRate * 120;
    public void Start(string modelPath, string language = "ja", int threads = 4)
    {
        lock (_gate)
        {
            _factory = WhisperFactory.FromPath(modelPath,
                new WhisperFactoryOptions { UseGpu = false });
            _processor = _factory.CreateBuilder()
                .WithLanguage(language).WithThreads(threads)
                .WithSegmentEventHandler(OnSegment).Build();
            _buffer = ArrayPool<float>.Shared.Rent(MaxSamples);
            _bufferPos = 0;
        }
    }
    public void PushAudio(ReadOnlySpan<float> samples)
    {
        lock (_gate)
        {
            if (_buffer is null) return;
            var count = Math.Min(samples.Length, MaxSamples - _bufferPos);
            if (count <= 0) return;
            samples[..count].CopyTo(_buffer.AsSpan(_bufferPos));
            _bufferPos += count;
        }
    }
    public string? GetFinalAndReset()
    {
        lock (_gate)
        {
            if (_processor is null || _buffer is null || _bufferPos == 0) return null;
            _lastFinal = null;
            var audio = new float[_bufferPos];
            Array.Copy(_buffer, audio, _bufferPos);
            _processor.Process(audio);
            _bufferPos = 0;
            return _lastFinal;
        }
    }
    private void OnSegment(SegmentData e)
    {
        var text = e.Text?.Trim();
        if (string.IsNullOrEmpty(text)) return;
        _lastFinal = (_lastFinal is null) ? text : _lastFinal + text;
    }
    public void Dispose()
    {
        lock (_gate)
        {
            if (_disposed) return;
            _disposed = true;
            _processor?.Dispose();
            _factory?.Dispose();
            if (_buffer is not null)
            {
                ArrayPool<float>.Shared.Return(_buffer);
                _buffer = null;
            }
        }
    }
}
Design points
ArrayPool<float>.Shared minimizes GC pressure while holding up to 120 seconds of audio efficiently. lock prevents race conditions between the recording and inference threads. PushAudio is kept to a lightweight copy operation; heavy inference is centralized in GetFinalAndReset.
Step 4: VAD (Voice Activity Detection)
Energy-Based VAD (Lightweight and Practical)
import numpy as np
class EnergyVAD:
    """Speech detection via RMS energy and an envelope follower.
    The envelope follower smooths out instantaneous noise spikes."""
    def __init__(self, threshold=0.015, attack=0.2, release=0.05):
        self.threshold = threshold
        self.attack = attack
        self.release = release
        self.envelope = 0.0
    def is_speech(self, frame: np.ndarray) -> bool:
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms > self.envelope:
            self.envelope += self.attack * (rms - self.envelope)
        else:
            self.envelope += self.release * (rms - self.envelope)
        return self.envelope >= self.threshold
public sealed class SimpleEnergyVad
{
    private readonly double _threshold;
    private readonly double _attack;
    private readonly double _release;
    private double _envelope;
    public SimpleEnergyVad(double threshold = 0.015, double attack = 0.2, double release = 0.05)
    {
        _threshold = threshold;
        _attack = Math.Clamp(attack, 0, 1);
        _release = Math.Clamp(release, 0, 1);
    }
    public bool IsSpeech(ReadOnlySpan<float> frame, int samples)
    {
        // Never read past the end of the span, even if the caller passes a larger count
        int count = Math.Min(samples, frame.Length);
        if (count <= 0) return false;
        double sum = 0;
        for (int i = 0; i < count; i++)
            sum += frame[i] * frame[i];
        // Divide by the number of samples actually summed
        double rms = Math.Sqrt(sum / count);
        _envelope = rms > _envelope
            ? _envelope + _attack * (rms - _envelope)
            : _envelope + _release * (rms - _envelope);
        return _envelope >= _threshold;
    }
}
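To make the smoothing behavior of the envelope follower concrete, here is a small worked example that applies the same attack/release update rule to a hand-written sequence of per-frame RMS values. The `follow` helper and the sample values are illustrative assumptions; only the update rule and the default parameters come from the classes above.

```python
def follow(rms_values, threshold=0.015, attack=0.2, release=0.05):
    """Apply the attack/release envelope update to a list of RMS values.

    Returns one (envelope, is_speech) pair per frame.
    """
    envelope = 0.0
    out = []
    for rms in rms_values:
        # Rise quickly (attack) when the level goes up, fall slowly (release) otherwise.
        coeff = attack if rms > envelope else release
        envelope += coeff * (rms - envelope)
        out.append((envelope, envelope >= threshold))
    return out


quiet, loud = 0.002, 0.05
click = follow([quiet] * 5 + [loud] + [quiet] * 3)  # a one-frame noise spike
speech = follow([quiet] * 5 + [loud] * 4)           # sustained speech

print("click :", [flag for _, flag in click])
print("speech:", [flag for _, flag in speech])
```

With these defaults the single loud frame never pushes the envelope past the threshold, while sustained speech is flagged from its second loud frame onward. That one-frame onset delay is the price of the smoothing, which is one reason to keep a short pre-roll of audio before the first flagged frame.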
Silence Tracking and Speech-End Detection
import math
class SilenceTracker:
    def __init__(self, silence_threshold_ms=800, frame_duration_ms=300):
        # Count whole frames, rounding up so the wait is never shorter than configured
        self.max_silent_frames = max(1, math.ceil(silence_threshold_ms / frame_duration_ms))
        self.silent_frame_count = 0
    def update(self, is_speech: bool) -> bool:
        """Returns True when speech has ended."""
        if is_speech:
            self.silent_frame_count = 0
            return False
        self.silent_frame_count += 1
        return self.silent_frame_count >= self.max_silent_frames
    def reset(self):
        self.silent_frame_count = 0
Tuning the silence threshold
600–900ms is practical. Too short and sentences get cut mid-way; too long and responsiveness suffers. Make this configurable in a YAML settings file for easy tuning.
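Because silence is counted in whole frames, the timeout that actually applies is the configured value rounded up to a multiple of the frame length. The sketch below works through that arithmetic for a few settings in the practical range; the function names are illustrative.

```python
import math


def silent_frames_needed(silence_ms: int, frame_ms: int) -> int:
    """Whole frames of silence required before an utterance is finalized."""
    if silence_ms <= 0 or frame_ms <= 0:
        raise ValueError("silence_ms and frame_ms must be positive")
    return math.ceil(silence_ms / frame_ms)


def effective_silence_ms(silence_ms: int, frame_ms: int) -> int:
    """The timeout that actually applies once rounded up to whole frames."""
    return silent_frames_needed(silence_ms, frame_ms) * frame_ms


for silence_ms in (600, 800, 900):
    for frame_ms in (240, 300):
        frames = silent_frames_needed(silence_ms, frame_ms)
        actual = effective_silence_ms(silence_ms, frame_ms)
        print(f"silence={silence_ms}ms frame={frame_ms}ms -> {frames} frames ({actual}ms)")
```

For example, a configured 800 ms with 300 ms frames becomes three frames, or 900 ms in practice. Shorter frames give finer control over the timeout at the cost of more callbacks per second.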
Step 5: Windows Integration — Hotkeys & Text Insertion
Hotkey Registration
import ctypes
MOD_CONTROL = 0x0002
MOD_ALT = 0x0001
VK_V = 0x56
ctypes.windll.user32.RegisterHotKey(None, 1, MOD_CONTROL | MOD_ALT, VK_V)
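Registering the hotkey is only half of the job: Windows delivers `WM_HOTKEY` to the message queue of the thread that registered it, so that thread needs a message loop. The following is a minimal sketch of such a loop using `ctypes`; the function name `run_hotkey_loop` and the `HOTKEY_ID` value are assumptions for illustration.

```python
import ctypes
import sys

MOD_ALT = 0x0001
MOD_CONTROL = 0x0002
VK_V = 0x56
WM_HOTKEY = 0x0312
HOTKEY_ID = 1  # arbitrary identifier chosen by the application


def run_hotkey_loop(on_hotkey) -> None:
    """Register Ctrl+Alt+V and call on_hotkey() each time it is pressed.

    Blocks the calling thread until WM_QUIT is received. Windows only.
    """
    if sys.platform != "win32":
        raise OSError("RegisterHotKey is only available on Windows")
    from ctypes import wintypes  # imported here so the module loads on any OS

    user32 = ctypes.windll.user32
    if not user32.RegisterHotKey(None, HOTKEY_ID, MOD_CONTROL | MOD_ALT, VK_V):
        raise ctypes.WinError()
    try:
        msg = wintypes.MSG()
        # GetMessageW returns 0 on WM_QUIT and -1 on error.
        while user32.GetMessageW(ctypes.byref(msg), None, 0, 0) > 0:
            if msg.message == WM_HOTKEY and msg.wParam == HOTKEY_ID:
                on_hotkey()
    finally:
        user32.UnregisterHotKey(None, HOTKEY_ID)
```

Running this on a dedicated thread keeps the loop out of the way of audio capture. `RegisterHotKey` fails if another application already owns the same combination, which is why the return value is checked.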
Clipboard-Based Text Insertion
The most compatible way to insert recognized text into any app is via clipboard paste.
import pyperclip
import keyboard
import time
def commit_text(text: str, restore_delay: float = 1.5):
    """Insert recognized text into the active app via clipboard.
    Original clipboard contents are automatically restored."""
    original = pyperclip.paste()
    pyperclip.copy(text)
    keyboard.send("ctrl+v")
    time.sleep(restore_delay)
    pyperclip.copy(original)
Production caveats
- If restoration happens too early, the clipboard gets overwritten before the paste completes
- Some apps (e.g., Remote Desktop) require falling back to SendInput
- Filter out self-generated Ctrl+V events in the hotkey hook using the injected flag
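Two of these caveats can be sketched in a few lines. The first function restores the clipboard in a `finally` block after a delay, so the original contents come back even if sending the paste raises. The second checks the injected bit that a low-level keyboard hook (`WH_KEYBOARD_LL`) reports in `KBDLLHOOKSTRUCT.flags`. The function names and the dependency-injected clipboard callables are assumptions made so the sketch runs without third-party packages.

```python
import time

LLKHF_INJECTED = 0x10  # KBDLLHOOKSTRUCT.flags bit set for synthesized key events


def is_injected(flags: int) -> bool:
    """True if a low-level keyboard hook event was generated by software."""
    return bool(flags & LLKHF_INJECTED)


def paste_with_restore(text, read_clipboard, write_clipboard, send_paste,
                       restore_delay=1.5, sleep=time.sleep):
    """Paste `text` through the clipboard, then put the old contents back.

    read_clipboard/write_clipboard/send_paste are supplied by the caller,
    for example pyperclip.paste, pyperclip.copy and a Ctrl+V sender.
    """
    original = read_clipboard()
    write_clipboard(text)
    try:
        send_paste()
        sleep(restore_delay)  # give the target app time to read the clipboard
    finally:
        write_clipboard(original)
```

In a hook callback, returning early when `is_injected(flags)` is true keeps the app from reacting to the Ctrl+V it just sent itself.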
💡 Best Practices
1. CPU Optimization
# localvoice.yaml (settings file example)
asr:
  engine: "whispercpp"
  model_path: "C:\\ProgramData\\LocalVoice\\models\\medium-q5.gguf"
  threads: 4   # Use physical core count (not logical)
  use_gpu: false
  frame_ms: 240
| CPU | Physical Cores | Recommended Threads | medium-q5 Inference (3s audio) |
|---|---|---|---|
| Core i5-1235U | 2P+8E | 4 | ~2.5s |
| Core i7-13700 | 8P+8E | 8 | ~1.2s |
| Ryzen 5 5600 | 6 | 6 | ~1.8s |
| Ryzen 7 7800X3D | 8 | 8 | ~1.0s |
Keep threads at or below physical core count
Setting it to the logical core count (including HT/SMT) can actually make inference slower due to context-switching overhead.
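A thread setting can be derived at startup instead of being hard-coded. The sketch below is a best-effort approach under stated assumptions: it uses the optional third-party `psutil` package when available and otherwise falls back to halving the logical count, which assumes two hardware threads per core. The helper names and the `cap` parameter are hypothetical.

```python
import os


def physical_core_count() -> int:
    """Best-effort physical core count for choosing the thread setting."""
    try:
        import psutil  # optional third-party dependency

        cores = psutil.cpu_count(logical=False)
        if cores:
            return cores
    except ImportError:
        pass
    logical = os.cpu_count() or 1
    # Assumption: 2 hardware threads per core (typical when HT/SMT is enabled).
    return max(1, logical // 2)


def recommended_threads(cap: int = 8) -> int:
    """Physical core count, clamped to the range 1..cap."""
    return max(1, min(physical_core_count(), cap))


print(f"physical cores: {physical_core_count()}, threads to use: {recommended_threads()}")
```

On hybrid CPUs the physical count includes efficiency cores, so the value from this helper is better treated as an upper bound to benchmark against, in line with the table above.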
2. Improving Recognition Accuracy
Initial Prompt (Context Hint)
segments, _ = model.transcribe(
    audio,
    language="ja",
    initial_prompt="Technical discussion about Kubernetes, Docker, and CI/CD pipelines.",
)
VAD Filter
segments, _ = model.transcribe(
    audio, language="ja",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=600,
        speech_pad_ms=200,
        threshold=0.5,
    ),
)
3. Externalize Configuration
# localvoice.yaml
hotkey: "Ctrl+Alt+V"
mode: "hold_to_talk"
language: "ja"
asr:
  engine: "whispercpp"
  model_path: "models/medium-q5.gguf"
  threads: 4
  use_gpu: false
vad:
  silence_ms: 800
  energy_threshold: 0.015
commit:
  mode: "clipboard"
  restore_clipboard_ms: 1500
privacy:
  keep_audio: false
  keep_text_after_commit: false
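Once the file has been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`, which is an assumption about the loader rather than something the settings file prescribes), it helps to merge in defaults and reject bad values before the engine starts. A minimal sketch for the `asr` and `vad` sections follows; the dataclass and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrSettings:
    model_path: str  # required: there is no sensible default
    engine: str = "whispercpp"
    threads: int = 4
    use_gpu: bool = False


@dataclass(frozen=True)
class VadSettings:
    silence_ms: int = 800
    energy_threshold: float = 0.015


def parse_settings(raw: dict) -> tuple[AsrSettings, VadSettings]:
    """Merge a parsed settings mapping with defaults and validate ranges.

    Unknown keys raise TypeError, which catches typos in the file early.
    """
    asr = AsrSettings(**(raw.get("asr") or {}))
    vad = VadSettings(**(raw.get("vad") or {}))
    if asr.threads < 1:
        raise ValueError("asr.threads must be at least 1")
    if vad.silence_ms <= 0:
        raise ValueError("vad.silence_ms must be positive")
    if not 0.0 < vad.energy_threshold < 1.0:
        raise ValueError("vad.energy_threshold must be between 0 and 1")
    return asr, vad
```

Failing fast at load time means a bad threshold shows up as a clear startup error instead of as a recognizer that never triggers.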
🚀 Advanced Usage
Multi-Engine Support: Sherpa-ONNX
using SherpaOnnx;
var config = new OfflineRecognizerConfig();
config.ModelConfig.Tokens = @"models\sensevoice\tokens.txt";
config.ModelConfig.NumThreads = 4;
config.ModelConfig.Provider = "cpu";
config.ModelConfig.SenseVoice.Model = @"models\sensevoice\model.int8.onnx";
config.ModelConfig.SenseVoice.Language = "ja";
config.ModelConfig.SenseVoice.UseInverseTextNormalization = 1;
using var recognizer = new OfflineRecognizer(config);
using var stream = recognizer.CreateStream();
stream.AcceptWaveform(16000, audioSamples);
recognizer.Decode(stream);
Console.WriteLine(stream.Result.Text);
SenseVoice vs Whisper
SenseVoice is a lightweight model specialized for Japanese, English, and Chinese. It can be faster than Whisper for short utterances and includes built-in inverse text normalization (natural formatting of numbers and dates).
Batch Processing: Bulk File Conversion
from pathlib import Path
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cpu", compute_type="int8")
for audio_file in Path("./recordings").glob("*.wav"):
    segments, _ = model.transcribe(str(audio_file), language="ja", vad_filter=True)
    text = "".join(s.text for s in segments)
    audio_file.with_suffix(".txt").write_text(text, encoding="utf-8")
    print(f"✅ {audio_file.name} → {audio_file.stem}.txt")
⚠️ Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| ModuleNotFoundError: No module named 'faster_whisper' | Not installed | pip install faster-whisper |
| FileNotFoundError: ffmpeg | FFmpeg not installed | winget install Gyan.FFmpeg → add to PATH |
| Empty result / hallucination | Silent input | Enable VAD filter, adjust silence_threshold |
| Out of memory (OOM) | Model too large | Switch to small, or use quantized model (q5/q8) |
| Slow inference | Wrong thread count | Check physical cores: wmic cpu get NumberOfCores |
| Garbled language output | Wrong auto-detected language | Set language="ja" explicitly |
| Clipboard restore fails | Restore timing too fast | Increase restore_clipboard_ms to 2000+ |
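NumPy Compatibility
Some packages in this stack may have been built against NumPy 1.x and fail to import under NumPy 2. If you hit such an import error, pinning NumPy below 2.0 is a common workaround:
pip install "numpy<2.0"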
Hallucination Mitigation
Whisper sometimes outputs nonsense text in response to silence or ambient noise. Countermeasures:
def filter_hallucination(text: str) -> str | None:
    if not text or len(text.strip()) < 2:
        return None
    if len(set(text.strip())) <= 2:  # Repeated character pattern
        return None
    hallucination_patterns = ["Thank you for watching", "Subscribe", "Subtitles"]
    if any(p in text for p in hallucination_patterns):
        return None
    return text
Priority order:
1. Use VAD — don't feed silent segments to the ASR
2. Ignore audio shorter than 0.5 seconds
3. Skip chunks where RMS is below threshold
4. Discard results that are very short or match repetition patterns
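The last three checks in this list can be chained around the recognizer so that the cheap tests run first and the model is only called when the audio is worth the cost. The sketch below assumes VAD has already segmented the audio upstream; the function names, the injected `transcribe` callable, and the reuse of the 0.5 s and 0.015 values are illustrative choices.

```python
import math

MIN_SECONDS = 0.5       # clips shorter than this are ignored
RMS_THRESHOLD = 0.015   # clips quieter than this are skipped


def rms(samples) -> float:
    """Root-mean-square level of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def worth_transcribing(samples, sample_rate=16000) -> bool:
    """Pre-ASR gates: skip clips that are too short or too quiet."""
    if len(samples) < sample_rate * MIN_SECONDS:
        return False
    return rms(samples) >= RMS_THRESHOLD


def looks_like_hallucination(text: str) -> bool:
    """Post-ASR gate: very short output, or only one or two distinct characters."""
    stripped = text.strip()
    return len(stripped) < 2 or len(set(stripped)) <= 2


def guarded_transcribe(samples, transcribe, sample_rate=16000):
    """Run `transcribe` only on plausible audio and drop implausible text."""
    if not worth_transcribing(samples, sample_rate):
        return None
    text = transcribe(samples)
    if text is None or looks_like_hallucination(text):
        return None
    return text.strip()
```

Ordering the gates this way also saves CPU time, since the length and RMS checks cost far less than a single inference pass.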
Summary
| Component | Recommended Setup |
|---|---|
| Runtime | whisper.cpp (desktop) / faster-whisper (scripts) |
| Model | medium-q5 (balanced) / large-v3-turbo (accuracy-first) |
| VAD | Energy-based (lightweight) + silence tracker |
| Text insertion | Clipboard paste (maximum compatibility) |
| Configuration | External YAML file |
| Privacy | No audio/text persistence |
For enterprise use
Fully offline voice input delivers the most value in security-sensitive environments. For MSI packaging, organization policy-based config locking, and Windows Event Log audit trails, see requirements.md in the project repository.