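The Complete Guide to Running Whisper Locally: High-Accuracy Speech Recognition on CPU Only

Fully offline × no GPU × real time: a practical guide to voice input without sending anything to the cloud

Key points of this article

Fully private processing: audio never leaves the machine, so it can be used even where company policy forbids cloud SaaS
Fast without a GPU: whisper.cpp with a quantized model runs more than 4x faster than the original Whisper implementation on CPU alone
Real-time recognition: combined with VAD (voice activity detection), speech is transcribed incrementally while the user is still talking
Windows integration: insert text into any application with a single hotkey, for a low-friction experience similar to Win+H

📖 Overview

Whisper speech recognition in 2025

OpenAI's Whisper is widely used as a high-accuracy multilingual speech recognition model. As of 2025 the Whisper ecosystem has matured considerably, and several runtimes now run at practical speeds on CPU alone.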

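Runtime	Language	Characteristics	Recommended use
whisper.cpp	C/C++	GGML quantized models (for example q5_0), lightest footprint	Embedding in desktop apps
faster-whisper	Python (CTranslate2)	int8 quantization, strong at batch processing	Server-side / scripts
Sherpa-ONNX	C++/Python/C#	Supports SenseVoice and Moonshine, multilingual	When switching between multiple models is required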

This article draws on the experience of building a production-quality Windows voice-input application to explain best practices for local Whisper deployments.

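Model selection guide

Model	Size	Japanese accuracy	CPU inference time (3 s of audio)	Recommended for
tiny	39M params (~75 MB file)	Fair	~0.5 s	Prototyping
base	74M params (~142 MB file)	Fair to good	~0.8 s	Lightweight devices
small	244M params (~466 MB file)	Good	~1.5 s	Balanced setups
medium (q5 quantized)	~500 MB	Very good	~2.0 s	Recommended (practical CPU sweet spot)
large-v3-turbo	809M params (~1.6 GB file)	Excellent	~3.0 s	Accuracy first
large-v3 (q5 quantized)	~1.1 GB	Excellent	~3.5 s	Highest accuracy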

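Notes from production use

The q5-quantized medium model offers the best cost-performance on CPU. It keeps Japanese recognition accuracy above 90% while processing a 3-second clip within about 2 seconds even on a 4-core CPU.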
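One way to compare the rows of the model table is the real-time factor (RTF): processing time divided by audio duration. The sketch below computes it from the approximate figures in the table. The helper name `real_time_factor` and the dictionary are illustrative, and the timings are the article's rough estimates rather than new measurements.

```python
# Approximate CPU inference times (seconds) for a 3-second clip, copied from the table
INFERENCE_SECONDS = {
    "tiny": 0.5,
    "base": 0.8,
    "small": 1.5,
    "medium-q5": 2.0,
    "large-v3-turbo": 3.0,
    "large-v3-q5": 3.5,
}
AUDIO_SECONDS = 3.0


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


for name, seconds in INFERENCE_SECONDS.items():
    rtf = real_time_factor(seconds, AUDIO_SECONDS)
    verdict = "faster than real time" if rtf < 1.0 else "at or slower than real time"
    print(f"{name:16s} RTF={rtf:.2f} ({verdict})")
```

Under these figures, every model up to medium-q5 stays below an RTF of 1.0, which matches the article's choice of medium-q5 as the practical CPU line.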

🔧 Implementation

Step 1: Environment setup

Python environment (when using faster-whisper)

# Python 3.10+ recommended
python -m venv whisper-env
source whisper-env/bin/activate  # Windows: whisper-env\Scripts\activate

# Core packages
pip install faster-whisper numpy sounddevice

# Install FFmpeg (needed for audio file conversion)
# Windows: winget install Gyan.FFmpeg
# macOS:   brew install ffmpeg
# Linux:   sudo apt install ffmpeg

.NET environment (when using whisper.cpp / Whisper.net)

For embedding in a desktop application, the C# binding Whisper.net is a strong option.

<!-- NuGet packages -->
<PackageReference Include="Whisper.net" Version="1.9.0" />
<PackageReference Include="Whisper.net.Runtime" Version="1.9.0" />
<!-- Only if GPU support is needed -->
<PackageReference Include="Whisper.net.Runtime.Cuda.Windows" Version="1.9.0" />

Which one to choose

faster-whisper suits Python scripts and server-side use; Whisper.net suits embedding in Windows desktop applications.

Step 2: Basic speech recognition

Python (faster-whisper)

from faster_whisper import WhisperModel

# Load the model (the first run downloads it)
model = WhisperModel(
    "medium",               # model size
    device="cpu",           # CPU only
    compute_type="int8",    # quantized for speed
    cpu_threads=4,          # CPU thread count (physical core count recommended)
)

# Transcribe an audio file
segments, info = model.transcribe(
    "audio.wav",
    language="ja",
    beam_size=5,
    vad_filter=True,          # skip silent stretches automatically with VAD
    vad_parameters=dict(
        min_silence_duration_ms=600,  # split on 600 ms or more of silence
    ),
)

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")

C# (Whisper.net)

using Whisper.net;

// Path to a model file downloaded in advance
var modelPath = @"C:\models\ggml-medium-q5_0.bin";

using var factory = WhisperFactory.FromPath(modelPath,
    new WhisperFactoryOptions { UseGpu = false });

using var processor = factory.CreateBuilder()
    .WithLanguage("ja")
    .WithThreads(4)
    .WithSegmentEventHandler(e =>
    {
        Console.WriteLine($"[{e.Start} - {e.End}] {e.Text}");
    })
    .Build();

// Process 16 kHz mono float32 PCM samples
// (LoadAudioAsFloat32 is a helper you supply; it is not part of Whisper.net)
var audioData = LoadAudioAsFloat32("audio.wav");
processor.Process(audioData);

A note on audio format

whisper.cpp requires PCM data at 16 kHz, mono, float32. When converting from a WAV file, check both the sample rate and the channel count.
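The format requirement above can be checked and partly handled in a few lines. The following is a minimal sketch that uses only the Python standard library and assumes 16-bit PCM WAV input. The function names `pcm16_to_mono_float` and `load_wav_16k_mono` are made up for this example. It downmixes to mono and normalizes to the [-1.0, 1.0) range, but deliberately refuses to resample and raises an error instead.

```python
import struct
import wave

TARGET_RATE = 16000  # sample rate whisper.cpp expects


def pcm16_to_mono_float(raw: bytes, channels: int) -> list[float]:
    """Convert interleaved 16-bit little-endian PCM to mono floats in [-1.0, 1.0)."""
    if channels < 1:
        raise ValueError("channels must be >= 1")
    frame_count = len(raw) // (2 * channels)
    values = struct.unpack(f"<{frame_count * channels}h", raw[: frame_count * channels * 2])
    mono = []
    for i in range(frame_count):
        frame = values[i * channels : (i + 1) * channels]
        # Average the channels, then scale int16 to the float range
        mono.append(sum(frame) / channels / 32768.0)
    return mono


def load_wav_16k_mono(path: str) -> list[float]:
    """Load a 16-bit PCM WAV file, rejecting anything that is not already 16 kHz."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if wav.getframerate() != TARGET_RATE:
            raise ValueError(
                f"expected {TARGET_RATE} Hz, got {wav.getframerate()} Hz; resample first"
            )
        raw = wav.readframes(wav.getnframes())
        return pcm16_to_mono_float(raw, wav.getnchannels())
```

The result is a list of Python floats; convert it to a float32 buffer (for example with `array.array("f", samples)` or NumPy) before handing it to a binding. Files at other sample rates can be resampled beforehand with FFmpeg.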

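Step 3: Real-time speech recognition

Production-quality real-time recognition needs the following three components.

Audio capture: read microphone input in real time
VAD (voice activity detection): tell speech from silence
Streaming inference: transcribe the accumulated audio incrementally

Architecture overview

Microphone input (16 kHz)
  │
  ▼
Audio capture ──── split into chunks (200-320 ms)
  │
  ▼
VAD (voice activity detection) ─── silence detected → end-of-utterance trigger
  │
  ▼
ASR engine ─────── partial results (real-time preview)
  │                  final result (committed when the utterance ends)
  ▼
Text output ──── insert into the app or display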
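The chunk sizes in the diagram translate directly into sample counts at 16 kHz. The sketch below shows the arithmetic and a simple splitter; `samples_per_chunk` and `split_into_chunks` are illustrative names, not part of any library.

```python
SAMPLE_RATE = 16000  # Hz, as in the diagram


def samples_per_chunk(chunk_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples in one chunk of the given duration."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return sample_rate * chunk_ms // 1000


def split_into_chunks(samples: list[float], chunk_ms: int,
                      sample_rate: int = SAMPLE_RATE) -> list[list[float]]:
    """Split a buffer into fixed-size chunks; the last chunk may be shorter."""
    size = samples_per_chunk(chunk_ms, sample_rate)
    return [samples[i : i + size] for i in range(0, len(samples), size)]


for ms in (200, 240, 300, 320):
    print(f"{ms} ms -> {samples_per_chunk(ms)} samples")
```

A 200 ms chunk is 3,200 samples and a 320 ms chunk is 5,120 samples, so one second of audio arrives as three to five VAD decisions.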

Python: real-time recognition

import math
import queue

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

class RealtimeRecognizer:
    """Real-time speech recognition with a simple energy-based VAD."""

    def __init__(self, model_size="medium", language="ja"):
        self.model = WhisperModel(model_size, device="cpu", compute_type="int8")
        self.language = language
        self.sample_rate = 16000
        self.chunk_duration = 0.3       # 300 ms chunks
        self.silence_threshold = 0.015  # RMS energy threshold
        self.silence_duration = 0.8     # finalize after about 800 ms of silence
        self._audio_buffer = []
        self._silence_frames = 0
        self._is_speaking = False
        self._utterances = queue.Queue()  # finished utterances awaiting transcription

    def _calculate_rms(self, audio: np.ndarray) -> float:
        return float(np.sqrt(np.mean(audio ** 2)))

    def _is_speech(self, audio: np.ndarray) -> bool:
        return self._calculate_rms(audio) >= self.silence_threshold

    def _process_audio(self, audio_data: np.ndarray) -> str | None:
        # Ignore clips shorter than 0.5 s; they mostly produce hallucinations
        if len(audio_data) < self.sample_rate * 0.5:
            return None
        segments, _ = self.model.transcribe(
            audio_data, language=self.language, beam_size=5, vad_filter=False,
        )
        texts = [s.text.strip() for s in segments if s.text.strip()]
        return "".join(texts) if texts else None

    def start(self, callback):
        chunk_samples = int(self.sample_rate * self.chunk_duration)
        # Round up to whole chunks: 0.8 s at 0.3 s per chunk becomes 3 chunks (900 ms)
        silence_chunks = math.ceil(self.silence_duration / self.chunk_duration)

        def audio_callback(indata, frames, time_info, status):
            # Runs on the audio thread: keep it cheap and never call the model here,
            # otherwise capture stalls while inference is running
            audio = indata[:, 0].copy()
            if self._is_speech(audio):
                self._audio_buffer.append(audio)
                self._silence_frames = 0
                self._is_speaking = True
            elif self._is_speaking:
                self._silence_frames += 1
                self._audio_buffer.append(audio)
                if self._silence_frames >= silence_chunks:
                    # Hand the finished utterance to the main thread
                    self._utterances.put(np.concatenate(self._audio_buffer))
                    self._audio_buffer = []
                    self._silence_frames = 0
                    self._is_speaking = False

        with sd.InputStream(
            samplerate=self.sample_rate, channels=1, dtype="float32",
            blocksize=chunk_samples, callback=audio_callback,
        ):
            print("🎙 Recording... press Ctrl+C to stop")
            try:
                while True:
                    try:
                        audio = self._utterances.get(timeout=0.5)
                    except queue.Empty:
                        continue
                    result = self._process_audio(audio)
                    if result:
                        callback(result)
            except KeyboardInterrupt:
                print("Stopped")

# Usage example
recognizer = RealtimeRecognizer()
recognizer.start(lambda text: print(f"Recognized: {text}"))

C#: a thread-safe ASR engine design

using System;
using System.Buffers;
using Whisper.net;

public sealed class WhisperAsrEngine : IDisposable
{
    private readonly object _gate = new();           // guards the audio buffer
    private readonly object _inferenceGate = new();  // serializes Process() and Dispose()
    private WhisperFactory? _factory;
    private WhisperProcessor? _processor;
    private float[]? _buffer;
    private int _bufferPos;
    private string? _lastFinal;
    private bool _disposed;

    private const int SampleRate = 16000;
    private static readonly int MaxSamples = SampleRate * 120;  // up to 120 s per utterance

    public void Start(string modelPath, string language = "ja", int threads = 4)
    {
        // Lock order is always _inferenceGate first, then _gate
        lock (_inferenceGate)
        {
            lock (_gate)
            {
                _factory = WhisperFactory.FromPath(modelPath,
                    new WhisperFactoryOptions { UseGpu = false });
                _processor = _factory.CreateBuilder()
                    .WithLanguage(language).WithThreads(threads)
                    .WithSegmentEventHandler(OnSegment).Build();
                _buffer = ArrayPool<float>.Shared.Rent(MaxSamples);
                _bufferPos = 0;
            }
        }
    }

    // Called from the recording thread: copy only, no inference
    public void PushAudio(ReadOnlySpan<float> samples)
    {
        lock (_gate)
        {
            if (_buffer is null) return;
            var count = Math.Min(samples.Length, MaxSamples - _bufferPos);
            if (count <= 0) return;
            samples[..count].CopyTo(_buffer.AsSpan(_bufferPos));
            _bufferPos += count;
        }
    }

    // Called from the recognition thread when the VAD reports end of utterance
    public string? GetFinalAndReset()
    {
        float[] audio;
        lock (_gate)
        {
            if (_buffer is null || _bufferPos == 0) return null;
            // Snapshot the utterance, then release the buffer lock so that
            // PushAudio is not blocked while inference runs
            audio = _buffer.AsSpan(0, _bufferPos).ToArray();
            _bufferPos = 0;
        }
        lock (_inferenceGate)
        {
            if (_processor is null) return null;
            _lastFinal = null;
            _processor.Process(audio);  // segments are collected through OnSegment
            return _lastFinal;
        }
    }

    private void OnSegment(SegmentData e)
    {
        var text = e.Text?.Trim();
        if (string.IsNullOrEmpty(text)) return;
        _lastFinal = (_lastFinal is null) ? text : _lastFinal + text;
    }

    public void Dispose()
    {
        lock (_inferenceGate)  // wait for any in-flight Process() call
        {
            lock (_gate)
            {
                if (_disposed) return;
                _disposed = true;
                _processor?.Dispose();
                _processor = null;
                _factory?.Dispose();
                _factory = null;
                if (_buffer is not null)
                {
                    ArrayPool<float>.Shared.Return(_buffer);
                    _buffer = null;
                }
            }
        }
    }
}

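Design points

ArrayPool<float>.Shared reduces GC pressure. A lock prevents races between the recording thread and the recognition thread. PushAudio is kept to a lightweight copy, and the heavy recognition work is concentrated in GetFinalAndReset.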
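The same split between a cheap push and a heavy drain can be sketched in Python. This is an illustrative analogue, not a port of the C# class: `UtteranceBuffer` and its methods are hypothetical names, and a plain list stands in for the pooled array because Python has no direct equivalent of ArrayPool.

```python
import threading


class UtteranceBuffer:
    """Bounded sample buffer shared by a capture thread and a recognition thread."""

    def __init__(self, max_samples: int = 16000 * 120):
        self._lock = threading.Lock()
        self._samples: list[float] = []
        self._max_samples = max_samples

    def push(self, chunk: list[float]) -> int:
        """Capture thread: copy samples in and return how many were accepted."""
        with self._lock:
            room = max(self._max_samples - len(self._samples), 0)
            accepted = chunk[:room]
            self._samples.extend(accepted)
            return len(accepted)

    def drain(self) -> list[float]:
        """Recognition thread: swap the buffer out under the lock and return it."""
        with self._lock:
            samples, self._samples = self._samples, []
        # The caller runs inference on `samples` after the lock is released,
        # so push() is never blocked by recognition.
        return samples
```

A recognition thread would call `drain()` when the VAD signals the end of an utterance and pass the returned samples to the model.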

Step 4: Implementing VAD (voice activity detection)

Energy-based VAD (lightweight and practical)

import numpy as np

class EnergyVAD:
    """Voice activity detection using RMS energy and an envelope follower."""

    def __init__(self, threshold=0.015, attack=0.2, release=0.05):
        self.threshold = threshold
        self.attack = attack
        self.release = release
        self.envelope = 0.0

    def is_speech(self, frame: np.ndarray) -> bool:
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms > self.envelope:
            self.envelope += self.attack * (rms - self.envelope)
        else:
            self.envelope += self.release * (rms - self.envelope)
        return self.envelope >= self.threshold

public sealed class SimpleEnergyVad
{
    private readonly double _threshold;
    private readonly double _attack;
    private readonly double _release;
    private double _envelope;

    public SimpleEnergyVad(double threshold = 0.015, double attack = 0.2, double release = 0.05)
    {
        _threshold = threshold;
        _attack = Math.Clamp(attack, 0, 1);
        _release = Math.Clamp(release, 0, 1);
    }

    public bool IsSpeech(ReadOnlySpan<float> frame, int samples)
    {
        if (samples <= 0 || frame.IsEmpty) return false;
        int n = Math.Min(samples, frame.Length);  // never read past the span
        double sum = 0;
        for (int i = 0; i < n; i++)
            sum += frame[i] * frame[i];
        double rms = Math.Sqrt(sum / n);          // divide by the count actually summed
        _envelope = rms > _envelope
            ? _envelope + _attack * (rms - _envelope)
            : _envelope + _release * (rms - _envelope);
        return _envelope >= _threshold;
    }
}
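To see how the attack and release coefficients shape the detector's response, the envelope update can be simulated with constant inputs. The sketch below reuses the update rule and default values from the classes above; the helper names `step_envelope`, `frames_to_onset`, and `frames_to_release` are made up for this example.

```python
def step_envelope(envelope: float, rms: float,
                  attack: float = 0.2, release: float = 0.05) -> float:
    """One envelope-follower update: fast rise (attack), slow fall (release)."""
    coeff = attack if rms > envelope else release
    return envelope + coeff * (rms - envelope)


def frames_to_onset(rms: float, threshold: float = 0.015, attack: float = 0.2,
                    release: float = 0.05, max_frames: int = 1000):
    """Frames of constant-RMS input needed before the envelope reaches the threshold."""
    envelope = 0.0
    for n in range(1, max_frames + 1):
        envelope = step_envelope(envelope, rms, attack, release)
        if envelope >= threshold:
            return n
    return None  # the input is too quiet to ever trip the detector


def frames_to_release(start_envelope: float, threshold: float = 0.015,
                      attack: float = 0.2, release: float = 0.05,
                      max_frames: int = 1000):
    """Frames of pure silence needed before the envelope drops below the threshold."""
    envelope = start_envelope
    for n in range(1, max_frames + 1):
        envelope = step_envelope(envelope, 0.0, attack, release)
        if envelope < threshold:
            return n
    return None


print(frames_to_onset(0.05), frames_to_release(0.05))
```

With the defaults, a steady RMS of 0.05 trips the detector on the second frame, while an envelope that has settled at 0.05 needs 24 silent frames to fall back below the threshold. Multiply by the frame length to get milliseconds; a larger `release` shortens that tail.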

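Silence tracking and end-of-utterance detection

import math

class SilenceTracker:
    def __init__(self, silence_threshold_ms=800, frame_duration_ms=300):
        # Round up to whole frames: 800 ms at 300 ms per frame becomes 3 frames (900 ms)
        self.max_silent_frames = math.ceil(silence_threshold_ms / frame_duration_ms)
        self.silent_frame_count = 0

    def update(self, is_speech: bool) -> bool:
        """Return True once the utterance has ended."""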
        if is_speech:
            self.silent_frame_count = 0
            return False
        self.silent_frame_count += 1
        return self.silent_frame_count >= self.max_silent_frames

    def reset(self):
        self.silent_frame_count = 0

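Tuning the silence threshold

600 to 900 ms works well in practice. Too short and sentences get cut off midway; too long and the response feels sluggish. It helps to make this adjustable through a YAML configuration file.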
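Because silence is counted in whole frames, the effective wait is the configured threshold rounded up to a multiple of the frame length. The following sketch shows the arithmetic; `silence_frames` and `effective_silence_ms` are illustrative helper names.

```python
def silence_frames(silence_ms: int, frame_ms: int) -> int:
    """Whole frames of silence needed to cover at least silence_ms."""
    if silence_ms <= 0 or frame_ms <= 0:
        raise ValueError("silence_ms and frame_ms must be positive")
    return (silence_ms + frame_ms - 1) // frame_ms  # integer ceiling division


def effective_silence_ms(silence_ms: int, frame_ms: int) -> int:
    """Actual wait before an utterance is finalized, after rounding to frames."""
    return silence_frames(silence_ms, frame_ms) * frame_ms


for frame_ms in (240, 300):
    print(frame_ms, silence_frames(800, frame_ms), effective_silence_ms(800, frame_ms))
```

An 800 ms setting therefore behaves as 900 ms with 300 ms frames and as 960 ms with 240 ms frames, which is worth keeping in mind when tuning within the 600 to 900 ms range.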

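Step 5: Windows integration (hotkey and text insertion)

Activation by hotkey

import ctypes
from ctypes import wintypes

MOD_ALT = 0x0001
MOD_CONTROL = 0x0002
VK_V = 0x56
WM_HOTKEY = 0x0312

ctypes.windll.user32.RegisterHotKey(None, 1, MOD_CONTROL | MOD_ALT, VK_V)  # Ctrl+Alt+V, ID 1
# WM_HOTKEY is posted to the registering thread's queue, so pump messages to receive it
msg = wintypes.MSG()
while ctypes.windll.user32.GetMessageW(ctypes.byref(msg), None, 0, 0) > 0:
    if msg.message == WM_HOTKEY:
        print("Hotkey pressed")  # start or stop recording here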

Inserting text through the clipboard

import pyperclip
import keyboard
import time

def commit_text(text: str, restore_delay: float = 1.5):
    """Insert recognized text into the active app via the clipboard, then restore the original contents."""
    original = pyperclip.paste()
    pyperclip.copy(text)
    keyboard.send("ctrl+v")
    time.sleep(restore_delay)
    pyperclip.copy(original)

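Caveats in production

Restoring too early overwrites the clipboard before the paste has happened
Environments such as Remote Desktop need a fallback to the SendInput approach
Use the injected flag so the hotkey hook does not re-capture its own synthetic input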
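The restore-timing caveat is easier to tune and test when the clipboard and key-sending calls are injected rather than hard-wired. The sketch below is one possible shape, not the article's implementation. `commit_with_guarded_restore` and its parameters are hypothetical, and the check that skips restoration when the clipboard has changed is an extra safeguard assumed here, not something the original code does.

```python
import time
from typing import Callable


def commit_with_guarded_restore(
    text: str,
    read_clipboard: Callable[[], str],
    write_clipboard: Callable[[str], None],
    send_paste: Callable[[], None],
    restore_delay: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Paste `text`, then restore the previous clipboard if nothing else changed it.

    Returns True when the original contents were restored.
    """
    original = read_clipboard()
    write_clipboard(text)
    send_paste()
    sleep(restore_delay)  # give the target app time to read the clipboard
    if read_clipboard() != text:
        # Something else was copied in the meantime; do not clobber it
        return False
    write_clipboard(original)
    return True
```

In an application the three callables could be bound to `pyperclip.paste`, `pyperclip.copy`, and `lambda: keyboard.send("ctrl+v")`; in tests they can be replaced with in-memory fakes.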

💡 Best Practices

1. CPU optimization in practice

# localvoice.yaml (example configuration file)
asr:
  engine: "whispercpp"
  model_path: "C:\\ProgramData\\LocalVoice\\models\\ggml-medium-q5_0.bin"
  threads: 4          # physical core count recommended (not logical cores)
  use_gpu: false
  frame_ms: 240

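CPU	Physical cores	Recommended threads	medium-q5 inference time (3 s of audio)
Core i5-1235U	2P+8E	4	~2.5 s
Core i7-13700	8P+8E	8	~1.2 s
Ryzen 5 5600	6	6	~1.8 s
Ryzen 7 7800X3D	8	8	~1.0 s

Keep threads at or below the physical core count

Setting it to the logical core count (including HT/SMT) can actually make inference slower because of context-switch overhead.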
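A thread count can be derived at startup instead of being hard-coded. The following is a minimal sketch: it uses `psutil.cpu_count(logical=False)` when the third-party psutil package happens to be installed, and otherwise falls back to half of `os.cpu_count()`. That fallback is an assumption that only holds for 2-way SMT and will be off on hybrid P-core/E-core CPUs. The function names are illustrative.

```python
import os


def physical_core_count() -> int:
    """Best-effort physical core count."""
    try:
        import psutil  # optional third-party dependency
        cores = psutil.cpu_count(logical=False)
        if cores:
            return cores
    except ImportError:
        pass
    logical = os.cpu_count() or 1
    # Assumption: 2 hardware threads per core (HT/SMT enabled)
    return max(1, logical // 2)


def recommended_threads(limit=None) -> int:
    """Thread count at or below the physical core count, optionally capped."""
    threads = physical_core_count()
    if limit is not None:
        threads = min(threads, limit)
    return max(1, threads)


print(f"physical cores: {physical_core_count()}, threads: {recommended_threads()}")
```

The result can be passed as `cpu_threads` to faster-whisper or written to the `threads` key of the configuration file; measuring a few values around it on the target machine remains the more reliable guide.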

2. Techniques for improving recognition accuracy

Initial prompt (context hint)

segments, _ = model.transcribe(
    audio,
    language="ja",
    initial_prompt="技術的な議論をしています。Kubernetes、Docker、CI/CDパイプライン。",
)

Using the VAD filter

segments, _ = model.transcribe(
    audio, language="ja",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=600,
        speech_pad_ms=200,
        threshold=0.5,
    ),
)

3. Externalizing configuration

# localvoice.yaml
hotkey: "Ctrl+Alt+V"
mode: "hold_to_talk"
language: "ja"

asr:
  engine: "whispercpp"
  model_path: "models/ggml-medium-q5_0.bin"
  threads: 4
  use_gpu: false

vad:
  silence_ms: 800
  energy_threshold: 0.015

commit:
  mode: "clipboard"
  restore_clipboard_ms: 1500

privacy:
  keep_audio: false
  keep_text_after_commit: false
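Once the YAML file has been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`, which is assumed to run upstream and is not shown), the values still need defaults and basic validation. The sketch below handles the `vad` section only. `VadSettings` and `load_vad_settings` are hypothetical names, and the validation bounds are assumptions rather than limits from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadSettings:
    silence_ms: int = 800
    energy_threshold: float = 0.015


def load_vad_settings(config: dict) -> VadSettings:
    """Build VAD settings from a parsed config mapping, falling back to defaults."""
    vad = config.get("vad") or {}
    settings = VadSettings(
        silence_ms=int(vad.get("silence_ms", VadSettings.silence_ms)),
        energy_threshold=float(vad.get("energy_threshold", VadSettings.energy_threshold)),
    )
    # Assumed sanity checks; adjust to the application's needs
    if settings.silence_ms <= 0:
        raise ValueError("vad.silence_ms must be positive")
    if not 0.0 < settings.energy_threshold < 1.0:
        raise ValueError("vad.energy_threshold must be between 0 and 1")
    return settings


print(load_vad_settings({"vad": {"silence_ms": 600}}))
```

Missing keys fall back to the defaults shown in the configuration file above, so a partial file stays valid.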

🚀 Advanced Usage

Multi-engine support: Sherpa-ONNX

using SherpaOnnx;

var config = new OfflineRecognizerConfig();
config.ModelConfig.Tokens = @"models\sensevoice\tokens.txt";
config.ModelConfig.NumThreads = 4;
config.ModelConfig.Provider = "cpu";
config.ModelConfig.SenseVoice.Model = @"models\sensevoice\model.int8.onnx";
config.ModelConfig.SenseVoice.Language = "ja";
config.ModelConfig.SenseVoice.UseInverseTextNormalization = 1;

using var recognizer = new OfflineRecognizer(config);
using var stream = recognizer.CreateStream();
stream.AcceptWaveform(16000, audioSamples);
recognizer.Decode(stream);
Console.WriteLine(stream.Result.Text);

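SenseVoice vs Whisper

SenseVoice is a lightweight model specialized for Japanese, English, and Chinese, and it can be faster than Whisper on short utterances. It also has built-in inverse text normalization, which converts numbers and dates into their natural written form.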
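Switching engines from configuration is simpler when every engine sits behind one small interface. The sketch below shows a registry keyed by the `engine` value from the config file. Everything in it is hypothetical: the `AsrEngine` protocol, the registry functions, the `"sensevoice"` key, and the `EchoEngine` stand-in, which only reports the sample count where a real engine would wrap whisper.cpp or Sherpa-ONNX.

```python
from typing import Callable, Protocol, Sequence


class AsrEngine(Protocol):
    def transcribe(self, samples: Sequence[float]) -> str: ...


_ENGINE_FACTORIES: dict[str, Callable[[], AsrEngine]] = {}


def register_engine(name: str, factory: Callable[[], AsrEngine]) -> None:
    """Make an engine available under a config name."""
    _ENGINE_FACTORIES[name] = factory


def create_engine(name: str) -> AsrEngine:
    """Instantiate the engine selected in the configuration."""
    try:
        factory = _ENGINE_FACTORIES[name]
    except KeyError:
        available = ", ".join(sorted(_ENGINE_FACTORIES))
        raise ValueError(f"unknown ASR engine {name!r}; available: {available}") from None
    return factory()


class EchoEngine:
    """Stand-in engine for this example only."""

    def __init__(self, label: str):
        self.label = label

    def transcribe(self, samples: Sequence[float]) -> str:
        return f"{self.label}: {len(samples)} samples"


register_engine("whispercpp", lambda: EchoEngine("whispercpp"))
register_engine("sensevoice", lambda: EchoEngine("sensevoice"))

print(create_engine("whispercpp").transcribe([0.0] * 16000))
```

With this shape, the rest of the application depends only on `transcribe`, so trying SenseVoice for short utterances becomes a one-line config change.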

Batch processing: converting multiple files at once

from pathlib import Path
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")

for audio_file in Path("./recordings").glob("*.wav"):
    segments, _ = model.transcribe(str(audio_file), language="ja", vad_filter=True)
    text = "".join(s.text for s in segments)
    audio_file.with_suffix(".txt").write_text(text, encoding="utf-8")
    print(f"✅ {audio_file.name} → {audio_file.stem}.txt")

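⚠️ Troubleshooting

Problem	Cause	Fix
`ModuleNotFoundError: No module named 'faster_whisper'`	Package not installed	`pip install faster-whisper`
`FileNotFoundError: ffmpeg`	FFmpeg not installed	`winget install Gyan.FFmpeg`, then add it to PATH
Empty results / hallucinations	Silent audio fed to the model	Enable the VAD filter and tune the threshold
Out of memory (OOM)	Model too large	Switch to `small` or use a quantized model
Slow processing	Wrong thread count	Check the physical core count with `wmic cpu get NumberOfCores` (or `(Get-CimInstance Win32_Processor).NumberOfCores` in PowerShell, since wmic is deprecated on recent Windows)
Garbled Japanese output	Automatic language detection picked the wrong language	Pass `language="ja"` explicitly
Clipboard restore fails	Restore happens too early	Raise `restore_clipboard_ms` to 2000 or more

NumPy compatibility issues

# Workaround when a dependency fails to import under NumPy 2.x
pip install "numpy<2.0"

Countering Whisper hallucinations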

def filter_hallucination(text: str) -> str | None:
    if not text or len(text.strip()) < 2:
        return None
    if len(set(text.strip())) <= 2:
        return None
    hallucination_patterns = ["ご視聴ありがとうございました", "チャンネル登録", "字幕"]
    if any(p in text for p in hallucination_patterns):
        return None
    return text

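Countermeasures in priority order:

1. Use VAD so that silent stretches never reach the ASR engine
2. Ignore audio shorter than 0.5 seconds
3. Skip chunks whose RMS is below the threshold
4. Discard results that are extremely short or consist of repeated patterns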
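The checks that run around the ASR call (the duration gate, the RMS gate, and the repetition filter) can be combined into two small helpers. This is a sketch using the thresholds that appear elsewhere in the article (0.5 s, RMS 0.015); the function names are illustrative and the code uses plain Python lists instead of NumPy to stay dependency-free.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_transcribe(samples: Sequence[float], sample_rate: int = 16000,
                      min_seconds: float = 0.5, rms_threshold: float = 0.015) -> bool:
    """Gate applied before calling the model: long enough and loud enough."""
    if len(samples) < sample_rate * min_seconds:
        return False
    return rms(samples) >= rms_threshold


def looks_like_hallucination(text: str) -> bool:
    """Gate applied after the model: too short, or made of at most two distinct characters."""
    stripped = text.strip()
    return len(stripped) < 2 or len(set(stripped)) <= 2
```

Running `should_transcribe` first avoids spending inference time on audio that would most likely be discarded afterwards anyway.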

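Summary

Component	Recommended setup
Runtime	whisper.cpp (desktop) / faster-whisper (scripts)
Model	medium-q5 (balanced) / large-v3-turbo (high accuracy)
VAD	Energy-based (lightweight) + silence tracker
Text insertion	Clipboard paste (compatibility first)
Configuration	External YAML file
Privacy	No persistence of audio or text

For readers considering internal deployment

Fully offline voice input is most valuable in corporate environments with strict security requirements. For enterprise considerations such as MSI packaging, locking settings through organizational policy, and audit logging via the Windows Event Log, see requirements.md in the project repository.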