ADR: Transcription Provider Migration to Deepgram Nova-3

Status: Accepted (March 2024)
Issue: TF-272
Related: TF-260 (Continuous Dictation)

Context

Verba originally used OpenAI's Whisper API (whisper-1 model) for speech-to-text transcription. During development of the Continuous Dictation feature (TF-260), a fundamental limitation was discovered: Whisper hallucinates on short audio segments, producing completely fabricated text.

The Problem

When audio is split into segments (5-15 seconds each, separated by natural speech pauses), Whisper produces non-deterministic hallucinations on 10-20% of segments:

| Actual Speech | Whisper Output |
| --- | --- |
| German legal text about company branches | "Microsoft Office Word Document MSWordDoc Word.Document.8" |
| German legal text about business activities | "In this video tutorial we've installed Ubuntu 17.04..." |
| German legal text about mining companies | "on, in, in, in, Solutions, eBay, FidelityCode..." |
| Silence after pause | "Thank you for watching!" |

These hallucinations occur on real speech (not just silence), making them impossible to filter without discarding legitimate transcriptions. This is a known limitation of the Whisper model architecture, not a Verba code bug.

Additional Issue: Audio Pipeline Latency

The initial continuous dictation approach used ffmpeg's silencedetect filter to identify pauses, then extracted segments from the recording file. ffmpeg's internal pipeline introduces 2-5 seconds of latency between detected silence timestamps and actual file content, causing segment extraction timing issues.
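
For reference, a minimal sketch of the abandoned approach (file name, noise threshold, and parsing are illustrative assumptions, not Verba's actual code): silencedetect logs pause boundaries on ffmpeg's stderr, and segment cut points were derived from those timestamps.

```ts
// Sketch of the abandoned pipeline: run silencedetect over the growing
// recording and parse pause timestamps from ffmpeg's stderr log lines.
import { spawn } from "node:child_process";

const ffmpeg = spawn("ffmpeg", [
  "-i", "recording.wav",                      // illustrative path
  "-af", "silencedetect=noise=-30dB:d=0.5",   // pause = >=0.5s below -30dB
  "-f", "null", "-",
]);

ffmpeg.stderr.on("data", (chunk: Buffer) => {
  // silencedetect logs lines like "silence_start: 5.23" and
  // "silence_end: 6.01 | silence_duration: 0.78"
  for (const m of chunk.toString().matchAll(/silence_(start|end): ([\d.]+)/g)) {
    console.log(m[1], Number(m[2]));
    // Cut points derived from these timestamps lagged the actual file
    // content by 2-5 seconds, which is the latency problem described above.
  }
});
```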

Over 10 iterations of attempted mitigations (raw PCM, flush packets, file-growth polling, byte-level extraction, WAV format switching, prompt context, volume detection) failed to resolve the core hallucination problem.

Alternatives Evaluated

| Provider | Hallucinations | Streaming | Built-in VAD | Price/min | Language Support | Integration Effort |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI Whisper API | Frequent, severe | No | No | $0.006 | Excellent | Existing |
| Deepgram Nova-3 | Rare (built-in VAD) | Yes (WebSocket) | Yes | $0.0043 | Good (100+ langs) | Medium |
| Google Cloud STT v2 | Rare | Yes (gRPC) | Yes | $0.006 | Excellent | Large (gRPC, auth) |
| AssemblyAI | Rare | Yes (WebSocket) | Yes | $0.010 | Good | Medium |
| Azure Speech Services | Rare | Yes | Yes | $0.010 | Excellent | Medium (Azure SDK) |
| Groq Whisper | Same as Whisper | No | No | Cheaper | Same | Small (API compat) |
| Local whisper.cpp | Same as Whisper | No | Threshold only | Free | Same | Existing |

An alternative architecture (stop-and-restart ffmpeg at each pause) was also considered but rejected because it still relies on Whisper and would not eliminate the hallucination problem.

Decision

Deepgram Nova-3 was selected as the new transcription provider for both single-shot and continuous dictation.

Key Reasons

  1. Built-in VAD -- Deepgram's Voice Activity Detection handles pause detection and utterance segmentation at the API level. This eliminates the need for ffmpeg silencedetect, segment extraction, and all associated timing issues.

  2. Minimal hallucinations -- Nova-3's architecture produces significantly fewer fabricated outputs compared to Whisper, especially on short segments.

  3. WebSocket streaming -- Enables true real-time continuous dictation. Audio is piped from ffmpeg directly to the Deepgram WebSocket, and completed utterances are emitted as events. No file I/O, no segment extraction (see the sketch after this list).

  4. Lower cost -- $0.0043/min vs $0.006/min (28% reduction).

  5. Shared SDK -- Both single-shot (transcribeFile) and continuous (LiveClient WebSocket) use the same @deepgram/sdk package and API key.
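
As a rough sketch of this architecture (input device, sample rate, language, and event handling are illustrative assumptions rather than Verba's actual code), the continuous path pipes ffmpeg's raw PCM stdout straight into a Deepgram live connection:

```ts
// Minimal sketch of the continuous-dictation pipeline: ffmpeg captures
// the microphone and writes raw PCM to stdout, which is forwarded
// directly to a Deepgram live WebSocket.
import { spawn } from "node:child_process";
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

// Open a streaming connection; Deepgram's built-in VAD segments utterances.
const live = deepgram.listen.live({
  model: "nova-3",
  language: "de",
  encoding: "linear16",   // raw 16-bit PCM from ffmpeg
  sample_rate: 16000,
  interim_results: false, // only emit finalized utterances
});

live.on(LiveTranscriptionEvents.Open, () => {
  // Capture the default microphone (macOS avfoundation shown; the input
  // device flag is platform-specific) and emit raw PCM on stdout.
  const ffmpeg = spawn("ffmpeg", [
    "-f", "avfoundation", "-i", ":0",
    "-ac", "1", "-ar", "16000",
    "-f", "s16le", "pipe:1",
  ]);
  // No files, no silencedetect: each stdout chunk goes straight to the socket.
  ffmpeg.stdout.on("data", (chunk: Buffer) => live.send(chunk));
});

live.on(LiveTranscriptionEvents.Transcript, (event) => {
  const text = event.channel?.alternatives?.[0]?.transcript;
  if (text) console.log("utterance:", text);
});
```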

Consequences

Positive

  • Continuous dictation works reliably (no hallucinations, no timing issues)
  • Single-shot transcription cost reduced by 28%
  • Simpler architecture for continuous mode (ffmpeg stdout pipe to WebSocket)
  • Single API key for all transcription modes

Negative

  • New API key required (Deepgram instead of OpenAI for transcription)
  • Users with "provider": "openai" in settings must update to "deepgram"
  • openai npm package is still required for embeddings (context search)
  • Glossary integration changes from Whisper's prompt parameter to Deepgram's keywords parameter, with a different token budget (~224 tokens for the Whisper prompt vs ~300 for Deepgram keywords); see the sketch below
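
A minimal sketch of the glossary change, assuming the keywords parameter accepts term:boost strings as documented for earlier Nova models (the terms, boost values, and file path shown are hypothetical):

```ts
// Hypothetical glossary wiring: boost domain terms via Deepgram's
// "keywords" query parameter (term:boost strings) instead of packing
// them into Whisper's prompt.
import { readFile } from "node:fs/promises";
import { createClient } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
  await readFile("dictation.wav"),
  {
    model: "nova-3",
    language: "de",
    keywords: ["Handelsregister:2", "Prokura:2"], // glossary term:boost pairs
  },
);
if (error) throw error;
console.log(result.results.channels[0].alternatives[0].transcript);
```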

Neutral

  • Local offline transcription via whisper.cpp remains unchanged
  • Claude post-processing pipeline is not affected