What does '99% accurate' actually measure?
It measures clean, read-aloud speech, not the messy audio you're recording. The near-perfect figures quoted in speech-to-text marketing trace back to benchmarks like LibriSpeech, which is built from read passages of public-domain audiobooks, carefully segmented and aligned. That's one narrator, no crosstalk, studio-quiet – roughly the easiest possible input.
Your interview is the opposite. The moment you move to spontaneous conversation, the numbers change. On the Switchboard conversational test set, professional human transcribers scored about 5.9% word error and Microsoft's system 5.8%. Both roughly doubled, to about 11%, on the harder CallHome calls. So 5–6% is close to the floor on real conversation, even at human parity.
The gap matters because a benchmark score isn't a promise about your file. Read speech and conversational speech are different problems. When a tool advertises a single accuracy number, assume it's the read-speech ceiling – and expect real dialogue to sit lower.
How is transcription accuracy actually scored?
With word error rate (WER), the standard metric. Jurafsky and Martin's Speech and Language Processing defines WER as 100 times the sum of insertions, substitutions, and deletions, divided by the number of words in the correct transcript. The two transcripts are aligned by minimum edit distance, then scored, typically with NIST's free sclite tool.
A 5% WER means roughly one word in twenty is wrong: substituted, dropped, or added. One quirk trips people up: WER can exceed 100%. Inserted words – text the system added that no one said – count as errors but don't add to the reference word count, so a transcript that invents enough content can score worse than a blank page, which scores exactly 100%.
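To make the scoring concrete, here is a minimal Python sketch of WER as defined above. It is an illustration, not a replacement for sclite: it assumes unit cost for every edit, splits on whitespace, and skips the text normalization that real scoring pipelines usually apply. The function name and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: 100 * (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Row-by-row Levenshtein table over words. At the start of each outer
    # iteration, prev[j] is the minimum number of edits needed to turn the
    # reference words seen so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# A blank transcript drops every word: 6 deletions over 6 reference words.
print(word_error_rate("the cat sat on the mat", ""))  # 100.0

# Four invented words against a two-word reference: 4 insertions over 2 words.
print(word_error_rate("thank you", "thank you so much for watching"))  # 200.0
```

The second call is the over-100% case: the system got both real words right and still scored worse than an empty page, because the denominator counts only what was actually said.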
A single percentage also hides where the errors land. WER treats a missed 'uh' and a mangled surname as equal, but they aren't equal to you. Ten filler-word misses won't hurt a quote. One wrong name in the sentence you publish will. Judge accuracy by the errors that reach your final text, not the aggregate.
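One way to look past the aggregate is to check only the words you can't afford to get wrong. The sketch below assumes you have hand-corrected a short excerpt to use as a reference and that you supply your own list of single-word key terms (names, figures); the function name and sample text are hypothetical.

```python
import re
from collections import Counter


def missed_key_terms(reference: str, hypothesis: str, key_terms: list[str]) -> list[str]:
    """Return the key terms that appear more often in the reference than in the hypothesis."""
    def counts(text: str) -> Counter:
        # Lowercase and keep runs of letters, digits, and apostrophes as words.
        return Counter(re.findall(r"[a-z0-9']+", text.lower()))

    ref_counts, hyp_counts = counts(reference), counts(hypothesis)
    return [t for t in key_terms if hyp_counts[t.lower()] < ref_counts[t.lower()]]


reference = "uh Dr Okafor said um revenue rose 14 percent"
hypothesis = "Dr Okafor said revenue rose 40 percent"
print(missed_key_terms(reference, hypothesis, ["Okafor", "14"]))  # ['14']
```

The two dropped fillers go unreported here, while the one wrong figure, the error that would actually damage a quote, is the only thing flagged.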
Why does your audio score worse than the benchmark?
Because real speech carries accents, dialects, overlap, and noise the benchmark strips out. Accent alone is measurable: across three major cloud services, accuracy was substantially worse for non-native English speakers. Measured as word information lost, a metric where lower is better, first-language-English speakers scored about 0.14 lower on average, with Mandarin, Spanish, and Russian speakers hit hardest.
Dialect widens the gap further. A 2020 PNAS study spanning five commercial systems from Amazon, Apple, Google, IBM, and Microsoft found word error roughly twice as high for Black speakers as for white speakers – an average word error rate of 0.35 versus 0.19, or 35% versus 19%. Same recording quality, very different result depending on who's speaking.
Then there's structure. Two people talking over each other is the hardest case for both accuracy and speaker labeling, because the model has to untangle who said what while it transcribes. Recording each speaker on a separate channel is the most effective input-side fix you control.
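If your recorder already saved two microphones as the left and right channels of one stereo file, you can split them so each speaker is transcribed separately. The following is a minimal sketch using Python's standard wave module. It assumes an uncompressed PCM WAV file with exactly two channels; the function name and file paths are placeholders, and whether your transcription tool accepts per-speaker files is something to check in its own documentation.

```python
import wave


def split_stereo(stereo_path: str, left_path: str, right_path: str) -> None:
    """Write each channel of a stereo PCM WAV file to its own mono WAV file."""
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a two-channel (stereo) WAV file")
        width = src.getsampwidth()  # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Stereo PCM is interleaved: each frame is one left sample, then one right sample.
    frame_size = 2 * width
    offsets = range(0, len(frames), frame_size)
    left = b"".join(frames[i:i + width] for i in offsets)
    right = b"".join(frames[i + width:i + frame_size] for i in offsets)

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(data)
```

A call such as split_stereo("interview.wav", "host.wav", "guest.wav") leaves the original untouched and produces two mono files of the same length and sample rate.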
Can AI invent words that were never said?
Yes – and this failure looks nothing like a typo. In an audit that ran 13,140 audio segments through OpenAI's Whisper, roughly 1% of transcriptions contained entirely hallucinated phrases that appeared nowhere in the audio. Of those hallucinations, 38% carried harmful content, including violent language in about 19%.
Hallucinations are dangerous precisely because they're fluent. A garbled word looks wrong and gets caught. An invented sentence reads smoothly, passes spellcheck, and survives a quick proofread. That's the error most likely to slide into a published quote unnoticed, so it's the one to hunt for on purpose.
The defense is verification, not trust. Before you cite any line, pull the quote and check it against the source audio. If a passage sounds too clean or too on-topic for the moment it appears, treat it as suspect and listen back.
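Listening back is faster when you can jump straight to the right moment. The sketch below assumes your transcription tool exports word-level timestamps; the Word structure, the function name, and the two-second padding are hypothetical choices for illustration, not any particular tool's format.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Word(NamedTuple):
    text: str
    start: float  # seconds from the start of the recording
    end: float


def _normalize(token: str) -> str:
    # Lowercase and drop punctuation so "that." matches "that".
    return "".join(ch for ch in token.lower() if ch.isalnum() or ch == "'")


def find_quote_window(
    words: Sequence[Word], quote: str, padding: float = 2.0
) -> Optional[Tuple[float, float]]:
    """Return the (start, end) seconds to replay for the first match of `quote`, or None."""
    target = [t for t in (_normalize(w) for w in quote.split()) if t]
    tokens = [_normalize(w.text) for w in words]
    n = len(target)
    if n == 0:
        return None
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            # Pad both sides so you hear the context around the quote.
            return (max(0.0, words[i].start - padding), words[i + n - 1].end + padding)
    return None


transcript = [
    Word("We", 12.0, 12.25),
    Word("never", 12.25, 12.5),
    Word("approved", 12.5, 13.0),
    Word("that.", 13.0, 13.5),
]
print(find_quote_window(transcript, "never approved that"))  # (10.25, 15.5)
```

Finding the quote in the transcript proves nothing about the audio; the window only tells you where to listen. If those seconds don't contain the words, you've caught a hallucination.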
So how accurate is AI transcription for your own recording?
Plan for high accuracy on clean audio and a working draft on everything else. On a close-mic'd, low-noise recording of clear speakers, expect to change a handful of words per minute. On a noisy, multi-speaker, accented recording, expect more – the conversational 5–6% word-error floor only holds when conditions are good.
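To translate a word error rate into editing load, multiply it by how fast people talk. The sketch below assumes a speaking rate of 150 words per minute, which is a round illustrative figure rather than a measurement of your recording; swap in your own.

```python
def corrections_per_minute(wer_percent: float, words_per_minute: float = 150.0) -> float:
    """Estimate words needing correction per minute of audio at a given WER."""
    return wer_percent * words_per_minute / 100.0


print(corrections_per_minute(5.0))   # 7.5
print(corrections_per_minute(11.0))  # 16.5
```

At the assumed rate, the 5% conversational floor works out to about seven or eight fixes per minute, and a one-hour interview at that rate is roughly 450 corrections, which is why the first pass is a draft and not a finished transcript.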
That's why the practical answer is a workflow, not a percentage. Let AI do the bulk first pass, then spend your attention where machines fail: names, numbers, jargon, and crosstalk. If you're doing this end to end, the interview transcription workflow walks through the full process, and there are concrete ways to improve transcription accuracy before you even hit upload.
In practice, the input decides almost everything. A quiet room and separated speakers routinely turn a mediocre transcript into a near-clean one, while no amount of model choice rescues a phone left across a noisy table. Fix the audio first; the accuracy follows.