Pepys

How accurate is AI transcription?

What the benchmarks really measure, where AI slips, and how accurate to expect a transcript of your own recording to be.

The short answer

AI transcription accuracy depends on your audio, not the marketing number. Near-99% figures come from clean, read-aloud benchmarks. On real conversation, the best systems and professional human transcribers both land around 5–6% word error rate, roughly one word in twenty. Accents, crosstalk, and noise push errors higher, and models occasionally hallucinate phrases no one said.

What does '99% accurate' actually measure?

It measures clean, read-aloud speech, not the messy audio you're recording. The near-perfect figures quoted in speech-to-text marketing trace back to benchmarks like LibriSpeech, which is built from read passages of public-domain audiobooks, carefully segmented and aligned. That's one narrator, no crosstalk, studio-quiet – roughly the easiest possible input.

Your interview is the opposite. The moment you move to spontaneous conversation, the numbers change. On the Switchboard conversational test set, professional human transcribers scored about 5.9% word error and Microsoft's system 5.8%. Both figures roughly doubled, to about 11%, on the harder CallHome calls. So 5–6% is close to the floor on real conversation, even at human parity.

The gap matters because a benchmark score isn't a promise about your file. Read speech and conversational speech are different problems. When a tool advertises a single accuracy number, assume it's the read-speech ceiling – and expect real dialogue to sit lower.

How is transcription accuracy actually scored?

With word error rate (WER), the standard metric. Jurafsky and Martin's Speech and Language Processing defines WER as 100 times the sum of insertions, substitutions, and deletions, divided by the number of words in the correct transcript. The two transcripts are first aligned by minimum edit distance; the errors are then counted, typically with NIST's free sclite tool.

One quirk trips people up: WER can exceed 100%. Because inserted words – text the system added that no one said – count as errors, a transcript that invents enough content can score worse than a blank page. A 5% WER means one word in twenty is wrong: substituted, dropped, or added.
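The definition is easy to check by hand. The sketch below is a minimal illustration, not a substitute for sclite: it splits on whitespace, lowercases, and skips the text normalization a real scoring tool applies. The function name `word_error_rate` and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: 100 * (S + D + I) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = minimum edits to turn the reference words seen so far
    # into the first j hypothesis words (word-level edit distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# A blank transcript drops every word: exactly 100%.
blank = word_error_rate("thanks", "")

# Four invented words against a one-word reference: 400%.
invented = word_error_rate("thanks", "thanks for watching please subscribe")
```

In this toy case the system that invents four words scores worse than the one that returns nothing, which is the over-100% quirk in miniature.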

A single percentage also hides where the errors land. WER treats a missed 'uh' and a mangled surname as equal, but they aren't equal to you. Ten filler-word misses won't hurt a quote. One wrong name in the sentence you publish will. Judge accuracy by the errors that reach your final text, not the aggregate.
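One rough way to act on this is to list the differences between a corrected passage and the machine draft, then set aside the ones that touch only filler words. The sketch below uses Python's standard `difflib.SequenceMatcher` for the comparison, which is a convenient diff rather than the minimum-edit-distance alignment WER scoring uses. The filler list, the function name `content_errors`, and the sample sentences are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical filler list; adjust it to your own style guide.
FILLERS = {"uh", "um", "er", "hmm"}


def content_errors(reference: str, hypothesis: str) -> list[tuple[str, str]]:
    """List (reference words, transcript words) differences that are not pure filler."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)

    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        ref_words, hyp_words = ref[i1:i2], hyp[j1:j2]
        # Skip differences made up entirely of filler words.
        if all(word in FILLERS for word in ref_words + hyp_words):
            continue
        flagged.append((" ".join(ref_words), " ".join(hyp_words)))
    return flagged


flagged = content_errors(
    "uh we spoke to dr okafor on monday",
    "we spoke to dr o'connor on monday",
)
```

Here the dropped "uh" is ignored and only the mangled surname is flagged, which is the error that would matter in a published quote.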

Why does your audio score worse than the benchmark?

Because real speech carries accents, dialects, overlap, and noise the benchmark strips out. Accent alone is measurable: across three major cloud services, accuracy was substantially worse for non-native English speakers. Measured as word information lost, where lower is better, first-language-English speakers scored about 0.14 better on average, with Mandarin, Spanish, and Russian speakers hit hardest.

Dialect widens the gap further. A 2020 PNAS study spanning five commercial systems from Amazon, Apple, Google, IBM, and Microsoft found word error roughly twice as high for Black speakers as white speakers – an aggregate 0.35 versus 0.19. Same recording quality, very different result depending on who's speaking.

Then there's structure. Two people talking over each other is the hardest case for both accuracy and speaker labeling, because the model has to untangle who said what while it transcribes. Recording each speaker on a separate channel is the most effective input-side fix you control.

Can AI invent words that were never said?

Yes – and this failure looks nothing like a typo. In an audit that ran 13,140 audio segments through OpenAI's Whisper, roughly 1% of transcriptions contained entirely hallucinated phrases that appeared nowhere in the audio. Of those hallucinations, 38% carried harmful content, including violent language in about 19%.

Hallucinations are dangerous precisely because they're fluent. A garbled word looks wrong and gets caught. An invented sentence reads smoothly, passes spellcheck, and survives a quick proofread. That's the error most likely to slide into a published quote unnoticed, so it's the one to hunt for on purpose.

The defense is verification, not trust. Before you cite any line, pull the quote and check it against the source audio. If a passage sounds too clean or too on-topic for the moment it appears, treat it as suspect and listen back.

So how accurate is AI transcription for your own recording?

Plan for high accuracy on clean audio and a working draft on everything else. On a close-mic'd, low-noise recording of clear speakers, expect to change a handful of words per minute. On a noisy, multi-speaker, accented recording, expect more – the conversational 5–6% word-error floor only holds when conditions are good.

That's why the practical answer is a workflow, not a percentage. Let AI do the bulk first pass, then spend your attention where machines fail: names, numbers, jargon, and crosstalk. If you're doing this end to end, the interview transcription workflow walks the full process, and there are concrete ways to improve transcription accuracy before you even hit upload.

In practice, the input decides almost everything. A quiet room and separated speakers routinely turn a mediocre transcript into a near-clean one, while no amount of model choice rescues a phone left across a noisy table. Fix the audio first; the accuracy follows.
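To turn a word error rate into editing effort, a back-of-envelope calculation helps. The sketch below assumes a speaking rate of 150 words per minute, which is an illustrative figure and not one taken from the studies cited here; the function names are also made up. It treats each error as one correction, which is only approximate.

```python
def corrections_per_minute(wer_percent: float, words_per_minute: float = 150.0) -> float:
    """Estimate corrections per minute of audio at a given WER.

    The 150 words-per-minute default is an assumption; measure your own recording.
    """
    return words_per_minute * wer_percent / 100.0


def corrections_for_recording(wer_percent: float, minutes: float,
                              words_per_minute: float = 150.0) -> float:
    """Estimate total corrections for a recording of the given length."""
    return corrections_per_minute(wer_percent, words_per_minute) * minutes


per_minute = corrections_per_minute(5.0)            # 7.5 at the assumed rate
per_hour = corrections_for_recording(5.0, 60.0)     # 450 over a one-hour interview
```

Under these assumptions, a 5% error rate works out to roughly seven or eight fixes per minute, or several hundred across an hour of conversation, which is why targeted review beats rereading everything.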

Tips from people who do this a lot

  • A single 'word accuracy' percentage hides where errors cluster – one botched name in a pull-quote hurts more than ten missed filler words.

  • Read versus conversational is the sharpest divider: a lecture read from notes scores far better than a three-person debate over one mic.

  • Watch for fluent nonsense. Hallucinated text reads smoothly and passes spellcheck, so it survives editing that a garbled word never would.

  • Non-native and accented speech scores measurably worse on every major engine, so budget extra review time when a speaker's first language isn't English.

  • WER can exceed 100% because insertions count – a system that adds words no one said can score worse than a blank transcript.

How accurate is AI transcription – questions, answered

How accurate is AI transcription, in a number?

On clean, read-aloud audio, top systems reach the low single-digit error rates seen on research benchmarks. On spontaneous conversation, both the best automated systems and professional human transcribers sit near 5–6% word error on the Switchboard test set – about one word in twenty wrong.

What is word error rate (WER)?

WER is the standard accuracy metric: insertions plus substitutions plus deletions, divided by the number of words in the correct transcript, scored by aligning the two transcripts. Because insertions count, it can exceed 100%. NIST's free sclite script is the reference tool for computing it.

Why is AI less accurate on some speakers?

Accent and dialect matter. One 2020 study across five commercial systems found word error roughly twice as high for Black speakers as white speakers (0.35 versus 0.19). Non-native English speech also scores substantially worse, so plan extra review for accented or multilingual recordings.

Can AI transcription make up words that were never said?

Yes. An audit of OpenAI's Whisper found roughly 1% of audio segments contained entirely hallucinated phrases absent from the recording, and 38% of those hallucinations carried harmful content. Hallucinations read fluently, so verify anything you'll quote against the source audio.

How do I get a more accurate transcript?

Fix the input first: close mics, low background noise, and one speaker per channel do more than any software setting. Then review the spots AI struggles with – names, numbers, jargon, and crosstalk – and check every quote against the audio before citing it.

References

  1. Panayotov et al. (2015), LibriSpeech: an ASR corpus based on public domain audio books. OpenSLR / IEEE ICASSP.
  2. Xiong et al. (2016), Achieving Human Parity in Conversational Speech Recognition. Microsoft Research / arXiv.
  3. DiChristofano et al. (2022), Global Performance Disparities Between English-Language Accents in ASR. Washington University in St. Louis / arXiv.
  4. Koenecke et al. (2020), Racial disparities in automated speech recognition. PNAS (peer-reviewed).
  5. Koenecke et al. (2024), Careless Whisper: Speech-to-Text Hallucination Harms. ACM FAccT 2024 (peer-reviewed).
  6. Jurafsky & Martin, Speech and Language Processing (3rd ed. draft), Ch. 15. Stanford (jurafsky/slp3).
