Transcription accuracy, by the numbers
On clean conversational speech, the best AI speech recognition now rivals professional human transcribers, with word error rates near 5 to 6 percent on standard benchmarks. Accuracy drops with accents, background noise, and overlapping speakers. The figures below are drawn from primary research, each with its source cited.
- 5.8% machine vs 5.9% human WER
Microsoft's automated system reached a 5.8% word error rate on Switchboard, edging past the 5.9% error rate of professional human transcribers – the original 'human parity' result. Switchboard portion of the NIST 2000 (Hub5'00) conversational telephone speech test set; Microsoft 2016 system (Xiong et al.)
Achieving Human Parity in Conversational Speech Recognition (Xiong et al., Microsoft) (2016)
- 11.0% machine vs 11.3% human WER
On the harder CallHome (open-ended friends/family) portion, Microsoft's system scored 11.0% WER versus 11.3% for human transcribers. CallHome portion of the NIST 2000 (Hub5'00) test set; Microsoft 2016 system (Xiong et al.)
Achieving Human Parity in Conversational Speech Recognition (Xiong et al., Microsoft) (2016)
- 5.1% WER
Microsoft's 2017 system lowered the Switchboard word error rate to 5.1%, a new state of the art at the time. Switchboard portion of the NIST 2000 (Hub5'00) evaluation set; Microsoft 2017 system (Xiong et al.)
The Microsoft 2017 Conversational Speech Recognition System (Xiong et al.) (2017)
- 5.5%/10.3% machine; 5.1%/6.8% human WER
IBM's ASR system reached 5.5% WER on Switchboard and 10.3% on CallHome; IBM's own human-transcription study measured the best human transcriber at 5.1% (Switchboard) and 6.8% (CallHome), revising the human benchmark lower than Microsoft's 5.9%. Switchboard/CallHome subsets of the NIST Hub5 2000 evaluation; IBM 2017 system with human study by Appen (Saon et al., INTERSPEECH 2017)
English Conversational Telephone Speech Recognition by Humans and Machines (Saon et al., IBM) (2017)
- 0.35 vs 0.19 WER (Black vs white speakers)
Commercial ASR systems averaged a word error rate of 0.35 (35%) for Black speakers versus 0.19 (19%) for matched white speakers, nearly double, a dialect/accent gap traced to acoustic models under-trained on Black speech. Average across 5 commercial ASR systems (Amazon, Apple, Google, IBM, Microsoft) on matched interview speech from the CORAAL and VOC corpora
Racial disparities in automated speech recognition (Koenecke et al., PNAS) (2020)
- 11.8% to 43.3% WER
A conventional single-talker ASR system's word error rate rose from 11.8% with no overlapping speech to 43.3% at a 40% overlap ratio, nearly quadrupling and quantifying the cost of multiple speakers talking at once. Single-channel, no speech-separation front end, on the LibriCSS meeting corpus: 0% vs 40% speaker-overlap ratio
Continuous Speech Separation: Dataset and Analysis (Chen et al., ICASSP 2020, Microsoft) (2020)
- >15x, sometimes >30x real time
Careful manual annotation for speaker diarization typically takes more than 15 times real time and sometimes more than 30 times (that is, 15 to 30 or more hours of annotator work per hour of audio), which is why the DIHARD III organizers abandoned fully manual segmentation. Manual spectrogram-based diarization/segmentation annotation, DIHARD III corpus preparation (11 diverse domains)
The Third DIHARD Diarization Challenge (Ryant et al., Interspeech 2021) (2021)
- 30 hours (lab) / 36 hours (field) per audio hour
Producing accurate human-corrected transcripts of a predominantly oral, low-resource language took on average 30 hours of human labor per hour of speech in the lab and 36 hours per hour in the field. Human correction of ASR output for Bambara; one-month field study, 10 native transcribers, 53 hours of speech
Cost Analysis of Human-corrected Transcription for Predominately Oral Languages (arXiv:2510.12781) (2025)
- 13.45% DER (Track 1)
The best single system's diarization error rate on Track 1 (with reference speech activity marks) fell to 13.45%, a 43% relative improvement over the 23.73% best system of DIHARD I. Best single submission, DIHARD III core evaluation set, Track 1 (reference SAD provided)
The Third DIHARD Diarization Challenge (Ryant et al., Interspeech 2021 / arXiv:2012.01477) (2021)
- 19.37% DER (Track 2)
On the harder Track 2 (diarization from scratch, no reference speech activity), the best single system reached 19.37% DER, down about 45% in relative terms from DIHARD I's 35.51%. Best single submission, DIHARD III core evaluation set, Track 2 (diarization from scratch)
The Third DIHARD Diarization Challenge (Ryant et al., Interspeech 2021 / arXiv:2012.01477) (2021)
- 14.09% DER (Track 1 core; 11.58% full)
The Hitachi-JHU end-to-end + x-vector system, combined via DOVER-Lap, achieved 11.58% DER on the Track 1 full set and 14.09% on the Track 1 core set, taking second place across all challenge tasks. Hitachi-JHU system, DIHARD III evaluation set, Track 1 (reference SAD)
The Hitachi-JHU DIHARD III System (Horiguchi et al., arXiv:2102.01363) (2021)
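Several entries above describe a change in relative terms ("43% relative improvement", "nearly quadrupled", "nearly double"). The sketch below recomputes those from the quoted figures. The helper names `relative_reduction` and `ratio` are illustrative, not taken from any of the cited papers.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`; 0.25 means a 25% relative reduction."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


def ratio(higher: float, lower: float) -> float:
    """How many times larger `higher` is than `lower`."""
    if lower <= 0:
        raise ValueError("lower must be positive")
    return higher / lower


# Figures as quoted in the list above.
track1_drop = relative_reduction(before=23.73, after=13.45)  # DIHARD I vs DIHARD III, Track 1 DER
track2_drop = relative_reduction(before=35.51, after=19.37)  # DIHARD I vs DIHARD III, Track 2 DER
overlap_factor = ratio(higher=43.3, lower=11.8)              # LibriCSS WER, 40% vs 0% overlap
dialect_factor = ratio(higher=0.35, lower=0.19)              # Koenecke et al., average WER gap

print(f"DIHARD Track 1 relative DER reduction: {track1_drop:.1%}")
print(f"DIHARD Track 2 relative DER reduction: {track2_drop:.1%}")
print(f"WER growth under 40% overlap: {overlap_factor:.2f}x")
print(f"WER ratio, Black vs white speakers: {dialect_factor:.2f}x")
```

Run as written, this gives reductions of roughly 43% and 45% for the two DIHARD tracks, and factors of about 3.7 and 1.8 for the overlap and dialect comparisons.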
Want the plain-English version? Read how accurate AI transcription is, what word error rate actually measures, and how to improve your own accuracy.
Sources
- 1. Achieving Human Parity in Conversational Speech Recognition (Xiong et al., Microsoft) (2016)
- 2. The Microsoft 2017 Conversational Speech Recognition System (Xiong et al.) (2017)
- 3. English Conversational Telephone Speech Recognition by Humans and Machines (Saon et al., IBM) (2017)
- 4. Racial disparities in automated speech recognition (Koenecke et al., PNAS) (2020)
- 5. Continuous Speech Separation: Dataset and Analysis (Chen et al., ICASSP 2020, Microsoft) (2020)
- 6. The Third DIHARD Diarization Challenge (Ryant et al., Interspeech 2021 / arXiv:2012.01477) (2021)
- 7. Cost Analysis of Human-corrected Transcription for Predominately Oral Languages (arXiv:2510.12781) (2025)
- 8. The Hitachi-JHU DIHARD III System (Horiguchi et al., arXiv:2102.01363) (2021)
Frequently asked questions
How accurate is AI transcription?
On clean, clearly recorded speech, leading AI transcription reaches word error rates around 5 to 6 percent, close to professional human transcribers. Accuracy falls on harder audio: heavy accents, background noise, crosstalk, and specialist vocabulary all push the error rate up.
What is a good word error rate?
Lower is better, and it depends on the audio. On clean benchmark speech, a WER under about 10 percent is strong and under 5 to 6 percent is near the human ceiling. On noisy, accented, or multi-speaker recordings, real-world error rates are often higher even for good systems.
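For readers who want to see what the metric counts, here is a minimal sketch of word error rate as a word-level edit distance: substitutions, deletions, and insertions divided by the number of words in the reference transcript. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions made for illustration; published benchmarks apply their own scoring and text-normalization rules, so numbers from this sketch are not directly comparable to them.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic dynamic-programming table, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

In this example the hypothesis has one substituted word and one dropped word against a nine-word reference, so the WER is 2/9, about 22%. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.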
Is AI transcription as accurate as a human?
On clean speech, close. Published benchmarks have shown automated systems matching or slightly beating professional human transcribers on clean conversational audio. On difficult audio, skilled humans still lead, which is why a hybrid workflow, with an AI first pass followed by human cleanup, is common.