The accent penalty is real, and it's been measured
Accent and dialect have a large, measurable effect on speech-to-text accuracy. On the Edinburgh International Accents of English Corpus (EdAcc), the best model averaged a 19.7% word error rate across accents, against only 2.7% on clean, read US English (Sanabria et al., 2023). Every model tested did worst on Indian, Jamaican, and Nigerian English.
Commercial systems show the same pattern for dialect. Across five services from Amazon, Apple, Google, IBM, and Microsoft, the average word error rate was 0.35 for Black speakers versus 0.19 for white speakers (Koenecke et al., 2020). That's nearly double in aggregate, and every system tested showed a substantial gap. The study looked at African American Vernacular English specifically, but a gap of similar shape appears in studies of other accents.
For your own work, the read is simple. An AI first pass is still the fastest way to draft. Just expect more errors on strongly accented or dialectal speech, and plan your review time around that.
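As a rough illustration of what the word error rates above measure, here is a minimal Python sketch of the standard calculation: word-level edit distance divided by the number of reference words. The lowercasing and whitespace tokenization are simplifying assumptions, the example sentences are made up, and the cited studies apply their own text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: simple lowercasing and whitespace splitting, no punctuation handling.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from draft
                curr[j - 1] + 1,     # insertion: extra word in draft
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one wrong word out of six reference words.
print(round(word_error_rate("the cat sat on the mat",
                            "the cat sat on a mat"), 3))  # 0.167
```

On this scale, the 0.35 versus 0.19 comparison means roughly one word in three wrong against one word in five.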
Why accented English trips up AI: pronunciation and prosody, not grammar
The errors come from how words sound, not from the words themselves. When Koenecke's team compared identical phrases with the same ground-truth transcript, the word error rate was still roughly twice as high for Black speakers (Koenecke et al., 2020). Because the wording was the same, vocabulary and content can't explain the gap.
The authors traced it to the acoustic models, the part of a system that maps sound to phonemes. In their analysis, the systems were confused by the phonological, phonetic, and prosodic features of the dialect, not its grammar or vocabulary. The likely cause was too little audio from Black speakers in the training data (Koenecke et al., 2020). That attribution is specific to this dialect, and the authors frame it as a likely cause, not a proven one.
Prosody is worth defining, because it's doing the damage. It covers rhythm, pitch, syllable stress, vowel length, and lenition, meaning softened or dropped sounds. These shift with accent and dialect, and they're the cues an acoustic model relies on. When a model has heard little of a given accent, those cues read as noise instead of words.
Non-native English runs into the same wall
Speaking English as a second language carries a similar cost. Across commercial services, word information lost (WIL), an error metric in the same family as word error rate where lower is better, averaged 0.14 lower for first-language English speakers than for second-language speakers. The weakest results came from speakers whose first language was Mandarin, Spanish, or Russian (DiChristofano et al., 2022).
Be careful about the cause, though. That study's headline finding tied accuracy to how closely a speaker's birth country is geopolitically aligned with the United States, not simply to how much training audio each accent had. The second-language accent penalty is well documented; the reason behind it is less settled than the training-data explanation offered for African American Vernacular English.
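For readers unfamiliar with word information lost, the sketch below uses its common definition: one minus the product of the share of reference words recognized correctly and the share of transcript words that are correct. The function name and the counts in the example are illustrative assumptions, not figures from the study, and the sketch takes the number of correct words as given instead of computing an alignment.

```python
def word_information_lost(hits: int, ref_words: int, hyp_words: int) -> float:
    """WIL = 1 - (hits / ref_words) * (hits / hyp_words).

    hits      -- words the transcript got right (from an alignment done elsewhere)
    ref_words -- word count of the human reference transcript
    hyp_words -- word count of the machine transcript
    """
    if ref_words <= 0 or hyp_words <= 0:
        raise ValueError("both transcripts must contain at least one word")
    if not 0 <= hits <= min(ref_words, hyp_words):
        raise ValueError("hits cannot exceed either transcript's word count")
    return 1 - (hits / ref_words) * (hits / hyp_words)


# Made-up example: 5 of 6 words correct, with transcripts of equal length.
print(round(word_information_lost(hits=5, ref_words=6, hyp_words=6), 3))  # 0.306
```

The same made-up transcript would score a word error rate of about 0.17, so WIL and WER values are not directly comparable even though both run from 0 (perfect) upward.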
If you want the wider picture of how these error rates are calculated and what a good number looks like, that sits in how accurate AI transcription is. Here the point is narrower. Accent, native or not, is one of the biggest single factors moving that number.
Noise and poor audio stack on top of accent
Accent isn't the only thing pulling accuracy down, and the factors compound. In one study of how models cope with background noise, a fine-tuned model held steady at signal-to-noise ratios of 5 dB and above but dropped sharply below that, independent of the noise type (Frontiers in Signal Processing, 2022). That's a general finding about noise, not about accents.
The two problems land on the same file, though. A strong accent already gives the acoustic model less to work with. Add background noise, a distant microphone, or heavy crosstalk, and you've stacked two hard problems on one recording. The accent you can't change. The audio you usually can.
So the highest-value move is often the audio, not the accent. Recording close to each speaker, in a quiet room, on separate channels does more for a transcript of accented speech than any setting in the tool. The full checklist for that lives in how to improve transcription accuracy, so there's no need to repeat mic technique here.
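To see roughly where a recording sits relative to a threshold like the 5 dB figure above, one option is to compare the power of a stretch of speech with a stretch of room noise alone. The following is a minimal sketch under stated assumptions: the samples are plain floats, you have picked the speech and noise-only stretches by hand, and the function names are hypothetical. The 5 dB threshold itself comes from one model in one study, so treat the result as a rough guide only.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a stretch of audio samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(speech: Sequence[float], noise_only: Sequence[float]) -> float:
    """Rough signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    noise_power = mean_power(noise_only)
    # The speech stretch also contains the room noise, so subtract it out.
    signal_power = mean_power(speech) - noise_power
    if noise_power <= 0 or signal_power <= 0:
        raise ValueError("speech must be louder than a non-silent noise floor")
    return 10 * math.log10(signal_power / noise_power)


# Made-up samples: speech at amplitude 0.2 over a noise floor at amplitude 0.1.
speech = [0.2, -0.2, 0.2, -0.2]
noise = [0.1, -0.1, 0.1, -0.1]
print(round(estimate_snr_db(speech, noise), 1))  # 4.8
```

In practice you would read the samples from the audio file and use stretches at least a second or two long; the four-sample lists here only keep the arithmetic easy to follow.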
How to transcribe accented English into a usable transcript
Two things reliably help: better audio and targeted human review. For reference, even professional human transcribers make errors on about 5.9% of words in conversational speech, measured on the Switchboard benchmark (Xiong et al., 2016). No draft is flawless, so aim for a correct final quote rather than a perfect machine transcript.
The workflow that holds up is an AI first pass followed by review. Get a speaker-labeled, timestamped draft, then read it against the audio. You can start that first pass on your own recording with AI interview transcription. For accented speech, budget extra time on names, technical terms, numbers, and any passage where two people overlap.
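The review priorities above can be sketched as a simple triage pass over the draft. This is a hypothetical illustration, not any tool's API: the `Segment` structure, the `flag_for_review` function, and the watch list of names and terms are all assumptions about what a speaker-labeled, timestamped draft might look like once exported.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Segment:  # hypothetical shape of one speaker-labeled, timestamped line
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def flag_for_review(segments: Sequence[Segment],
                    watch_terms: Sequence[str] = ()) -> list[tuple[int, list[str]]]:
    """Return (segment index, reasons) for segments that deserve a careful listen."""
    flagged = []
    for i, seg in enumerate(segments):
        reasons = []
        if re.search(r"\d", seg.text):
            reasons.append("number")
        lowered = seg.text.lower()
        if any(term.lower() in lowered for term in watch_terms):
            reasons.append("term")  # a name or technical term you listed
        # Overlap: another speaker's segment shares some of this time span.
        if any(other.speaker != seg.speaker
               and other.start < seg.end and seg.start < other.end
               for other in segments if other is not seg):
            reasons.append("overlap")
        if reasons:
            flagged.append((i, reasons))
    return flagged


# Made-up draft for illustration.
draft = [
    Segment("A", 0.0, 5.0, "We hired 12 people in Lagos"),
    Segment("B", 4.5, 7.0, "right, yes"),
    Segment("A", 8.0, 10.0, "thanks"),
]
print(flag_for_review(draft, watch_terms=["Lagos"]))
# [(0, ['number', 'term', 'overlap']), (1, ['overlap'])]
```

The watch list is where names and jargon from your own project would go; digits spelled out as words ("twelve") would slip past the number check in this sketch.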
Above all, verify every quote you plan to publish or code by checking it against the source audio. Pull the exact line, listen to it, and confirm the wording before it goes anywhere; a tool like quote transcription makes that spot-check quick. On accented English, one confidently wrong word in a key quote is the failure that actually costs you.