What is word error rate?
Word error rate (WER) is the standard way to score how well speech-to-text did against a known-correct transcript. You line up the machine's output with a reference transcript and count three kinds of mistakes: substitutions (a wrong word), deletions (a missing word), and insertions (an extra word the system added). The formula is 100 × (substitutions + deletions + insertions) ÷ total words in the reference transcript (Jurafsky & Martin, SLP3). A WER of 10% means roughly one error for every ten words in the reference.
The alignment behind that count isn't guesswork. The scoring tool computes the minimum edit distance between the two word sequences: the fewest single-word substitutions, deletions, and insertions needed to turn one into the other. It's the same idea behind a spell-checker's suggestions, applied to whole words instead of letters. The field-standard implementation is sclite, a free scoring program in NIST's Speech Recognition Scoring Toolkit (SCTK), so two labs both reporting WER are usually measuring it the same way.
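Here is a minimal sketch of that edit-distance calculation in Python. It is not sclite: it splits on whitespace and does no case folding or punctuation cleanup, and the function name word_error_rate is our own, chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level minimum edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = fewest edits turning the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution (sat -> sit) plus one deletion (the) over six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 1))  # 33.3
```

Real scoring pipelines usually normalize both transcripts first (casing, punctuation, number formats), and those choices can move the score, so treat this as the core idea rather than a drop-in scorer.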
WER caught on because it's simple and comparable. One number, lower is better, and it works across any language with clear word boundaries. That comparability is exactly why benchmarks and vendor datasheets quote it. The catch is that the number hides the audio that produced it.
Why does '99% accurate' marketing mislead?
A single accuracy percentage hides the one thing that decides it: the audio. WER is a property of a tool running on specific audio, not of the tool alone. The same system that scores near-perfect on a clean studio read can miss one word in three on a noisy, accented phone call. So '99% accurate,' with no audio described, tells you almost nothing about what you'll get on your own recording.
Those very low WER figures usually come from clean read-aloud benchmarks. LibriSpeech, a widely used yardstick, is roughly 1,000 hours of read English audiobook speech (OpenSLR), carefully segmented, from public-domain LibriVox recordings. It's scripted audio with one reader at a time, nothing like a two-person interview in a café. A number earned on read-aloud text is a best case that your own file rarely matches.
One more reason the percentage misleads: WER can go above 100% (Jurafsky & Martin, SLP3). Because insertions count in the numerator, a system that adds enough extra words can rack up more errors than there are words in the reference. To illustrate: a blank page scores exactly 100%, since every reference word counts as a deletion, so a transcript that invents enough content can score worse than one (our own example, not a cited figure). 'Accuracy' framed as 100 minus WER stops making sense there.
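The arithmetic is easy to check by hand. The sketch below plugs made-up error counts into the WER formula; the numbers and the function name wer_from_counts are ours, for illustration only.

```python
def wer_from_counts(substitutions: int, deletions: int, insertions: int,
                    reference_words: int) -> float:
    """WER in percent from error counts and the reference length."""
    return 100.0 * (substitutions + deletions + insertions) / reference_words


# Hypothetical four-word reference.
blank_page = wer_from_counts(0, 4, 0, 4)  # every reference word deleted
padded = wer_from_counts(0, 0, 6, 4)      # all four words right, six extra words invented

print(blank_page)  # 100.0
print(padded)      # 150.0
```

The padded transcript gets every real word right and still lands at 150%, which would be an 'accuracy' of minus 50%.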
What counts as a good word error rate?
There's a human yardstick worth anchoring to. On conversational telephone speech, professional transcribers hit about 5.9% WER on the Switchboard corpus and 11.3% on the harder CallHome set (Xiong et al., 2016). The same study's automated system matched them, at 5.8% and 11.0%. So even skilled humans miss roughly one word in twenty on real conversation.
Machine WER swings hard with audio quality. In one evaluation, a speech model scored 0.12 WER (12%) on clean speech but 0.79 (79%) on noise-and-network-distorted audio before fine-tuning (Kumalija & Nakamoto, 2022). Accuracy fell off sharply once the signal-to-noise ratio dropped below about 5 dB. Same model, same words, very different scores, and the gap is driven almost entirely by the recording.
WER isn't equal across speakers, either. Auditing five major commercial systems, researchers found an aggregate 0.35 WER (35%) for Black speakers versus 0.19 (19%) for white speakers, nearly double (Koenecke et al., 2020, PNAS). A separate audit of the big cloud services found they performed significantly better for first-language English speakers (DiChristofano et al., 2022). A vendor's headline number won't tell you how it treats your speakers.
So there's no single 'good' WER; it depends on your audio and your stakes. The human benchmark near 6% on conversational telephone speech is a fair target to judge a tool against. What matters more is the trend: a number in the mid-single digits usually means light cleanup, while a high one usually points back to the recording. When the score is high, the fix is almost always the audio itself.
Is word error rate the only accuracy metric?
No. WER has known blind spots, so researchers reach for companion metrics. Match error rate (MER) and word information lost (WIL) were introduced in 2004 as improved measures (Morris, Maier & Green). Unlike WER, which is unbounded and, at high error rates, weights insertions more heavily than deletions, WIL stays bounded between 0 and 1 and treats the two symmetrically.
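To see the difference, here is a small sketch that computes all three from the same alignment counts. It uses the formulas as they are commonly stated: MER is errors over hits plus errors, and WIL is 1 minus hits squared over (reference length × hypothesis length). Values are fractions, not percentages. The function name and the counts are ours, and scoring an empty hypothesis as WIL = 1 is our own convention.

```python
def error_metrics(hits: int, substitutions: int, deletions: int, insertions: int) -> dict:
    """Return WER, MER and WIL as fractions from alignment counts."""
    errors = substitutions + deletions + insertions
    ref_words = hits + substitutions + deletions   # words in the reference
    hyp_words = hits + substitutions + insertions  # words in the system output
    if ref_words == 0:
        raise ValueError("reference must contain at least one word")

    wer = errors / ref_words
    mer = errors / (hits + errors)
    # Assumption: an empty hypothesis loses all information, so WIL = 1.
    wil = 1.0 if hyp_words == 0 else 1.0 - (hits * hits) / (ref_words * hyp_words)
    return {"wer": wer, "mer": mer, "wil": wil}


# Same number of errors, once as deletions and once as insertions.
dropped = error_metrics(hits=5, substitutions=1, deletions=3, insertions=0)
padded = error_metrics(hits=5, substitutions=1, deletions=0, insertions=3)

print(dropped["wer"] < padded["wer"])    # True: WER punishes the insertions harder
print(dropped["wil"] == padded["wil"])   # True: WIL scores both cases the same
```

With two correct words and four invented ones, WER comes out at 2.0 (200%) while MER and WIL both stay below 1.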
For some languages, WER barely works. Character error rate (CER) is the character-level counterpart, used where there are no clear word boundaries, like Chinese and Thai (Thennal et al., 2025). There, WER would need a subjective word-segmentation step first, so CER measures the share of characters wrong instead (TorchMetrics).
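A minimal sketch of CER, assuming each Python character (Unicode code point) counts as one unit and that spaces are scored like any other character. Real tools differ on both points, and the function name and example sentence are ours.

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Return CER in percent using character-level minimum edit distance."""
    if not reference:
        raise ValueError("reference must contain at least one character")

    # Same dynamic program as word-level WER, run over characters instead.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_char in enumerate(hypothesis, start=1):
            curr[j] = min(
                prev[j - 1] + (ref_char != hyp_char),  # substitution or match
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
            )
        prev = curr

    return 100.0 * prev[-1] / len(reference)


# One wrong character out of five, with no word segmentation needed.
print(char_error_rate("我喜欢音乐", "我喜欢音月"))  # 20.0
```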
And word accuracy says nothing about who said what. Getting the speaker labels right is a separate axis, measured by diarization error rate (DER): missed speech, false alarms, and speaker confusions over total speaking time, a metric NIST introduced in 2003. A transcript can post a low WER and still mislabel who's speaking in a multi-person recording.
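As a rough sketch, DER is the same kind of ratio as WER, measured in time rather than words. The durations below are invented, and the function name is ours.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Return DER in percent; all arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total speaking time must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech


# Hypothetical 10-minute recording: 18 s of speech missed, 12 s of non-speech
# labeled as speech, 30 s attributed to the wrong speaker.
print(diarization_error_rate(18.0, 12.0, 30.0, 600.0))  # 10.0
```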
What word error rate can and can't tell you
Read WER as a comparison tool, not a promise. It ranks systems fairly on the same audio, and it gives you a single, checkable number. What it can't do is predict your result from someone else's benchmark, because your recording, your speakers, and your acoustics set the real score. Treat any lone accuracy figure as a prompt to ask what audio produced it.
For a working transcript, the errors that actually reach your final text matter more than the aggregate percentage. A 5% WER that garbles the one name or number you're quoting is worse than a 7% that only drops filler words. So spot-check the load-bearing words you'll actually publish against the audio. The quotes you print are the real test of accuracy.
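One way to focus that spot-check is to list the names and numbers you expect and flag any that never show up in the transcript. The helper below is a hypothetical sketch, not a feature of any particular tool, and the names and figures in it are invented. A term that does appear still needs a listen, since the helper can't tell you it was said where the transcript puts it.

```python
import re


def missing_key_terms(transcript: str, key_terms: list[str]) -> list[str]:
    """Return the key terms that never appear as whole words in the transcript."""
    def normalize(text: str) -> str:
        # Lowercase and keep only word characters and apostrophes.
        return " ".join(re.findall(r"[\w']+", text.lower()))

    haystack = f" {normalize(transcript)} "
    return [term for term in key_terms if f" {normalize(term)} " not in haystack]


transcript = "Dr. Okafor said revenue rose 14 percent in 2023."
print(missing_key_terms(transcript, ["Okafor", "40 percent", "2023"]))  # ['40 percent']
```

Anything the helper flags is a place to go back to the audio first.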