Pepys

Word error rate, explained

What WER actually measures, why a single accuracy percentage can mislead, and how to read the number a vendor quotes you.

The short answer

Word error rate (WER) measures speech-to-text accuracy as the share of words a system gets wrong. You align its output against a correct reference transcript, then divide the substitutions, deletions, and insertions by the number of reference words. A 10% WER means one word in ten is wrong. Lower is better, and the score depends far more on the audio than on the tool.

What is word error rate?

Word error rate is the standard way to score speech-to-text output against a known-correct transcript. You line up the machine's output with the reference transcript and count three kinds of mistakes: substitutions (a wrong word), deletions (a missing word), and insertions (an extra word the system added). The formula is 100 × (substitutions + deletions + insertions) ÷ total words in the reference transcript (Jurafsky & Martin, SLP3). A WER of 10% means roughly one error for every ten words in the reference.

The alignment behind that count isn't guesswork. The scoring tool computes the minimum edit distance between the two word strings: the fewest single-word changes (substitutions, deletions, or insertions) needed to turn one into the other. It's the same idea behind a spell-checker's suggestions. The field-standard implementation is sclite, a free script from NIST, so two labs both reporting WER are usually measuring it the same way.
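The edit-distance calculation can be sketched in a few lines of Python. This is a minimal illustration, not sclite itself: the function names are our own, and lowercasing plus splitting on whitespace stands in for the fuller text normalization a real scoring tool applies.

```python
def edit_distance(ref, hyp):
    """Fewest substitutions, deletions, and insertions that turn `ref` into `hyp`."""
    # prev[j] is the distance between the reference items seen so far
    # (before the current one) and the first j items of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # matching i reference items against nothing costs i deletions
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: a reference word is missing
                curr[j - 1] + 1,         # insertion: the output has an extra word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def wer_percent(reference, hypothesis):
    """WER as a percentage of the words in the reference transcript."""
    # Assumption: lowercase and split on whitespace as a stand-in for normalization.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return 100 * edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(round(wer_percent(reference, hypothesis), 1))  # 33.3
```

Here the output has one substitution ("sit" for "sat") and one deletion (the second "the"), so two errors over six reference words gives 33.3%.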

WER caught on because it's simple and comparable. One number, lower is better, and it works across any language with clear word boundaries. That comparability is exactly why benchmarks and vendor datasheets quote it. The catch is that the number hides the audio that produced it.

Why does '99% accurate' marketing mislead?

A single accuracy percentage hides the one thing that decides it: the audio. WER is a property of a tool running on specific audio, not of the tool alone. The same system that scores near-perfect on a clean studio read can miss one word in three on a noisy, accented phone call. So '99% accurate,' with no audio described, tells you almost nothing about what you'll get on your own recording.

Those very low WER figures usually come from clean read-aloud benchmarks. LibriSpeech, a widely used yardstick, is roughly 1,000 hours of read English audiobook speech, carefully segmented from public-domain LibriVox recordings (OpenSLR). It's clean, scripted, single-speaker audio, nothing like a two-person interview in a café. A number earned on read-aloud text is a best case that your own file rarely matches.

One more reason the percentage misleads: WER can go above 100% (Jurafsky & Martin, SLP3). Because insertions count in the numerator, a system that adds enough extra words can rack up more errors than there are words in the reference. To illustrate: a blank page deletes every reference word and scores exactly 100%, so a transcript that invents enough content can score worse than no transcript at all (our own example, not a cited figure). 'Accuracy' framed as 100 minus WER stops making sense there.
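The arithmetic is easy to check from the error counts alone. The sketch below uses a hypothetical 20-word reference and made-up counts; `wer_from_counts` is our own helper name, not part of any library.

```python
def wer_from_counts(substitutions, deletions, insertions, reference_words):
    """WER as a percentage, straight from the three error counts."""
    return 100 * (substitutions + deletions + insertions) / reference_words


# Hypothetical 20-word reference transcript.
blank_page = wer_from_counts(0, 20, 0, 20)  # every reference word deleted
invented = wer_from_counts(5, 0, 25, 20)    # 5 wrong words plus 25 extra words

print(blank_page)  # 100.0
print(invented)    # 150.0
```

Reading the second result as "100 minus WER" would give an accuracy of minus 50%, which is why the raw WER is the safer number to ask for.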

What counts as a good word error rate?

There's a human yardstick worth anchoring to. On conversational telephone speech, professional transcribers hit about 5.9% WER on the Switchboard corpus and 11.3% on the harder CallHome set (Xiong et al., 2016). The same study's automated system matched them, at 5.8% and 11.0%. So even skilled humans miss roughly one word in twenty on real conversation.

Machine WER swings hard with audio quality. In one evaluation, a speech model scored a WER of 0.12 (12%) on clean speech but 0.79 (79%) on noise-and-network-distorted audio before fine-tuning (Kumalija & Nakamoto, 2022). Accuracy fell off sharply once the signal-to-noise ratio dropped below about 5 dB. The model and the words were the same; the gap came almost entirely from the recording.

WER isn't equal across speakers, either. Auditing five major commercial systems, researchers found an aggregate 0.35 WER for Black speakers versus 0.19 for white speakers, nearly double (Koenecke et al., 2020, PNAS). A separate audit of the big cloud services found they performed significantly better for first-language English speakers (DiChristofano et al., 2022). A vendor's headline number won't tell you how it treats your speakers.

So there's no single 'good' WER; it depends on your audio and your stakes. The human benchmark near 6% on conversational telephone speech is a fair target to judge a tool against. What matters more is the trend: a number in the mid-single digits usually means light cleanup, while a high one usually points back to the recording. When the score is high, the fix is almost always the audio itself.

Is word error rate the only accuracy metric?

No. WER has known blind spots, so researchers reach for companion metrics. Match error rate (MER) and word information lost (WIL) were introduced in 2004 as improved measures (Morris, Maier & Green). Unlike WER, which is unbounded and, at high error rates, weights insertions more heavily than deletions, WIL stays bounded between 0 and 1 and treats the two symmetrically.
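To see the difference in behavior, here is a small sketch that computes all three metrics from alignment counts. It uses the definitions as they are commonly stated from Morris, Maier & Green: MER divides the errors by hits plus errors, and WIL is 1 minus hits squared over (reference length × output length). The function name and the example counts are our own.

```python
def error_metrics(hits, substitutions, deletions, insertions):
    """Return (WER, MER, WIL) as fractions, given word-alignment counts."""
    ref_len = hits + substitutions + deletions   # words in the reference
    hyp_len = hits + substitutions + insertions  # words in the system output
    errors = substitutions + deletions + insertions
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    wer = errors / ref_len
    mer = errors / (hits + errors)
    # An empty output preserves nothing, so all word information is lost.
    wil = 1.0 if hyp_len == 0 else 1 - (hits * hits) / (ref_len * hyp_len)
    return wer, mer, wil


# Same hits and same number of errors, but deletions in one case
# and insertions in the other (hypothetical counts).
print([round(x, 2) for x in error_metrics(6, 0, 4, 0)])  # [0.4, 0.4, 0.4]
print([round(x, 2) for x in error_metrics(6, 0, 0, 4)])  # [0.67, 0.4, 0.4]
```

Four dropped words and four added words leave MER and WIL unchanged, while WER moves from 0.4 to 0.67 because its denominator counts only the reference words.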

For some languages, WER barely works. Character error rate (CER) is the character-level counterpart, used where there are no clear word boundaries, like Chinese and Thai (Thennal et al., 2025). There, WER would need a subjective word-segmentation step first, so CER measures the share of characters wrong instead (TorchMetrics).
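As a rough sketch, CER is the same edit-distance calculation run over characters instead of words, so no segmentation step is needed. The code below is our own illustration rather than the TorchMetrics implementation, and it compares characters exactly as given, with no normalization of whitespace or punctuation.

```python
def edit_distance(ref, hyp):
    """Fewest substitutions, deletions, and insertions that turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def cer_percent(reference, hypothesis):
    """Character error rate as a percentage of reference characters."""
    if not reference:
        raise ValueError("reference transcript is empty")
    # Python strings iterate character by character, so no word splitting is needed.
    return 100 * edit_distance(reference, hypothesis) / len(reference)


reference = "我喜欢吃苹果"   # six characters, no spaces between words
hypothesis = "我喜欢吃平果"  # one wrong character
print(round(cer_percent(reference, hypothesis), 1))  # 16.7
```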

And word accuracy says nothing about who said what. Getting the speaker labels right is a separate axis, measured by diarization error rate (DER): missed speech, false alarms, and speaker confusions over total speaking time, a metric NIST introduced in 2003. A transcript can post a low WER and still mislabel who's speaking in a multi-person recording.
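In arithmetic terms, DER is a ratio of durations rather than word counts. The sketch below assumes the three error durations have already been measured by a scoring tool; the function name and the timings are hypothetical.

```python
def der(missed, false_alarm, confusion, total_speech):
    """Diarization error rate as a fraction; all arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total speaking time must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical recording with 600 seconds of scored speech.
print(der(missed=18.0, false_alarm=12.0, confusion=30.0, total_speech=600.0))  # 0.1
```

In this made-up case, half of the 10% error comes from speaker confusions, which a word-level WER would never show.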

What word error rate can and can't tell you

Read WER as a comparison tool, not a promise. It ranks systems fairly on the same audio, and it gives you a single, checkable number. What it can't do is predict your result from someone else's benchmark, because your recording, your speakers, and your acoustics set the real score. Treat any lone accuracy figure as a prompt to ask what audio produced it.

For a working transcript, the errors that actually reach your final text matter more than the aggregate percentage. A 5% WER that garbles the one name or number you're quoting is worse than a 7% that only drops filler words. So spot-check the load-bearing words you'll actually publish against the audio. The quotes you print are the real test of accuracy.

Tips from people who do this a lot

  • Always ask what audio a WER was measured on. A 2% score on read-aloud audiobook text tells you nothing about a noisy group interview.

  • Judge a tool against the roughly 6% human benchmark on conversational telephone speech, and ignore a '99% accurate' claim that names no audio.

  • Insertions can push WER above 100%, so 'accuracy = 100 minus WER' can be mathematically meaningless at high error rates. Ask for the raw WER.

  • WER ignores speaker labels entirely. If who-said-what matters, look for diarization error rate (DER) as a separate number.

  • For a real transcript, count only the errors in the words you'll quote or cite. Aggregate WER over a whole file overstates what you actually have to fix.

Word error rate – questions, answered

What is a good word error rate?

It depends on the audio. On conversational telephone speech, professional human transcribers land near 5.9% WER (Xiong et al., 2016), so that's a fair target for a tool. A number in the mid-single digits usually means light cleanup; a high one usually points to the recording.

How is word error rate calculated?

You align the machine's transcript to a correct reference, add up the substitutions, deletions, and insertions, divide by the number of reference words, and multiply by 100. The alignment uses minimum edit distance, and NIST's free sclite script is the standard tool for computing it (Jurafsky & Martin, SLP3).

Can word error rate be over 100%?

Yes. Because inserted words count as errors in the numerator, a system that adds enough extra or invented words can score above 100% (Jurafsky & Martin). That's exactly why treating accuracy as simply 100 minus WER breaks down once error rates climb.

Does a 99% accuracy claim mean 1% WER?

Not usefully. Such low figures typically come from clean, read-aloud benchmarks like LibriSpeech, roughly 1,000 hours of scripted audiobook speech (OpenSLR). Real interviews, with noise, accents, and crosstalk, score much higher WER, so a headline percentage without the audio described tells you little.

Is word error rate the same as diarization accuracy?

No. WER scores which words are right; diarization error rate (DER) scores who-said-what, from missed speech, false alarms, and speaker confusions (DIHARD, NIST). A transcript can have a low WER and still label the wrong speaker, so check both metrics when speakers matter.

References

  1. Jurafsky & Martin, Speech and Language Processing (SLP3), Ch. 15, §15.6 'ASR Evaluation: Word Error Rate'. Stanford University.
  2. LibriSpeech (SLR12) corpus page, ~1,000 hours of read English audiobook speech. OpenSLR.
  3. Xiong et al. (2016), Achieving Human Parity in Conversational Speech Recognition (arXiv:1610.05256). Microsoft Research / arXiv.
  4. Kumalija & Nakamoto (2022), Performance evaluation of ASR on integrated noise-network distorted speech. Frontiers in Signal Processing.
  5. Koenecke et al. (2020), Racial disparities in automated speech recognition, PNAS 117(14):7684-7689. PNAS.
  6. DiChristofano et al. (2022), Global Performance Disparities Between English-Language Accents in ASR (arXiv:2208.01157). Washington University in St. Louis / arXiv.
  7. Morris, Maier & Green (2004), From WER and RIL to MER and WIL, Interspeech 2004. ISCA / Interspeech.
  8. Thennal D K et al. (2025), Advocating Character Error Rate for Multilingual ASR Evaluation, Findings of NAACL 2025. Association for Computational Linguistics.
  9. First DIHARD Challenge, Diarization Error Rate definition (NIST RT-03S metric). DIHARD Challenge.
  10. TorchMetrics documentation, CharErrorRate. Lightning AI (TorchMetrics).
