Transcription accuracy starts with the recording, not the software
Clean audio is the biggest accuracy lever, and the effect is measurable. Automatic speech recognition (ASR) word error rate falls as the signal-to-noise ratio (SNR) rises, and climbs sharply once SNR drops below 5 dB. One evaluated model scored a 0.12 word error rate on clean speech versus 0.79 on noise-distorted speech across babble, car, street, and station noise. No cleanup step recovers detail the microphone never captured.
So spend the effort upstream. Mic each speaker close, off hard surfaces that boom, and away from vents, fridges, and fans. For remote calls, record per-channel where the platform allows it. The full capture checklist lives in the interview workflow guide; the short version is that a $30 lav clipped to a lapel beats a phone across the table, every time.
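A rough way to check a room before the interview starts is to compare a few seconds of talking against a few seconds of silence. The sketch below is a minimal illustration, not a calibrated measurement: the function names are hypothetical, the samples are assumed to be already-decoded floats, and because the "speech" clip also contains the room noise, the result slightly overstates the true SNR.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a clip."""
    if not samples:
        raise ValueError("clip is empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Approximate SNR in decibels: 10 * log10(speech power / noise power).

    `speech` is a clip recorded while someone talks; `noise` is a clip of
    the same room with nobody talking. Both are assumptions about how the
    test clips were captured.
    """
    noise_power = mean_power(noise)
    if noise_power == 0:
        return math.inf
    speech_power = mean_power(speech)
    if speech_power == 0:
        return -math.inf
    return 10 * math.log10(speech_power / noise_power)


# Synthetic example: speech at 10x the amplitude of the room noise.
speech_clip = [0.5, -0.5] * 100
noise_clip = [0.05, -0.05] * 100
print(f"{snr_db(speech_clip, noise_clip):.1f} dB")  # 20.0 dB
```

If the figure lands anywhere near the 5 dB mark, move the mic closer or change rooms before recording.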
How accurate can transcription actually get?
Accuracy is measured as word error rate, or WER, defined as 100 × (insertions + substitutions + deletions) divided by the number of words in the reference transcript, and scored with NIST's sclite script. The realistic ceiling is human: professional transcribers hit 5.9% WER on Switchboard conversational speech, rising to 11.3% on harder open-ended calls. Expect roughly 5–6% on clean, on-topic speech.
That number reframes the goal. You're not chasing a perfect transcript; even trained humans miss about one word in twenty on natural conversation. An AI first pass lands in the same neighborhood on clean audio, so your job is editing, not re-transcribing. Judge a tool on the errors it makes on your audio, not on a headline accuracy figure from a lab benchmark.
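To judge a tool on your own audio, hand-correct a few minutes of transcript and score the tool's draft against it. The sketch below applies the WER definition above using a word-level edit distance. It is a simplified stand-in for sclite, not a reimplementation of it: the only normalization is lowercasing and whitespace splitting, so punctuation attached to a word counts as a difference.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent.

    100 * (insertions + substitutions + deletions) / words in reference,
    where the error count is the minimum word-level edit distance.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match if cost is 0
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


corrected = "we shipped the update in march"
draft = "we shipped update in may"
print(f"{wer(corrected, draft):.1f}")  # 33.3
```

Here the draft drops "the" and swaps "march" for "may": two errors against six reference words. On a short sample like this, a single mistake moves the score a long way, so score several minutes before comparing tools.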
Why a word-perfect transcript can still be wrong
Word accuracy and speaker accuracy are different metrics. A transcript can be word-for-word correct yet attribute lines to the wrong person, which is why speaker labeling has its own score: diarization error rate, or DER, introduced for NIST's Rich Transcription Spring 2003 evaluation. It measures who-spoke-when over time – DER = (false alarm + missed + wrong-speaker) divided by total reference speaker time – not whether the words are right.
For attributable quotes, this is the accuracy that bites. Overlapping speech and single-mic recordings are where speaker labels slip. Record each speaker on a separate channel and the tool has far less to guess. With a mixed file you'll still get labels, but budget time to fix turns by hand around crosstalk, especially short interjections that land inside another person's sentence.
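The DER formula can be made concrete with a small frame-based sketch. It rests on simplifying assumptions that real scoring tools do not make: time is cut into equal frames, each frame has at most one speaker (so overlapping speech is not modeled), and the tool's speaker labels have already been mapped onto the reference names. The function name and input format are hypothetical.

```python
from typing import Optional, Sequence


def der(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> float:
    """Diarization error rate as a fraction of reference speaker time.

    Each sequence holds one speaker label per frame, or None for silence.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must cover the same frames")

    false_alarm = missed = wrong_speaker = total = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            total += 1
            if hyp is None:
                missed += 1          # someone spoke, tool heard silence
            elif hyp != ref:
                wrong_speaker += 1   # tool credited the wrong person
        elif hyp is not None:
            false_alarm += 1         # tool labeled speech during silence
    if total == 0:
        raise ValueError("reference contains no speech")
    return (false_alarm + missed + wrong_speaker) / total


truth = ["A"] * 4 + [None] * 2 + ["B"] * 4
draft = ["A"] * 3 + ["B", "B", None] + ["B"] * 3 + [None]
print(f"{der(truth, draft):.1%}")  # 37.5%
```

In this example every frame the tool labeled could carry perfectly transcribed words, and the score would still show three of eight speech frames attributed incorrectly or missed.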
Verify the quotes you'll actually publish
Verification is where accuracy is won, and it's cheaper than starting over. Manual transcription runs up to about six hours per audio hour, so re-typing to fix errors is the slow path. Instead, read the draft against the audio and spend your attention on the load-bearing 5%: names, companies, acronyms, and fast-spoken numbers. Those are exactly the spots ASR misses.
Isolate the lines you'll cite and check each one against the recording before it ships. Word- or sentence-level timestamps let you jump to a line and hear it in seconds instead of scrubbing. Tools that pull and verify a specific quote make this faster. Flag anything genuinely unclear as [inaudible] with its timestamp rather than guessing; a flagged gap is honest, a confident wrong quote is a correction waiting to happen.
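If your tool exports word-level timestamps, finding the audio for a quote can be scripted. The sketch below assumes a hypothetical export format, a list of (text, start_seconds, end_seconds) tuples; real tools each use their own layout, so treat the `Word` type and `locate_quote` helper as illustrations to adapt.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical export format: (text, start_seconds, end_seconds).
Word = Tuple[str, float, float]


def _normalize(token: str) -> str:
    """Lowercase and drop punctuation so 'March.' matches 'march'."""
    return re.sub(r"[^\w']", "", token.lower())


def locate_quote(words: List[Word], quote: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) seconds of the first exact match, or None."""
    target = [t for t in (_normalize(q) for q in quote.split()) if t]
    if not target:
        return None
    # Keep only words that still have content after normalization.
    kept = [(_normalize(text), start, end) for text, start, end in words]
    kept = [w for w in kept if w[0]]

    n = len(target)
    for i in range(len(kept) - n + 1):
        if [w[0] for w in kept[i:i + n]] == target:
            return kept[i][1], kept[i + n - 1][2]
    return None


transcript = [
    ("We", 12.0, 12.2),
    ("shipped", 12.2, 12.7),
    ("in", 12.7, 12.8),
    ("March.", 12.8, 13.3),
]
print(locate_quote(transcript, "shipped in March"))  # (12.2, 13.3)
```

A `None` result is useful too: it means the quote as you typed it does not appear in the draft, which is a prompt to listen to that passage rather than trust either version.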
Accuracy is also a representational choice
In qualitative research, the verbatim style you choose shapes the analysis itself, not just how a quote reads. Naturalized transcription captures every utterance – stutters, pauses, false starts – while denaturalized transcription corrects grammar and strips interview noise. Oliver, Serovich and Mason frame that decision as a representational act with consequences for findings, and urge you to interrogate its impact rather than treat it as cosmetic.
So set the rule before you edit, and hold it across the whole transcript. Whatever style you pick, never silently correct a factual slip a source made. If they say the wrong year, keep the wrong year and mark it: the standard fix is a bracketed sic marker, italicized and inserted immediately after the error, signaling the mistake is the source's and not yours. Accuracy to the record beats a tidy-looking quote.