Transcription, defined
Transcription is the act of representing spoken words in written or printed form, and the written copy that results. The Oxford Advanced Learner's Dictionary defines a transcription as a written or printed copy of words that have been spoken. The term names both a process and its product: you transcribe a recording, and the file you end up with is the transcript.
It helps to separate transcription from two things people mix it up with. Translation swaps one language for another; transcription keeps the same language and simply writes it down. Speech recognition is the technology that listens and predicts the words – one way to produce a transcript, not another name for the job.
There's a related task worth separating out too. Turning speech into words is distinct from working out who said each line. That second job, speaker diarization, labels the turns so a two-person interview reads as a back-and-forth rather than one undivided block of text.
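To make the split between the two jobs concrete, here is a minimal sketch of what happens after diarization has already labeled each segment. The segment format (a speaker label paired with text), the function name `format_turns`, and the sample interview are all assumptions for illustration, not the output of any particular tool.

```python
from itertools import groupby

def format_turns(segments):
    """Merge consecutive segments from the same speaker into one labeled turn."""
    lines = []
    # groupby only groups adjacent items, which is what we want for turns.
    for speaker, group in groupby(segments, key=lambda seg: seg[0]):
        text = " ".join(seg[1] for seg in group)
        lines.append(f"{speaker}: {text}")
    return lines

# Hypothetical diarized segments: (speaker label, recognized text).
segments = [
    ("Interviewer", "How did the project start?"),
    ("Guest", "It began as a side experiment."),
    ("Guest", "Then it grew."),
    ("Interviewer", "When was that?"),
]

for line in format_turns(segments):
    print(line)
# Interviewer: How did the project start?
# Guest: It began as a side experiment. Then it grew.
# Interviewer: When was that?
```

Without the speaker labels, the same four segments would collapse into one block of text with no indication of who asked and who answered.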
The three transcription styles: verbatim, clean, and edited
How you transcribe changes what the transcript can be used for. In qualitative research the method is treated as consequential for the analysis itself, not a formatting preference (Oliver, Serovich & Mason, 2005). The same hour of audio produces a different document depending on the style you choose.
Strict verbatim, sometimes called naturalistic, keeps every utterance: the false starts, the "you know," the half-second stammer. Clean verbatim strips the filler and repetitions but keeps the speaker's actual words. Edited or intelligent verbatim goes one step further and tidies grammar so a quote reads smoothly on the page.
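As a rough illustration of the step from strict to clean verbatim, the sketch below strips a small set of fillers and collapses immediately repeated words. The filler list and the function name `clean_verbatim` are assumptions chosen for the example. A real clean-verbatim pass needs human judgment, since "you know" is sometimes meaningful and a repeated word is sometimes intended.

```python
import re

# Assumed filler list; a real style guide would define its own.
FILLERS = re.compile(r"\b(?:um+|uh+|you know)\b,?\s*", re.IGNORECASE)
# An immediately repeated word, e.g. "I I" or "the the the".
REPEAT = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)

def clean_verbatim(text):
    """Remove listed fillers and stammered repeats; keep the speaker's words."""
    text = FILLERS.sub("", text)
    text = REPEAT.sub(r"\1", text)   # also collapses intended repeats like "that that"
    return re.sub(r"\s{2,}", " ", text).strip()

strict = "Um, I I think, you know, it was uh fine."
print(clean_verbatim(strict))
# I think, it was fine.
```

Note what the sketch does not do: it leaves grammar and word choice alone. Tidying those is the further step that separates edited verbatim from clean verbatim.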
The gap between these styles maps onto what researchers call naturalism versus denaturalism: capture every detail, or standardize the grammar and smooth non-standard accents. Which one is right depends on your purpose. The full comparison with a when-to-use rule is worth reading before you commit a whole project to one style.
Is AI now as accurate as a human?
On clean conversational speech, it's close. Professional human transcribers reach a 5.9% word-error rate on the Switchboard benchmark, and a Microsoft Research system reached 5.8% on the same audio (Xiong et al., 2016). At that level you're editing a draft, not retyping one, though accuracy in the field rarely matches a clean benchmark.
Time is why the AI-first workflow took hold. Transcribing one hour of audio by hand can take up to six hours of manual work (Haberl et al., 2023). An automatic first pass turns most of that day into a few minutes of processing, plus a focused correction pass over names, jargon, and fast numbers.
Parity on a benchmark hides where AI still fails. One study of five commercial systems found an average word-error rate of 0.35 (35%) for Black speakers against 0.19 (19%) for white speakers, nearly double (Koenecke et al., 2020). Accent, poor audio, and crosstalk all degrade a machine faster than they degrade a careful person. The fuller human-versus-AI tradeoff covers when to trust which.
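The word-error rates quoted in this section all come from the same basic calculation: count the word substitutions, deletions, and insertions needed to turn the machine's output into the reference transcript, then divide by the number of words in the reference. The sketch below is a minimal version of that standard formula. The function name and the normalization (lowercasing and splitting on whitespace) are simplifying assumptions, and published benchmarks apply their own text-normalization rules.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (word-level Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)

# One wrong word out of five gives a 20% word-error rate.
print(word_error_rate("the quick brown fox jumps",
                      "the quick brown box jumps"))
# 0.2
```

Because insertions count as errors but the divisor is the reference length, the rate can exceed 1.0 on very poor output, and a single figure says nothing about which words were wrong.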
Where transcription is used, and who does it
Transcription runs from courtrooms to podcast show notes, and some of it is required by law. U.S. federal court proceedings must be recorded verbatim by statute under 28 U.S.C. § 753. Accessibility standards cover the everyday end: WCAG 2.1 requires an alternative for prerecorded audio-only content (Success Criterion 1.2.1), which in practice means a transcript.
The work supports a real profession. The U.S. Bureau of Labor Statistics counts about 17,700 court reporters and simultaneous captioners, with a median annual wage of $67,310 in May 2024. Legal, medical, media, research, and accessibility teams treat an accurate transcript as part of routine work.
The field is being reshaped by the same technology that now drafts the transcript. BLS projects medical transcriptionist employment to decline 5% from 2024 to 2034, and ties the drop to speech recognition and natural language processing. The job is shifting from typing every word to checking and correcting what a model produced.
So which transcription method should you use?
There's a simple rule for choosing: match the effort to the cost of a wrong word. A searchable archive of old recordings can ride on the raw machine draft, since nobody is quoting it line by line. A published quote or a legal record needs a person to read the draft against the audio first, because that's where a plausible-but-wrong word does real damage.
For most people the practical answer is the hybrid one. Let a tool draft the transcript, then spend your attention on the handful of lines that carry weight. If your next task is an actual recording, the step-by-step interview workflow walks through recording, drafting, and cleaning the quotes you'll publish.