What is transcription?

A plain-English explainer for researchers, journalists, and anyone deciding whether to type a recording out by hand or let software do the first pass.

The short answer

Transcription is the process of turning spoken words into written text, and the written record it produces. A transcript can be strict verbatim, keeping every "um" and false start, or cleaned up for readability. The work is done by human typists, by automatic speech recognition, or by an AI draft that a person then corrects.

Transcription, defined

Transcription is the act of representing spoken words in written or printed form, and the written copy that results. The Oxford Advanced Learner's Dictionary defines a transcription as a written or printed copy of words that have been spoken. The term names both a process and its product: you transcribe a recording, and the file you end up with is the transcript.

It helps to separate transcription from two things people mix it up with. Translation swaps one language for another; transcription keeps the same language and simply writes it down. Speech recognition is the technology that listens and predicts the words – one way to produce a transcript, not another name for the job.

There's a related task worth separating out too. Turning speech into words is distinct from working out who said each line. That second job, speaker diarization, labels the turns so a two-person interview reads as a back-and-forth rather than one undivided block of text.
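
To make the split concrete, here is a minimal Python sketch of what happens after diarization: consecutive segments from the same speaker are merged into labeled turns. The `(speaker, text)` segment format and the `to_turns` name are assumptions made for illustration, not the output format of any particular tool.

```python
from itertools import groupby

def to_turns(segments):
    """Merge consecutive segments by the same speaker into one labeled turn.

    `segments` is an assumed format: a list of (speaker, text) pairs in
    the order they were spoken.
    """
    turns = []
    for speaker, group in groupby(segments, key=lambda seg: seg[0]):
        text = " ".join(seg[1] for seg in group)
        turns.append(f"{speaker}: {text}")
    return turns

segments = [
    ("Interviewer", "How did the project start?"),
    ("Guest", "It began as a side experiment."),
    ("Guest", "Then it grew."),
    ("Interviewer", "When was that?"),
]
for line in to_turns(segments):
    print(line)
```

The words are identical with or without the labels. Only the second step turns the block of text into a readable exchange.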

The three transcription styles: verbatim, clean, and edited

How you transcribe changes what the transcript can be used for. In qualitative research the method is treated as consequential for the analysis itself, not a formatting preference (Oliver, Serovich & Mason, 2005). The same hour of audio produces a different document depending on the style you choose.

Strict verbatim, sometimes called naturalistic, keeps every utterance: the false starts, the "you know," the half-second stammer. Clean verbatim strips the filler and repetitions but keeps the speaker's actual words. Edited or intelligent verbatim goes one step further and tidies grammar so a quote reads smoothly on the page.
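
As a rough illustration of the step from strict to clean verbatim, the Python sketch below strips a short list of fillers and collapses immediate word repetitions. The filler list and the `clean_verbatim` function are hypothetical, and the rules are deliberately crude: a person would still need to restore capitalization and check that a phrase like "you know" or a doubled word wasn't carrying meaning.

```python
import re

# Hypothetical filler list for illustration; a real style guide defines its own.
FILLERS = ["you know", "um", "uh", "er"]

def clean_verbatim(text: str) -> str:
    """Rough pass from strict verbatim toward clean verbatim."""
    # Remove filler words or phrases, plus a comma that directly follows them.
    filler_pattern = r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*"
    text = re.sub(filler_pattern, "", text, flags=re.IGNORECASE)
    # Collapse immediate word repetitions such as "I I" or "the the".
    text = re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Tidy any double spaces left behind.
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_verbatim("Um, so I I think the the budget is uh fine."))
# prints: so I think the budget is fine.
```

Edited verbatim is not sketched here, because tidying grammar is an editorial judgment and not a pattern match.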

The gap between these styles maps onto what researchers call naturalism versus denaturalism: capture every detail, or standardize the grammar and smooth non-standard accents. Which one is right depends on your purpose. The full comparison with a when-to-use rule is worth reading before you commit a whole project to one style.

Is AI now as accurate as a human?

On clean conversational speech, it's close. Professional human transcribers reach a 5.9% word-error rate (the number of wrong, missing, or extra words divided by the number of words actually spoken) on the Switchboard benchmark, and a Microsoft Research system reached 5.8% on the same audio (Xiong et al., 2016). At that level you're editing a draft, not retyping one, though accuracy in the field rarely matches a clean benchmark.
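
For readers who want to see how a word-error rate is computed, here is a minimal Python sketch: count the fewest word substitutions, deletions, and insertions needed to turn the reference transcript into the draft, then divide by the number of reference words. The function name is an assumption, and the text normalization (lowercasing and splitting on spaces) is far simpler than what published benchmarks apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edits to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the draft
            insertion = curr[j - 1] + 1  # extra word in the draft
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

reference = "the quick brown fox jumps over the lazy dog"
draft = "the quick brown box jumps over lazy dog"
print(round(word_error_rate(reference, draft), 3))  # prints 0.222
```

The example draft has one wrong word and one missing word against nine reference words, so the rate is 2/9. At a 5.9% rate, a transcript has about 59 errors for every 1,000 words spoken.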

Time is why the AI-first workflow took hold. Transcribing one hour of audio by hand can take up to six hours of manual work (Haberl et al., 2023). An automatic first pass turns most of that day into a few minutes of processing, plus a focused correction pass over names, jargon, and fast numbers.

Parity on a benchmark hides where AI still fails. One study of five commercial systems found average word-error rates nearly twice as high for Black speakers as for white speakers, 0.35 against 0.19, or 35% versus 19% (Koenecke et al., 2020). Accent, poor audio, and crosstalk all degrade a machine faster than they degrade a careful person. The fuller human-versus-AI tradeoff covers when to trust which.

Where transcription is used, and who does it

Transcription runs from courtrooms to podcast show notes, and some of it is required by law. U.S. federal court proceedings must be recorded verbatim by statute under 28 U.S.C. § 753. Accessibility standards cover the everyday end: WCAG 2.1 Success Criterion 1.2.1 (Level A) requires a text alternative for prerecorded audio-only content, which in practice means a transcript.

The work supports a real profession. The U.S. Bureau of Labor Statistics counts about 17,700 court reporters and simultaneous captioners, with a median wage of $67,310 in May 2024. Legal, medical, media, research, and accessibility teams treat an accurate transcript as part of routine work.

The field is being reshaped by the same technology that now drafts the transcript. BLS projects medical transcriptionist employment to decline 5% from 2024 to 2034, and ties the drop to speech recognition and natural language processing. The job is shifting from typing every word to checking and correcting what a model produced.

So which transcription method should you use?

There's a simple rule for choosing: match the effort to the cost of a wrong word. A searchable archive of old recordings can ride on the raw machine draft, since nobody is quoting it line by line. A published quote or a legal record needs a person to read the draft against the audio first, because that's where a plausible-but-wrong word does real damage.
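
The rule can be written down as a tiny Python sketch. The use label and the `needs_human_check` function are hypothetical, and defaulting every unlisted use to a human check is an assumption made here to stay on the cautious side.

```python
# Hypothetical label for a use where the raw machine draft is enough.
LOW_STAKES = {"searchable archive"}

def needs_human_check(use: str) -> bool:
    """Return True when a person should read the draft against the audio."""
    if use.strip().lower() in LOW_STAKES:
        return False
    # Published quotes, legal records, and anything unrecognized get a check.
    return True

for use in ["searchable archive", "published quote", "legal record"]:
    print(use, "->", "human check" if needs_human_check(use) else "raw draft")
```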

For most people the practical answer is the hybrid one. Let a tool draft the transcript, then spend your attention on the handful of lines that carry weight. If your next task is an actual recording, the step-by-step interview workflow walks through recording, drafting, and cleaning the quotes you'll publish.

Tips from people who do this a lot

Decide your verbatim style before the first edit, not after – switching halfway means re-checking every line you already touched.
A transcript and a translation are different deliverables. If you need the words in another language, that's a second step after transcription.
Speaker labels are a separate task from the words themselves. If who-said-what matters for citation, confirm the tool diarizes; don't assume clean text means clean turns.
AI parity numbers come from a benchmark of clean conversational speech, not from field recordings. Budget more correction time for noisy recordings, heavy accents, and any moment two people talk over each other.
Keep the original recording after you transcribe. A transcript is a representation, not the evidence itself – for legal, research, or fact-checking work you may need to re-listen to the source.

Try it now

Drop in your recording or paste a link and get a clean, speaker-labeled transcript in minutes. Your first 60 minutes are free.

60 min free · no card required · we never train on your audio

Trusted by 100,000+ creators, podcasters, journalists & researchers

What is transcription – questions, answered

What is the difference between transcription and translation?

Transcription writes down what was said in the same language the speaker used. Translation renders that content into a different language. They're separate jobs, often done in sequence: you transcribe the recording first, then translate the finished transcript if you need another language. One is about capture, the other about conversion.

Is a transcript the same as verbatim?

No. Verbatim is one style of transcript, the kind that keeps every "um," false start, and repetition. Clean verbatim removes the filler while keeping the real words, and edited verbatim tidies grammar for readability. All three are transcripts. Verbatim just describes how faithfully the text mirrors the exact speech.

How accurate is transcription?

Professional human transcribers reach about a 5.9% word-error rate on clean conversational speech (Xiong et al., 2016), and strong AI systems now match that on the same benchmark. Accuracy drops sharply with background noise, heavy accents, or crosstalk, so real-world results sit below the benchmark for both people and machines.

Who uses transcription?

Courts, doctors, journalists, qualitative researchers, podcasters, and accessibility teams all use it. Some uses are mandated: U.S. federal courts must record proceedings verbatim by statute, and accessibility standards require a text alternative for prerecorded audio. The Bureau of Labor Statistics counts about 17,700 court reporters and simultaneous captioners in the United States.

Can I just use automatic transcription, or do I still need a human?

It depends on the stakes. For searchable notes or a rough draft, an automatic transcript is usually enough. For published quotes, legal records, or research coding, a person should check the output, because word-error rates roughly double for some speaker groups (Koenecke et al., 2020). The common workflow is AI draft, human correction.

References

1. Oxford Advanced Learner's Dictionary – 'transcription' – Oxford University Press
2. Oliver, Serovich & Mason (2005), Constraints and Opportunities with Interview Transcription – Social Forces (Oxford University Press)
3. Haberl et al. (2023), Take the aTrain – manual transcription time cost, citing Bell et al. (2018) – arXiv / University of Graz
4. Xiong et al. (2016), Achieving Human Parity in Conversational Speech Recognition – arXiv / Microsoft Research
5. Koenecke et al. (2020), Racial disparities in automated speech recognition – Proceedings of the National Academy of Sciences (PNAS)
6. Federal Court Reporting Program – verbatim recording requirement (28 U.S.C. § 753) – Administrative Office of the U.S. Courts
7. Understanding WCAG 2.1 SC 1.2.1 – text alternative for prerecorded audio (Level A) – World Wide Web Consortium (W3C) / Web Accessibility Initiative
8. Occupational Outlook Handbook – Court Reporters & Simultaneous Captioners – U.S. Bureau of Labor Statistics
9. Occupational Outlook Handbook – Medical Transcriptionists – U.S. Bureau of Labor Statistics
