Transcription vs translation

A plain-English guide to what separates writing speech down from carrying it into another language – and why you almost always do the first before the second.

The short answer

Transcription turns speech into written text in the same language; translation converts text from one language into another. They run in sequence, not in place of each other: you transcribe a recording first to get source-language text, then translate that text if your audience needs another language. Transcription errors are mishearings; translation errors are meaning-level. Captions are same-language; subtitles are translated.

Transcription vs translation: two different jobs

Transcription and translation solve different problems. Transcription writes down speech in the language it was spoken – audio in, same-language text out. Translation carries meaning from one language into another. Roman Jakobson's 1959 essay names the split precisely: intralingual translation, or 'rewording,' stays inside one language, while interlingual translation – 'translation proper' – moves between languages (Jakobson 1959).

The professional world sorts these language jobs the same way. The U.S. Bureau of Labor Statistics describes the field this way: interpreters work in spoken or sign language, translators work in written language, and both convert one language into another. Transcription isn't on that list because it never crosses languages. It stays in one.

A quick test tells you which one you're looking at. If the language stays put and only the medium changes, sound to text, that's transcription. If the language itself changes, that's translation – a separate skill with its own training and credentials. For the verbatim styles and who relies on them, see what transcription is.

How each one works, and why the errors differ

The two jobs fail in different ways, which tells you what each really does. Transcription's hard part is hearing correctly. On the Switchboard conversational-speech benchmark, professional human transcribers reached a 5.9% word error rate and one automated system 5.8% (Xiong et al., Microsoft Research, 2016). The mistakes that remain are mishearings: a wrong word, a dropped word, the wrong speaker.
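
Word error rate is simple to compute by hand or in code. The sketch below assumes the usual definition (substitutions, deletions, and insertions divided by the number of words in the reference transcript) and a plain whitespace tokenizer; the function name and the sample sentences are illustrative, not taken from the cited study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match, or wrong word (substitution)
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word ("a" for "the") and one dropped word ("by") out of 8.
print(word_error_rate(
    "please send the report to dana by friday",
    "please send a report to dana friday",
))  # 0.25
```

At a rate near 6%, a 1,000-word conversation would still carry roughly 60 such errors, which is why a review pass remains worthwhile.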

Translation fails on meaning, not sound. Machine translation can read fluently and still be wrong about what a sentence means. In a controlled study, raters preferred human over machine translation more strongly when judging whole documents than isolated sentences (Läubli, Sennrich & Volk, EMNLP 2018). Errors invisible in a single sentence became decisive once the whole document was in view.

Under the hood they take different inputs. Transcription is speech-to-text: a model, or a person, maps sound to words. Translation starts from text that already exists and re-expresses it in another language. One listens; the other reads. You can't hand raw audio to a translator and skip a step, because nothing is written yet to translate.

Do you transcribe first, then translate?

Yes – in almost every real workflow, transcription comes first and translation second. Audio isn't written text, and translation works on written text, so you produce a source-language transcript, then translate that. Going straight from foreign-language audio to English text is really two jobs stacked: listen and write it down, then carry the meaning across.

Order matters inside the file, too. When you translate a transcript, keep it aligned segment by segment so timestamps and speaker labels line up with the new text. A transcript translator that preserves timing and speaker labels beats pasting the whole thing into a general translator, which flattens the structure. When you reach that stage, how to translate a transcript walks the actual steps.
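
As a minimal sketch of what segment-by-segment alignment means, the Python below translates only the text of each segment and copies timing and speaker through unchanged. The `Segment` shape, `translate_segments`, and the toy lookup table are all assumptions made for illustration; a real workflow would plug in a human or machine translation step where `toy_translate` sits.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Segment:
    start: float   # seconds from the start of the recording
    end: float
    speaker: str
    text: str


def translate_segments(
    segments: List[Segment],
    translate_text: Callable[[str], str],
) -> List[Segment]:
    """Translate each segment's text; timing and speaker stay untouched."""
    return [replace(seg, text=translate_text(seg.text)) for seg in segments]


# Hypothetical stand-in for a real translation step.
TOY_SPANISH = {
    "Good morning.": "Buenos días.",
    "Shall we start?": "¿Empezamos?",
}


def toy_translate(text: str) -> str:
    return TOY_SPANISH.get(text, text)


transcript = [
    Segment(0.0, 1.8, "Speaker 1", "Good morning."),
    Segment(1.8, 3.4, "Speaker 2", "Shall we start?"),
]
translated = translate_segments(transcript, toy_translate)

for seg in translated:
    print(f"[{seg.start:.1f}-{seg.end:.1f}] {seg.speaker}: {seg.text}")
```

Pasting the whole transcript into a translator as one block would return the words but lose the one-to-one link between each line and its timestamps.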

Because translation runs on the transcript, any transcription error carries straight through. A misheard name or a wrong number becomes wrong in every translated copy. It pays to correct the source-language transcript first – proper nouns, figures, speaker turns – before you translate a single line.
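
One way to make that correction pass repeatable is a small list of known mishearings applied to the source transcript before any translation runs. This is a sketch under assumptions: the `apply_corrections` helper and the sample entries are hypothetical, and whole-word, case-insensitive matching is a design choice rather than a rule.

```python
import re
from typing import Dict


def apply_corrections(text: str, corrections: Dict[str, str]) -> str:
    """Replace known mishearings with the verified spelling."""
    for wrong, right in corrections.items():
        # \b keeps the match to whole words, so "Ann" won't alter "Anna".
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement inserts `right` literally, even if it
        # contains backslashes.
        text = pattern.sub(lambda _match: right, text)
    return text


# Hypothetical mishearings caught during review of the source transcript.
corrections = {
    "Dana Cruise": "Dana Kruse",
    "fifteen percent": "fifty percent",
}

source = "Thanks, Dana Cruise. Revenue grew fifteen percent."
print(apply_corrections(source, corrections))
# Thanks, Dana Kruse. Revenue grew fifty percent.
```

Run once on the source text, the fix reaches every translation made afterwards; run after translating, it would have to be repeated per language.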

Captions come from transcription, subtitles from translation

The clearest everyday example sits on a video player. The W3C Web Accessibility Initiative defines captions as a same-language text version of the speech and non-speech audio. The same guidance describes subtitles as spoken audio translated into another language, for viewers who can hear but don't know the language. Captions are a transcription output; subtitles are a translation output.

That's why they aren't interchangeable menu options. Captions serve a viewer who shares the language but needs the audio spelled out, including speaker labels and cues like [laughter] or [phone rings]. Subtitles serve a viewer who hears fine but doesn't speak the source language. Same video, two different problems, solved by the two different jobs above.

The order holds here as well. You caption from a same-language transcript, and if you then need another language, you translate those captions into subtitles. Transcribe once; translate as many times as you have target languages.
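
To make "transcribe once, translate many times" concrete, the sketch below writes a caption track and a subtitle track from the same cue timings. It assumes the SubRip (SRT) layout of a cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, and the text; the helper names and sample lines are illustrative, and the Spanish text stands in for a translation produced elsewhere.

```python
from typing import List, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm form used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: List[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) cues as numbered SRT blocks."""
    blocks = [
        f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        for index, (start, end, text) in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


# Timings come from the transcript once; only the text differs per track.
timings = [(0.0, 2.5), (2.5, 5.0)]
tracks = {
    "en": ["Hello, who is this?", "It's Dana."],   # same-language captions
    "es": ["Hola, ¿quién habla?", "Soy Dana."],    # translated subtitles
}

for language, lines in tracks.items():
    cues = [(start, end, text) for (start, end), text in zip(timings, lines)]
    print(f"--- {language} ---")
    print(to_srt(cues))
```

Adding a third language means adding one more list of translated lines; the timings are never touched.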

Where the law draws the line, and which you need

Regulators treat the two as separate, credentialed disciplines. Any foreign-language document filed with U.S. immigration must arrive with a full English translation the translator certifies as complete and accurate, plus certification that they're competent to translate it (8 CFR 103.2(b)(3)). Since a person has to vouch for accuracy and competence, raw machine output doesn't meet that bar on its own.

Spoken cross-language work carries its own credential. The Court Interpreters Act tells federal courts to use the most available certified interpreter in proceedings the United States brings (28 U.S.C. 1827). Only when no certified interpreter is reasonably available may the court fall back to an otherwise qualified one. That's interpreting: spoken, across languages, a third discipline apart from transcription and written translation.

On the transcription side, accessibility rules ask for the text, not another language. WCAG 2.1 requires an alternative for time-based media that presents equivalent information for prerecorded audio-only content (Success Criterion 1.2.1, Level A). In practice, that alternative is a transcript.

So which do you need? If your recording and your audience share a language, the job is transcription: a same-language transcript, plus captions if it's video. If your audience reads a different language, you need both, in order: transcribe first, then translate. Name the job right and the rest follows: who you hire, and which tool you reach for.

Tips from people who do this a lot

Fix the transcript before you translate. Every misheard name or wrong number in the source text becomes an error in every language you translate it into.
For a video whose viewers share its language, you want captions: same-language text with sound cues. Reach for subtitles only when the viewer doesn't speak the source language.
Machine translation reads smoothly even when it's wrong. The errors hide at the meaning level, so review translated documents whole, not one sentence at a time.
Keep the translation aligned to the original segment by segment. It's the only way timestamps and speaker labels survive the switch into another language.
For a court or an immigration office, budget for a certified human translator or interpreter. Machine output won't satisfy the certification those filings require.

Transcription vs translation – questions, answered

Is transcription the same as translation?

No. Transcription writes speech down in the same language it was spoken; translation moves text from one language into another. Transcription changes the medium, sound to text; translation changes the language. They're often used together, transcription first, but they're separate jobs with separate skills.

Do you transcribe or translate first?

Transcribe first. Translation works on written text, and audio isn't text, so you produce a source-language transcript, then translate it. Working in that order also lets you fix misheard names and numbers once, before those errors carry into every translated version.

What's the difference between captions and subtitles?

Captions are same-language text of the speech plus non-speech sounds, for viewers who share the language but need the audio spelled out. Subtitles are dialogue translated into another language, for viewers who can hear but don't speak the source language (W3C WAI). Captions come from transcription; subtitles from translation.

Can machine translation replace a human translator?

For rough understanding, often. For anything certified, no. Machine translation reads fluently but still makes meaning-level errors, and U.S. immigration filings require the translator to certify that a translation is complete and accurate. Federal courts likewise call for a certified interpreter, where one is reasonably available, in proceedings the United States brings.

Is a transcript legally required for audio content?

For accessibility, effectively yes. WCAG 2.1 (Success Criterion 1.2.1, Level A) requires an alternative for time-based media for prerecorded audio-only content, which in practice means a transcript. That's a transcription requirement, in the same language, not a translation one.

References

1. Interpreters and Translators – interpreter (spoken/sign) vs translator (written) – U.S. Bureau of Labor Statistics, Occupational Outlook Handbook
2. Jakobson (1959), On Linguistic Aspects of Translation – intralingual vs interlingual – reproduced in Munday, Introducing Translation Studies (Routledge, 2016), p. 9
3. Xiong et al. (2016), Achieving Human Parity in Conversational Speech Recognition – Switchboard WER – Microsoft Research (arXiv:1610.05256)
4. Läubli, Sennrich & Volk (2018), Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation – EMNLP – ACL Anthology D18-1512
5. Captions/Subtitles – captions (same-language) vs subtitles (translated) – W3C Web Accessibility Initiative (WAI)
6. WCAG 2.1 Success Criterion 1.2.1, Audio-only and Video-only (Prerecorded), Level A – W3C Web Content Accessibility Guidelines 2.1
7. 8 CFR 103.2(b)(3) – certified full English translation for immigration filings – Cornell Legal Information Institute
8. 28 U.S.C. 1827 (Court Interpreters Act) – certified interpreters in federal proceedings – Cornell Legal Information Institute
