Transcription vs translation, two different jobs
Transcription and translation solve different problems. Transcription writes down speech in the language it was spoken – audio in, same-language text out. Translation carries meaning from one language into another. Roman Jakobson's 1959 essay names the split precisely: intralingual translation, or 'rewording,' stays inside one language, while interlingual translation – 'translation proper' – moves between languages (Jakobson 1959).
The professional world sorts these language jobs the same way. The U.S. Bureau of Labor Statistics describes interpreters as working in spoken or sign language and translators as working in written language, with both converting information from one language into another. Transcription isn't on that list, because it never crosses languages. It stays in one.
A quick test tells you which one you're looking at. If the language stays put and only the medium changes, sound to text, that's transcription. For the verbatim styles and who relies on them, see what transcription is. If the language itself changes, that's translation – a separate skill with its own training and credentials.
How each one works, and why the errors differ
The two jobs fail in different ways, which tells you what each really does. Transcription's hard part is hearing correctly. On the Switchboard conversational-speech benchmark, professional human transcribers reached a 5.9% word error rate and one automated system 5.8% (Xiong et al., Microsoft Research, 2016). The mistakes that remain are mishearings: a wrong word, a dropped word, the wrong speaker.
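To make the metric concrete, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript. The function name and the sample sentences are illustrative assumptions, and the only normalization applied is lowercasing; benchmark scoring normally applies more careful text normalization than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one misheard word and one dropped word.
reference = "send the invoice to fifteen clients by friday"
hypothesis = "send the invoice to fifty clients friday"
wer = word_error_rate(reference, hypothesis)  # 2 errors / 8 words = 0.25
```

A rate of 0.25 here means two errors in eight reference words. The score counts "fifteen" heard as "fifty" the same as any other single-word slip, even though that one changes a figure.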
Translation fails on meaning, not sound. Machine translation can read fluently and still be wrong about what a sentence means. In a controlled study, raters preferred human over machine translation more strongly when judging whole documents than isolated sentences (Läubli, Sennrich & Volk, EMNLP 2018). Errors invisible in a single sentence became decisive once the whole document was in view.
Under the hood they take different inputs. Transcription is speech-to-text: a model, or a person, maps sound to words. Translation starts from text that already exists and re-expresses it in another language. One listens; the other reads. You can't hand raw audio to a translator and skip a step, because nothing is written yet to translate.
Do you transcribe first, then translate?
Yes – in almost every real workflow, transcription comes first and translation second. Audio isn't written text, and translation works on written text, so you produce a source-language transcript, then translate that. Going straight from foreign-language audio to English text is really two jobs stacked: listen and write it down, then carry the meaning across.
Order matters inside the file, too. When you translate a transcript, keep it aligned segment by segment so timestamps and speaker labels line up with the new text. A transcript translator that preserves timing and speaker labels beats pasting the whole thing into a general translator, which flattens the structure. When you reach that stage, the guide on how to translate a transcript walks through the actual steps.
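One way to picture segment-by-segment alignment is a small sketch like the one below. It assumes a transcript is a list of timed, speaker-labeled segments and that some translation function is available; `Segment`, `translate_segments`, and the lookup-table translator are hypothetical names invented for this illustration, not any particular tool's API.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Segment:
    start: float   # seconds from the start of the recording
    end: float
    speaker: str
    text: str


def translate_segments(
    segments: List[Segment],
    translate_text: Callable[[str], str],
) -> List[Segment]:
    """Translate each segment's text; copy timing and speaker unchanged."""
    return [replace(seg, text=translate_text(seg.text)) for seg in segments]


# Stand-in translator for the demo: a fixed English-to-Spanish lookup.
# A real workflow would call an actual translation step here.
DEMO_LOOKUP = {
    "Good morning.": "Buenos días.",
    "Shall we begin?": "¿Empezamos?",
}


def demo_translate(text: str) -> str:
    return DEMO_LOOKUP.get(text, text)


source = [
    Segment(0.0, 1.8, "Speaker 1", "Good morning."),
    Segment(1.8, 3.2, "Speaker 2", "Shall we begin?"),
]
translated = translate_segments(source, demo_translate)
```

Only the `text` field changes, so every translated segment keeps the start time, end time and speaker of its source segment. In practice a translator may also need neighboring segments as context to get the meaning right, which this sketch leaves out.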
Because translation runs on the transcript, any transcription error carries straight through. A misheard name or a wrong number becomes wrong in every translated copy. It pays to correct the source-language transcript first – proper nouns, figures, speaker turns – before you translate a single line.
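A correction pass on the source transcript can be as simple as the sketch below, which assumes you keep a reviewed list of known mishearings and their verified forms. The function name and the sample entries are hypothetical; a real pass would still need a person to confirm each fix against the audio.

```python
import re
from typing import Dict


def apply_corrections(text: str, corrections: Dict[str, str]) -> str:
    """Replace known mishearings with verified forms, matching whole words only."""
    for wrong, right in corrections.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        # A function replacement inserts `right` literally, so backslashes
        # in the corrected text are not treated as regex escapes.
        text = re.sub(pattern, lambda _match, r=right: r, text, flags=re.IGNORECASE)
    return text


# Hypothetical corrections gathered while reviewing the transcript.
corrections = {
    "Jon Smyth": "John Smith",   # proper noun
    "fifty": "fifteen",          # figure misheard
}
raw = "Jon Smyth approved fifty units."
clean = apply_corrections(raw, corrections)
```

Rules are applied one after another, so the output of an earlier rule can be matched by a later one; keep the list short and specific. Running this once on the source transcript fixes the name and the figure before any translated copy exists.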
Captions come from transcription, subtitles from translation
The clearest everyday example sits on a video player. The W3C Web Accessibility Initiative defines captions as a same-language text version of the speech and non-speech audio. It defines subtitles separately, as spoken audio translated into another language for viewers who can hear but don't know the language. Captions are a transcription output; subtitles are a translation output.
That's why they aren't interchangeable menu options. Captions serve a viewer who shares the language but needs the audio spelled out, including speaker labels and cues like [laughter] or [phone rings]. Subtitles serve a viewer who hears fine but doesn't speak the source language. Same video, two different problems, solved by the two different jobs above.
The order holds here as well. You caption from a same-language transcript, and if you then need another language, you translate those captions into subtitles. Transcribe once; translate as many times as you have target languages.
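The "transcribe once, translate many times" pattern might look like the sketch below, which renders one same-language caption track and one subtitle track per target language as WebVTT text. The cue tuples, function names and lookup-table translator are assumptions made for the example; only the WebVTT header and the cue timing line follow the actual file format.

```python
from typing import Callable, Dict, List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def render_vtt(cues: List[Cue]) -> str:
    """Render cues as a WebVTT document: header, then one block per cue."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)


def build_tracks(
    captions: List[Cue],
    translators: Dict[str, Callable[[str], str]],
) -> Dict[str, str]:
    """One caption track, plus one subtitle track per target language."""
    tracks = {"captions": render_vtt(captions)}
    for lang, translate_text in translators.items():
        subtitles = [(start, end, translate_text(text)) for start, end, text in captions]
        tracks[lang] = render_vtt(subtitles)
    return tracks


# Hypothetical demo data: the caption cues come from the transcript,
# and the "translator" is a fixed lookup standing in for a real one.
captions = [(0.0, 1.8, "Good morning."), (1.8, 3.2, "Shall we begin?")]
to_spanish = {"Good morning.": "Buenos días.", "Shall we begin?": "¿Empezamos?"}
tracks = build_tracks(captions, {"es": lambda text: to_spanish.get(text, text)})
```

Every track reuses the caption timings, so adding a language means adding one more entry to `translators` rather than going back to the audio.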
Where the law draws the line, and which you need
Regulators treat the two as separate, credentialed disciplines. Any foreign-language document filed with U.S. immigration must arrive with a full English translation the translator certifies as complete and accurate, plus certification that they're competent to translate it (8 CFR 103.2(b)(3)). Since a person has to vouch for accuracy and competence, raw machine output doesn't meet that bar on its own.
Spoken cross-language work carries its own credential. The Court Interpreters Act tells federal courts to use the most available certified interpreter in proceedings the United States brings (28 U.S.C. 1827). Only when no certified interpreter is reasonably available may the court fall back to an otherwise qualified one. That's interpreting: spoken, across languages, a third discipline apart from transcription and written translation.
On the transcription side, accessibility rules ask for the text, not another language. WCAG 2.1 requires an alternative for time-based media that presents equivalent information for prerecorded audio-only content (Success Criterion 1.2.1, Level A). In practice, that alternative is a transcript.
So which do you need? If your recording and your audience share a language, the job is transcription – a same-language transcript, plus captions if it's video. If your audience reads a different language, you need both, in order: transcribe first, translate second. Name the job right and the rest follows: who you hire, and which tool you reach for.