How to add subtitles to a video

A working guide for anyone captioning their own footage – readable and compliant, not auto-caption soup.

The short answer

To add subtitles to a video, transcribe the audio into a timed text file (SRT or WebVTT), then either load it as a sidecar track your player toggles on or burn it into the picture. Start with an AI transcript, correct the words and timing by hand, keep each line to about 37 characters and each cue to two lines, then export SRT or VTT.

Subtitles or captions – which do you actually need?

Subtitles and captions use the same file formats but carry different content. Subtitles render spoken dialogue as text, often for viewers who can hear but don't share the language. Captions add speaker labels and non-speech sound – [music], [applause], a door slamming – for deaf and hard-of-hearing viewers. If accessibility is the goal, you want captions, not bare subtitles.

That distinction has legal weight. WCAG 2.1 Success Criterion 1.2.2 requires captions for all prerecorded audio in synchronized media, and it sits at Level A, the baseline conformance level. In the US, the Section 508 guidance published on Section508.gov applies that same criterion to federal and public video. So for an institutional or public-facing video, "add subtitles" really means "add compliant captions."

Even outside compliance, subtitles widen who can watch. People view muted in offices and on trains, and non-native speakers read along to keep up. Text on screen is the cheapest reach you'll buy. Decide up front which you're making, because it changes what goes in each cue.

Why auto-captions won't cut it on their own

Auto-generated captions save typing but miss the accuracy bar. A university accessibility office states it plainly: automated captions aren't sufficient for public content or accommodation requests without human editing. So the machine draft needs correction before you ship it.

Typing from scratch, though, is brutal. Manual transcription can take up to six hours for a single hour of audio – most of a working day for one video. The workable path splits the job: an AI first pass handles the bulk, then you edit. You're correcting, not re-transcribing.

Spend that editing time where ASR fails. Proper nouns, acronyms, numbers said quickly, and punctuation that changes meaning are the usual suspects, and they're exactly what a viewer notices on screen. For a recorded talk or other spoken-word footage, the same first-pass-then-fix workflow applies, and our lecture transcription guide walks through the spoken-word specifics.
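One way to aim that editing time is to flag the cues most likely to hide an ASR slip before you read the draft. The sketch below is a rough heuristic, not a model of how any recognizer fails: the three patterns and the `review_flags` name are assumptions made for illustration, and they will both over-flag and miss things.

```python
import re

# Illustrative patterns for the trouble spots named above (assumptions, not an ASR error model).
DIGITS = re.compile(r"\d")
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")
# A capitalised word that follows a lowercase word: a crude proxy for a proper noun.
MID_CAPITAL = re.compile(r"(?<=[a-z,;] )[A-Z][a-z]+")


def review_flags(text: str) -> list[str]:
    """Return the reasons a cue deserves a human look, in a fixed order."""
    flags = []
    if DIGITS.search(text):
        flags.append("number")
    if ACRONYM.search(text):
        flags.append("acronym")
    if MID_CAPITAL.search(text):
        flags.append("possible name")
    return flags
```

Run it over each cue's text and read the flagged cues against the audio first. Unflagged cues still need a pass; this only orders the work.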

SRT, WebVTT, or burned into the picture?

Subtitles reach the screen two ways. A sidecar file rides alongside the video and the player toggles it on; burned-in (open) subtitles are painted into the pixels and can't be switched off. WebVTT is the web-native standard here: it's a published W3C specification and the format the HTML5 <track> element reads.

SRT (SubRip) is a plain-text timed format that players, editors, and social platforms read widely. An audio-to-SRT export gives you a caption file to upload with your video; an audio-to-VTT export gives you the file for a <track> on your own site. Both are sidecar files; keep one as your editable master.
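The two formats are close enough that you can derive one from the other. A minimal sketch, assuming a well-formed SRT with no styling tags or byte-order mark: WebVTT needs a `WEBVTT` header line and uses a period rather than a comma before the milliseconds. The function name `srt_to_vtt` is made up for this example.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})\s*$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header, switch ',' to '.' in timings."""
    out = ["WEBVTT", ""]
    for line in srt_text.replace("\r\n", "\n").strip().split("\n"):
        m = TIMING.match(line)
        if m:
            line = f"{m.group(1)}.{m.group(2)} --> {m.group(3)}.{m.group(4)}"
        out.append(line)  # cue numbers and text pass through unchanged
    return "\n".join(out) + "\n"
```

The SRT cue numbers survive as WebVTT cue identifiers, which the format allows. Treat the output as a derived copy and keep editing the master.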

Burn-in is the fallback. For feeds that autoplay muted or ignore uploaded caption files, rendering the subtitles into the frame guarantees they show. The trade-off: you lose the viewer's toggle, and fixing one typo means re-rendering the whole clip. So keep the sidecar file as the source of truth and burn a copy only when a platform forces it.

Format subtitles people can actually read

Readable subtitles hold to a reading speed and a line budget. The BBC Subtitle Guidelines recommend 160–180 words per minute, a 37-character line-length limit, and a maximum of two lines per cue for landscape or square video. Push past that and viewers can't finish a line before it changes on them.
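Those three limits are easy to check mechanically. The sketch below assumes a simple `Cue` record with start and end times in seconds; the class, the constant names, and the choice to test against the upper end of the 160–180 range are all assumptions for illustration.

```python
from dataclasses import dataclass

MAX_CHARS = 37   # characters per line
MAX_LINES = 2    # lines per cue
MAX_WPM = 180    # upper end of the recommended reading speed


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str     # lines separated by "\n"


def check_cue(cue: Cue) -> list[str]:
    """List the readability limits this cue breaks (empty list means it passes)."""
    problems = []
    lines = cue.text.split("\n")
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines")
    for line in lines:
        if len(line) > MAX_CHARS:
            problems.append(f"line has {len(line)} chars")
    duration = cue.end - cue.start
    words = len(cue.text.split())
    if duration <= 0:
        problems.append("non-positive duration")
    elif words / duration * 60 > MAX_WPM:
        problems.append("reads faster than 180 wpm")
    return problems
```

A cue that fails the speed check usually needs more time on screen or a lighter edit of the text, not a third line.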

Break lines at natural clause boundaries, never mid-phrase. Don't split a name across cues or strand a preposition from its noun. Let each cue sit long enough to read – even a short line wants roughly a second on screen. Cramming three lines to fit a long sentence just loses people.
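A script can get you part of the way on line breaks. This sketch prefers a break right after punctuation and otherwise picks the most even split; it knows nothing about grammar, so it can still strand a preposition or split a name. The `break_cue` name and the punctuation-first rule are assumptions, not a published algorithm.

```python
MAX_CHARS = 37


def break_cue(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split cue text into one or two lines, preferring a break after punctuation."""
    text = " ".join(text.split())  # collapse whitespace and existing line breaks
    if len(text) <= limit:
        return [text]
    words = text.split(" ")
    candidates = []
    for i in range(1, len(words)):
        first = " ".join(words[:i])
        second = " ".join(words[i:])
        if len(first) <= limit and len(second) <= limit:
            after_punct = first[-1] in ",;:.?!"
            balance = abs(len(first) - len(second))
            # Sort key: punctuation breaks win, then the most even split.
            candidates.append((not after_punct, balance, [first, second]))
    if not candidates:
        raise ValueError("text does not fit in two lines; split it into two cues")
    return min(candidates, key=lambda c: (c[0], c[1]))[2]
```

When the function raises, the sentence is too long for one cue at this line length. Split it across two cues at a clause boundary by hand.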

Sync matters as much as wording. Subtitles should appear as the words are spoken and clear soon after. An off-by-a-second cue is more distracting than a small wording slip, so spot-check the timing against the audio using your transcript's timestamps before you export.
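If the spot-check shows every cue is early or late by the same amount, shift the whole file rather than retiming cue by cue. A minimal sketch for SRT, assuming a constant offset; `shift_srt` is a hypothetical helper, and it clamps at zero instead of producing negative times.

```python
import re

STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(srt_text: str, offset_ms: int) -> str:
    """Shift every SRT timestamp by offset_ms (negative moves cues earlier)."""

    def shift(match: re.Match) -> str:
        h, m, s, ms = (int(g) for g in match.groups())
        total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
        total = max(total, 0)  # clamp rather than emit a negative time
        h, rem = divmod(total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for line in srt_text.split("\n"):
        if "-->" in line:  # only touch timing lines, never the cue text
            line = STAMP.sub(shift, line)
        lines.append(line)
    return "\n".join(lines)
```

Drift that grows over the length of the video is a different problem, usually a frame-rate mismatch, and a constant shift won't fix it.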

How to add the subtitles to your video

Attaching a sidecar track is the clean route. For your own site, reference the .vtt file in a <track> element inside <video>, and the browser draws the toggle. For YouTube, Vimeo, or social platforms, upload the SRT in the caption settings. There's no re-encoding, and viewers keep control of whether text shows.
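For the own-site case, the markup is short. The sketch below builds it as a string so it can sit in a template or build script; the function name, the file names, and the choice of `kind="captions"` with `default` are assumptions. Use `kind="subtitles"` if the file carries dialogue only.

```python
from html import escape


def video_with_captions(video_src: str, vtt_src: str,
                        lang: str = "en", label: str = "English") -> str:
    """Build a <video> element with one WebVTT caption track."""
    return (
        f'<video controls src="{escape(video_src, quote=True)}">\n'
        f'  <track kind="captions" src="{escape(vtt_src, quote=True)}" '
        f'srclang="{escape(lang, quote=True)}" '
        f'label="{escape(label, quote=True)}" default>\n'
        f"</video>"
    )
```

The `default` attribute turns the track on at load; viewers can still switch it off from the player controls. Drop it if you would rather have captions start hidden.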

Burn in when you must. An editor (or ffmpeg) renders the subtitle file into the video for platforms that won't take a sidecar. Style for legibility: high-contrast text, a slight shadow or backing box, positioned bottom-center and clear of any lower-third graphics.
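With ffmpeg, burn-in goes through the `subtitles` video filter, which needs an ffmpeg build that includes libass. A minimal sketch that only builds the command, assuming simple file names: paths containing colons, quotes, or backslashes need extra escaping inside the filter argument, which this does not attempt. `burn_in_command` is a hypothetical helper.

```python
def burn_in_command(video: str, subs: str, output: str) -> list[str]:
    """Build an ffmpeg argv that renders a subtitle file into the picture."""
    return [
        "ffmpeg",
        "-i", video,
        # The subtitles filter draws the cues onto each frame (needs libass).
        "-vf", f"subtitles={subs}",
        # The video is re-encoded; the audio stream is copied untouched.
        "-c:a", "copy",
        output,
    ]
```

Pass the list to `subprocess.run(..., check=True)` to render. The filter also accepts a `force_style` option for outline, shadow, and box overrides; check the ffmpeg filter documentation for its quoting rules before relying on it.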

Either way, do the transcript first. You can get a timed, editable transcript in minutes, correct it, then export SRT or VTT, rather than typing and timing every cue by hand. The transcript is the real work; the subtitle file is just its export.

The steps, in order

  1. Get a timed transcript

    Upload the video or its audio for an AI first pass and get a timestamped, speaker-labeled draft in minutes instead of hours of typing from scratch.

  2. Correct the words and timing

    Read the draft against the audio. Fix names, jargon, numbers, and punctuation, and nudge cue timing so the text appears as the words are spoken.

  3. Format cues for readability

    Keep lines to about 37 characters, two lines per cue at most, and a reading speed near 160–180 words per minute. Break lines at clause boundaries.

  4. Export SRT or WebVTT

    Save an SRT for social platforms and most players, or WebVTT for your own site's HTML5 track. Keep one file as your editable master.

  5. Attach or burn in

    Load the sidecar file as a caption track, or burn it into the picture for platforms that autoplay muted or ignore uploaded caption files.

Tips from people who do this a lot

  • For a public or institutional video, treat the job as captions: include speaker IDs and key non-speech sounds so it meets the accessibility bar.

  • Keep the SRT or VTT as your source of truth. Burn a copy into the video only when a platform demands it, so a typo fix doesn't mean re-rendering the master.

  • Break subtitle lines at clause boundaries, never mid-phrase. A line that splits "the Ministry of / Health" reads worse than one a hair longer.

  • Never trust a raw auto-caption on a name or a number. Those are exactly where ASR fails and where a wrong subtitle is most visible on screen.

  • Spot-check timing against your timestamps. An off-by-a-second subtitle distracts viewers more than a small wording slip does.

How to add subtitles to a video – questions, answered

What's the difference between subtitles and captions?

Subtitles render spoken dialogue as text, often for viewers who can hear but don't share the language. Captions add speaker labels and non-speech sound like [music] or [applause] for deaf and hard-of-hearing viewers. They use the same file formats, but captions carry more. For accessibility, you want captions.

Can I just use YouTube's automatic subtitles?

For a quick personal clip, maybe. For anything public or institutional, no. University accessibility offices are clear that automatic captions are not accurate enough on their own for public or accommodation content. Automated captions need human editing before you can rely on them.

Which subtitle format should I use, SRT or WebVTT?

Use SRT for social platforms and most players, since it's plain-text and widely read. Use WebVTT for your own website, because it's the W3C format the HTML5 track element reads. Both are sidecar files you toggle on; keep one as your editable master and export the other when needed.

How long should each subtitle stay on screen?

Long enough to read comfortably. The BBC recommends a reading speed of 160–180 words per minute, lines up to 37 characters, and two lines per cue at most. Faster than that and viewers can't finish before the text changes. Even a short line should hold about a second.

Do subtitles have to match the audio word for word?

For captions, keep the speaker's actual words; light edits for readability are fine, but don't rewrite meaning. Match the timing so text appears as it's spoken, and mark sounds that matter to the story. Accuracy and sync matter more than perfect verbatim styling.

References

  1. WCAG 2.1 Understanding SC 1.2.2: Captions (Prerecorded), Level A. W3C Web Accessibility Initiative (WAI).
  2. Synchronized Media (adopts WCAG SC 1.2.2 for federal/public video). U.S. General Services Administration / Section508.gov.
  3. Identifying automated vs human-edited captions. University of Colorado Boulder, Digital Accessibility Office.
  4. How to create and edit accurate YouTube captions. Purdue University, College of Liberal Arts.
  5. BBC Subtitle Guidelines (reading rate, line length, line count). BBC.
  6. WebVTT: The Web Video Text Tracks Format. W3C.
  7. HTML <track> element (WebVTT is the track format). MDN Web Docs (Mozilla).
  8. Haberl et al. (2023), Take the aTrain (transcription time cost, citing Bell et al. (2018)). arXiv / University of Graz.
