Transcription for accessibility, done right
Drop the video or paste the link – get a clean, speaker-labeled transcript you can correct in minutes, then export the caption files and the on-page transcript your viewers actually need.
60 min free · no card required · we never train on your audio
How do you transcribe videos?
Transcription for accessibility means turning a video into a corrected, time-coded transcript you can export as caption files (SRT and VTT) plus a readable on-page transcript. Pepys returns a speaker-labeled draft in minutes that you fix and ship, so captions are accurate rather than auto-generated guesses. It's pay-as-you-go with no subscription, and credits never expire.
Made for accessibility teams
Your team is on the hook for making video usable by everyone – the deaf viewer who needs accurate captions, the screen-reader user who needs a transcript, the auditor who needs proof it was actually done. The trap is the automatic caption track that looks finished and isn't: it mangles names, drops numbers, and runs three panelists together as one block, and nobody catches it until a complaint does. You don't need to type every word from silence. You need a clean, time-coded draft you can correct and ship as captions and a transcript – so access becomes how you publish instead of a cleanup you keep redoing.
In practice the correction pass is where transcription for accessibility lives or dies: you scrub for the proper nouns and figures auto-captions fumble, then confirm the timing lines up before anything ships. Word-level timestamps put each fix where the caption breaks, speaker labels split a panel into named voices instead of one wall, and full-text search jumps you to the term a complaint flagged. From that same draft you export the SRT and VTT the player reads and the on-page transcript a screen reader announces – one source, both deliverables.
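To make the role of word-level timestamps concrete, here is a minimal sketch of how timed words could be packed into caption cues, so that each cue inherits the start of its first word and the end of its last. The `Word` and `Cue` shapes, the `group_words_into_cues` name, and the character limit are assumptions for illustration only, not Pepys's internals or export schema.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the video
    end: float


@dataclass
class Cue:
    start: float
    end: float
    text: str


def _close_cue(words):
    """Build one cue spanning the first word's start to the last word's end."""
    return Cue(words[0].start, words[-1].end, " ".join(w.text for w in words))


def group_words_into_cues(words, max_chars=42):
    """Greedily pack words into cues whose text stays within max_chars."""
    cues = []
    current = []
    for word in words:
        candidate = " ".join(w.text for w in current + [word])
        if current and len(candidate) > max_chars:
            # Adding this word would overflow the cue, so break before it.
            cues.append(_close_cue(current))
            current = [word]
        else:
            current.append(word)
    if current:
        cues.append(_close_cue(current))
    return cues


# Hypothetical timings for a line from the sample dialogue.
words = [
    Word("We", 0.0, 0.2), Word("lost", 0.2, 0.5), Word("the", 0.5, 0.6),
    Word("venue", 0.6, 1.0), Word("three", 1.0, 1.3), Word("weeks", 1.3, 1.6),
    Word("out.", 1.6, 2.0),
]
cues = group_words_into_cues(words, max_chars=20)
for cue in cues:
    print(f"{cue.start:.1f}-{cue.end:.1f}: {cue.text}")
```

Because every word carries its own timing, correcting a name or a number inside a cue does not disturb where that cue starts or ends.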
Clean paragraphs. No more ums and ahs.
The left is what Pepys hands back – logical paragraphs with the filler stripped out, punctuated and readable. The right is the raw, one-line-per-segment dump most transcribers leave you with.
um so yeah everyone keeps telling you to like lead with your best line right but uh honestly if you give away the whole answer in the first second you know there's basically no reason for anyone to keep watching so the hook isn't kind of the smartest thing you say it's like a loop you open that they need to close and um that's the part that actually keeps people around
Corrected caption files
A time-coded transcript you fix in minutes, then export as SRT and VTT – accurate captions instead of an auto track nobody checked.
On-page transcript
The full readable transcript to publish under each video for screen-reader users – and for the search engines that index the page.
Speaker-labeled panels
Multi-speaker talks come back with each voice separated, so a panel or interview reads as distinct people rather than one undifferentiated wall.
Back-catalogue captioning
A whole archive of legacy video turned into correctable drafts, so a small team can work through hundreds of clips instead of typing each from silence.
Built in, not bolted on
A corrected transcript, caption files, and a readable on-page version
Every video is analyzed automatically the moment it’s transcribed. Here’s a real sample, run through it.
Why Auto-Captions Aren't Accessible (and How to Fix It)
In this staff captioning clinic, a digital accessibility specialist makes the case that an unchecked auto-caption track is not an accessible video – it fails exactly on the names, numbers, and noisy passages where accuracy decides meaning. The fix is to start from a real, time-coded transcript, correct the words and speaker labels, then export it as a caption file, which is the only realistic way a small team clears a large back catalogue. The session also draws the line between captions (timed, on-screen) and a transcript (the full readable document), argues that speaker labels are comprehension rather than polish, and closes on building access into publishing so it stops being a one-time cleanup that rots.
Key points
- An auto-caption track that nobody checked is not an accessible video – auto-captions fail exactly where it matters: names, technical terms, numbers, and anything said over background noise.
- Don't treat accuracy as a percentage: a deaf viewer doesn't experience eighty-percent-correct as eighty percent of the meaning, because one wrong word in a sentence can flip what the whole sentence means.
- Work from a corrected draft, not the green checkmark: you start from a real transcript and you correct it, fix the names and the numbers, then export it as a caption file – the only way three people clear four hundred videos.
- Captions and transcripts are different deliverables: captions are timed to the video and appear on screen as it plays, while a transcript is the full text as one readable document – you want both.
- Speaker labels are comprehension, not a nicety: when several people are speaking, an undifferentiated wall of caption text loses the thread, so mark who is speaking.
- The closing principle: caption for meaning, not just for words, and build access into publishing so it becomes how you publish instead of a one-time cleanup that quietly rots.
Clean, speaker-labeled, click-to-seek
Ask, don’t scrub
Ask the transcript anything.
An hour-long recording? Don’t skim it – ask. Every answer stays grounded in your transcript and cites the exact timestamp, so you can jump to the moment and check it yourself.
Why isn't turning on the platform's automatic captions enough for accessibility?
The facilitator calls an auto-caption track that nobody checked not an accessible video, just a liability with a green checkmark next to it. He says auto-captions fail exactly where it matters – names, technical terms, numbers, and anything said over background noise – and that a deaf viewer doesn't experience eighty-percent-correct as eighty percent of the meaning, because one wrong word can flip the whole sentence.
What's the difference between captions and a transcript, and do speaker labels matter?
Captions are timed to the video and appear on screen as it plays, while a transcript is the full text as one readable document with no timing – and you want both. On labels, he's blunt that they're comprehension, not a nicety: an undifferentiated wall of caption text loses the thread when several people are speaking, so mark who is talking.
Grounded in your transcript – if the answer isn’t in the audio, it says so instead of guessing.
Who said what
Speaker labels that survive cross-talk
Automatic speaker diarization. Two people, four people, cross-talk and interruptions – interviews, panels, messy meetings. Pepys keeps each voice on its own line instead of blurring them into one, so you never rewind to figure out who was talking.
So the festival nearly didn't happen this year–
–it almost didn't. We lost the venue three weeks out.
Three weeks? How do you even start to–
You call everyone you know. The whole town pitched in.
And that's how it ended up in the park.
Record in any language – 99+ detected automatically
- English
- 中文
- Español
- العربية
- हिन्दी
- Français
- 日本語
- Português
- Русский
- Deutsch
- 한국어
- Italiano
- বাংলা
- Türkçe
- فارسی
- Tiếng Việt
- தமிழ்
- Polski
- ไทย
- Українська
- Nederlands
- עברית
- Ελληνικά
- తెలుగు
- Bahasa Indonesia
- اردو
- Svenska
- मराठी
- Română
- Magyar
- Čeština
- ગુજરાતી
- Kiswahili
- ქართული
- Tagalog
- አማርኛ
Works with the platforms you live in.
Paste a link from YouTube, TikTok, Instagram, Facebook, Spotify, or Apple Podcasts – or drop in any audio or video file. We transcribe it once, then you export it however your workflow needs.
- YouTube
- TikTok
- Spotify
- Apple Podcasts
- or any file
Export to any format
- TXT
- Markdown
- DOCX
- SRT
- VTT
- JSON
Most useful for accessibility teams: SRT · VTT · Transcript (TXT) · DOCX · JSON
Timestamps, speaker labels, and subtitle timing carry through to every export.
How transcription for accessibility works
Upload or paste a link
Drop your video or paste its link – any audio or video, in any language.
Get your transcript
A clean, speaker-labeled transcript with AI notes tuned to your format, ready in minutes.
Edit and export
Fix anything inline, then export to SRT, VTT, TXT, Markdown, DOCX, or JSON.
Why accessibility teams pick Pepys
No subscription – pay per video you caption, and the credits never expire while you work through a back catalogue.
You get a correctable draft, not a locked auto-caption track – fix the names and numbers, then export, so captions are accurate by the time they ship.
Captions and a readable transcript come out of one pass: SRT and VTT for the player, plain text and DOCX for the page.
Speaker labels keep a multi-person panel separated, so a deaf viewer can follow who is speaking instead of reading one merged block.
What accessibility teams say
I transcribe in the original language and receive a translated version with the subtitles still intact. It saved an entire round of contractor work on my last film. Thank you for building this.
Giulia F. · Documentary filmmaker · email
every module comes back captioned with a handout written from the transcript. launch prep went from a week to an afternoon, wish id found this sooner honestly.
Alina M. · Course creator · Reddit
transcribe once, deliver in another language with the timing preserved. the part of subtitling i used to dread is just... done now.
Lucas D. · Subtitle translator · X
Transcription for accessibility – questions, answered
How is this different from the automatic captions my platform already generates?
Auto-captions are a first guess that fails on names, technical terms, numbers, and anything said over background noise – and usually nobody checks them. Pepys gives you a clean, time-coded transcript you correct in minutes and then export as a caption file, so what ships is accurate rather than an unchecked guess.
What's the difference between captions and a transcript, and can I get both?
Captions are timed to the video and appear on screen as it plays; a transcript is the full text as one readable document with no timing required. You want both, and you get both from a single pass: export SRT and VTT caption files for the player, and a TXT or DOCX transcript to publish on the page.
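As a rough sketch of what "one source, both deliverables" can look like in practice, the snippet below renders the same corrected segments three ways: as SRT, as WebVTT, and as an untimed transcript. The segment fields (`start`, `end`, `speaker`, `text`), the speaker names, and the timings are assumptions for illustration, not the shape of a Pepys export. The SRT and WebVTT layouts follow the public formats.

```python
def format_timestamp(seconds, separator):
    """Format seconds as HH:MM:SS<sep>mmm (SRT uses ',', WebVTT uses '.')."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def timing_line(segment, separator):
    start = format_timestamp(segment["start"], separator)
    end = format_timestamp(segment["end"], separator)
    return f"{start} --> {end}"


def to_srt(segments):
    """SRT: numbered cues, comma before milliseconds, blank line between cues."""
    blocks = [
        f"{index}\n{timing_line(seg, ',')}\n{seg['speaker']}: {seg['text']}"
        for index, seg in enumerate(segments, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


def to_vtt(segments):
    """WebVTT: 'WEBVTT' header, dot before milliseconds, <v> tag names the speaker."""
    blocks = ["WEBVTT"] + [
        f"{timing_line(seg, '.')}\n<v {seg['speaker']}>{seg['text']}"
        for seg in segments
    ]
    return "\n\n".join(blocks) + "\n"


def to_transcript(segments):
    """On-page transcript: speaker-labeled lines with no timing."""
    return "\n".join(f"{seg['speaker']}: {seg['text']}" for seg in segments) + "\n"


# Hypothetical corrected segments; names and timings are made up.
segments = [
    {"start": 0.0, "end": 2.5, "speaker": "Host",
     "text": "So the festival nearly didn't happen this year"},
    {"start": 2.5, "end": 5.25, "speaker": "Guest",
     "text": "It almost didn't. We lost the venue three weeks out."},
]

print(to_srt(segments))
print(to_vtt(segments))
print(to_transcript(segments))
```

The captions keep the timing so the player can show each line at the right moment, while the transcript drops it so the text reads as a single document.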
Can it tell speakers apart on a panel or interview?
Yes. Speaker diarization separates each voice, so a two- or three-person video comes back labeled rather than as one undifferentiated wall of text. You can rename a speaker once and it updates everywhere, which matters when a viewer needs to follow who is talking.
We have a large back catalogue and a small team. How do we get through it?
You correct drafts instead of typing from silence. Each video comes back as a time-coded transcript you fix – names, numbers, the noisy moments – and then export. Editing a good draft is a different, faster job than transcribing from scratch, which is how a small team clears hundreds of legacy videos. Credits never expire, so you can work at the pace your funding allows.
What caption and transcript formats can I export?
SRT and VTT caption files that drop straight into your player, plus a plain-text or DOCX transcript for the page, and JSON if you need to pipe it into another system. One click each.
Does it handle other languages for multilingual content?
Yes. It auto-detects the spoken language across 99+ languages, so a video in another language transcribes without you changing a setting. You can also transcribe in the original language and get a translated version with the caption timing preserved.
Do I have to commit to a monthly plan?
No. Pepys is pay-as-you-go – buy a block of hours, use them across however many videos you caption, and the credits never expire. You can start free with 60 minutes, no card.
Turn your next video into accurate captions and a transcript – and pay only for that video.
Pay as you go – credits never expire, nothing to cancel. Or start free with 60 minutes, no card.