Video transcription, built for the edit
Drop in the footage or paste a link – get a speaker-labeled, click-to-seek transcript plus ready-to-burn captions, so you can cut to the words instead of scrubbing the timeline.
60 min free · no card required · we never train on your audio
How do you transcribe videos?
To transcribe a video, upload the file or paste its link and Pepys returns a speaker-labeled, time-coded transcript in minutes – plus exportable SRT and VTT captions and a quick AI summary. It's pay-as-you-go with no subscription, and credits never expire.
Made for videographers
Every project lives twice: once as footage on a card and once as a timeline you have to assemble. The hardest part of the assembly is finding the right words buried across hours of interviews, b-roll banter, and run-and-gun audio. A transcript turns that haystack into a document you can read, search, and cut from – so you spend your hours coloring and shaping, not scrubbing back and forth hunting for the one line that makes the cut.
The reality of a paper edit is matching what was said to where it lives, and that means word-level timestamps you can click to seek and speaker labels that keep your subjects from collapsing into one block. Pick your selects on the page, mark your circle-takes, then export frame-accurate SRT and VTT straight into Premiere or DaVinci Resolve. Voice-driven video transcription means you build the cut around the line that lands and let the footage follow it, instead of the other way around.
Clean paragraphs. No more um's and ah's.
The left is what Pepys hands back – logical paragraphs with the filler stripped out, punctuated and readable. The right is the raw, one-line-per-segment dump most transcribers leave you with.
Raw
um so yeah everyone keeps telling you to like lead with your best line right but uh honestly if you give away the whole answer in the first second you know there's basically no reason for anyone to keep watching so the hook isn't kind of the smartest thing you say it's like a loop you open that they need to close and um that's the part that actually keeps people around
Captions for every cut
Frame-accurate SRT and VTT files that drop straight into your NLE or your social uploads, no retyping.
A paper edit you can read
A clean, time-coded transcript so you can mark your selects on the page before you ever touch the timeline.
Find any line in seconds
A searchable transcript that jumps you to the exact frame a phrase was spoken, instead of scrubbing for it.
Pull the soundbites
A quick summary surfaces the strongest lines, so the clips you cut for the highlight reel write themselves.
Built in, not bolted on
A searchable transcript, summary, and captions – the moment it uploads
Every video is analyzed automatically the moment it’s transcribed. Here’s a real sample, run through it.
On-Set Memo: Shooting the Hartley Wedding So the Edit Cuts Itself
A two-camera wedding shoot planned out loud before call time. The locked wide is the safety net while the long lens hunts reactions, and the whole approach is built around the vows audio, because the edit is cut to the voice first and the picture is built around it. Lav-and-backup-recorder audio, exposing for faces against a blown-out window, grabbing b-roll and safe portrait frames early, mirrored cards and fresh batteries, and a two-drive backup before leaving all serve one goal: never lose the moment the couple actually paid for.
Key points
- Two-camera plan: the A-cam wide of the altar is the locked safety shot, while the B-cam long lens lives on faces – "The story is in the reactions, not the wide."
- Audio is treated as the make-or-break: a lav on the officiant plus a backup recorder on the lectern, because "If the lav fails, the whole ceremony is unusable".
- Expose for the couple against the harsh west window: "A blown window looks intentional. A muddy gray face looks like a mistake."
- The edit is voice-first: "Find the line, then find the frame. The voice drives the cut, never the other way around."
- Capture b-roll and the five known-good portrait frames early: "Get the safe shots before you get the pretty shots", since golden hour is roughly twenty minutes of light.
- Protect the footage: mirror to two cards per body, fresh batteries at three forty-five, and back up to two drives before leaving – "The footage doesn't exist until it's in two places."
Clean, speaker-labeled, click-to-seek
Ask, don’t scrub
Ask the transcript anything.
An hour-long recording? Don’t skim it – ask. Every answer stays grounded in your transcript and cites the exact timestamp, so you can jump to the moment and check it yourself.
What's the audio plan for the ceremony, and what's the backup if it fails?
She's putting a lav on the officiant plus a recorder in his jacket pocket, because the on-camera mic is garbage at thirty feet. If the lav fails the whole ceremony is unusable, so she's also running a backup recorder on the lectern.
Why does she expose for the couple's faces and let the window blow out?
The four o'clock sun comes straight through the big west window behind the altar, so the couple will be backlit. She exposes for their faces and lets the window go, on the logic that a blown window looks intentional while a muddy gray face looks like a mistake.
Grounded in your transcript – if the answer isn’t in the audio, it says so instead of guessing.
Who said what
Speaker labels that survive cross-talk
Automatic speaker diarization. Two people, four people, cross-talk and interruptions – interviews, panels, messy meetings. Pepys keeps each voice on its own line instead of blurring them into one, so you never rewind to figure out who was talking.
So the festival nearly didn't happen this year–
–it almost didn't. We lost the venue three weeks out.
Three weeks? How do you even start to–
You call everyone you know. The whole town pitched in.
And that's how it ended up in the park.
Works with the platforms you live in.
Paste a link from YouTube, TikTok, Instagram, Facebook, Spotify, or Apple Podcasts – or drop in any audio or video file. We transcribe it once, then you export it however your workflow needs.
- YouTube
- TikTok
- Instagram
- Facebook
- Spotify
- Apple Podcasts
- or any file
Export to any format
- TXT
- Markdown
- DOCX
- SRT
- VTT
- JSON
- PDF
Most useful for videographers: SRT · VTT · TXT · DOCX · PDF
Timestamps, speaker labels, and subtitle timing carry through to every export.
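For a sense of what carries through, here is a minimal Python sketch of how time-coded, speaker-labeled segments map onto SRT cues. The `Segment` shape, the `to_srt` helper, and the sample timings are assumptions made for illustration, not Pepys's actual export schema or code; only the SRT cue layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is standard.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical transcript segment; field names are illustrative only."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    speaker: str
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues, keeping the speaker label in the text."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.speaker}: {seg.text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        Segment(0.0, 2.5, "Speaker 1", "We lost the venue three weeks out."),
        Segment(2.5, 5.0, "Speaker 2", "The whole town pitched in."),
    ]
    print(to_srt(sample))
```

VTT follows the same cue idea, with a `WEBVTT` header line and a period instead of a comma before the milliseconds.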
How video transcription works
Upload or paste a link
Drop your video or paste its link – any audio or video, in any language.
Get your transcript
A clean, speaker-labeled transcript with AI notes tuned to your format, ready in minutes.
Edit and export
Fix anything inline, then export to SRT, VTT, TXT, DOCX, PDF, or JSON.
Why videographers pick Pepys
No subscription – pay per video, and credits never expire between shoots.
Captions are built in, not a separate caption tool to round-trip through.
Paste a YouTube, Vimeo, or direct video link – no exporting the file first.
Speaker labels keep your interview subjects from blurring into one block of text.
What videographers say
captions, chapters AND a hook breakdown straight off the upload. i pull 3 shorts out of every long video now. huge.
Daniel K., YouTube creator · Product Hunt
I transcribe in the original language and receive a translated version with the subtitles still intact. It saved an entire round of contractor work on my last film. Thank you for building this.
Giulia F., Documentary filmmaker · email
every module comes back captioned with a handout written from the transcript. launch prep went from a week to an afternoon, wish id found this sooner honestly.
Alina M., Course creator · Reddit
Video transcription – questions, answered
How do I transcribe a video?
Upload the video file or paste its link (YouTube, Vimeo, or a direct URL) and Pepys returns a speaker-labeled, time-coded transcript in minutes, along with a short AI summary and exportable captions. You don't need to strip the audio out first.
Can I get burn-in or sidecar captions for my edit?
Yes. Every video exports to SRT and VTT, both frame-accurate and ready to import into Premiere, DaVinci Resolve, Final Cut, or a social uploader. Edit any wording inline before you export.
Does it separate the people speaking in an interview?
Yes. Speaker diarization splits each voice, so a multi-person interview or a two-subject piece comes back labeled rather than as one wall of text. Rename "Speaker 1" to your subject's name and it updates everywhere.
Can I do a paper edit from the transcript?
That's the point. The transcript is time-coded and click-to-seek, so you can read the whole shoot, mark your selects on the page, and jump straight to the frame each line was spoken before you build the timeline.
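If you want to script the same idea against an exported transcript, the lookup behind "find the line, then find the frame" can be sketched in a few lines of Python. This is an illustration only: the `Word` shape, the `find_phrase` helper, and the sample timings are assumptions, and the actual layout of a Pepys export may differ.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word-level entry; field names are illustrative only."""
    text: str
    start: float  # seconds from the start of the video


def _normalize(token: str) -> str:
    """Lowercase a token and drop surrounding punctuation so matches are forgiving."""
    return token.lower().strip(".,!?;:\"'")


def find_phrase(words: list[Word], phrase: str) -> list[float]:
    """Return the start time of every place the phrase is spoken."""
    target = [_normalize(t) for t in phrase.split()]
    if not target:
        return []
    tokens = [_normalize(w.text) for w in words]
    hits = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i].start)
    return hits


if __name__ == "__main__":
    # Timings below are made up for the example.
    line = [
        Word("Find", 0.0), Word("the", 0.3), Word("line,", 0.5),
        Word("then", 0.9), Word("find", 1.2), Word("the", 1.5),
        Word("frame.", 1.7),
    ]
    print(find_phrase(line, "find the"))
```

Each returned value is a seek position in seconds, which is the number you would convert to timecode when marking a select.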
What can I export for a project?
SRT and VTT captions, plain text, a DOCX, and a PDF of the transcript. One click each, and the timecodes stay intact so everything lines up back in your NLE.
How does it handle on-location audio and accents?
It auto-detects the spoken language across 99+ languages and handles a range of accents and noisier run-and-gun audio. Anything it mishears you can fix inline in the editor before exporting.
Do I have to subscribe?
No. Pepys is pay-as-you-go – buy a block of hours, use them across as many shoots as you like, and the credits never expire. You can start free with 60 minutes, no card.
Turn your next shoot into a searchable transcript and ready-to-burn captions – and pay only for that video.
Pay as you go – credits never expire, nothing to cancel. Or start free with 60 minutes, no card.