Pepys
12,438,517 minutes transcribed

Speaker Diarization

Find out who said what – upload a file or paste a link and get a transcript split by speaker, with turn boundaries and talk-time per voice.


Accepts MP3, M4A, WAV, MP4 and other audio or video files – or a link · returns a who-said-what transcript with speaker labels, turn timestamps, and talk-time.

Speaker labels come from voice separation, not identity – Pepys tags distinct voices as Speaker 1, Speaker 2, and so on. It doesn't recognize anyone by name or voiceprint; you rename the labels to the real names yourself.

60 min free · no card required · we never train on your audio

Podcaster · Journalist · Content creator · Researcher · Student
Trusted by 100k+ users · Rated 4.9 out of 5 by 100k+ users

What is speaker diarization?

Pepys runs speaker diarization on your recording: it segments the audio by voice, labels each turn (Speaker 1, Speaker 2…), and reports talk-time per speaker. Upload a file or paste a link and, within minutes, get a timestamped who-said-what transcript in 99+ languages. Your first 60 minutes are free, no card.

How speaker diarization works

01

Upload audio or paste a link

Drop in a multi-speaker recording or paste a link – any common audio or video format, in any of 99+ languages.

02

Get diarized output

Pepys segments the audio by voice and labels each turn, with timestamps marking where each speaker starts and stops.

03

Rename, verify, and export

Swap generic labels for real names, check turns against the audio, then export to TXT, Markdown, DOCX, PDF, SRT, VTT, or structured JSON.

Speaker diarization answers one question a flat transcript can't: who is talking, and when? Pepys partitions a recording into speaker turns – Speaker 1, Speaker 2, and so on – so an interview, a panel, a focus group, or a two-host podcast reads as a clean back-and-forth instead of an undifferentiated wall of text. Each turn carries a start and end timestamp, and you get talk-time totals per voice for the questions that follow: who dominated, who barely spoke, where the handoffs happened.
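To make the talk-time idea concrete, here is a minimal Python sketch of how per-speaker totals and handoff points could be derived from a list of turns. The tuple layout and the sample values are assumptions made for illustration, not Pepys's internal format.

```python
from collections import defaultdict

# Hypothetical turns: (speaker label, start in seconds, end in seconds).
turns = [
    ("Speaker 1", 0.0, 12.5),
    ("Speaker 2", 12.5, 20.0),
    ("Speaker 1", 20.0, 31.0),
    ("Speaker 2", 31.0, 33.5),
]

def talk_time(turns):
    """Sum the duration of every turn, grouped by speaker label."""
    totals = defaultdict(float)
    for speaker, start, end in turns:
        totals[speaker] += end - start
    return dict(totals)

def handoffs(turns):
    """Return the start times of turns where the speaker changes."""
    return [cur[1] for prev, cur in zip(turns, turns[1:]) if prev[0] != cur[0]]

totals = talk_time(turns)
dominant = max(totals, key=totals.get)  # the voice with the most talk-time
```

Sorting `totals` by value answers "who dominated" and "who barely spoke"; `handoffs(turns)` gives the timestamps to jump to when checking a change of speaker against the audio.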

It's built for anyone who needs to know who said what – researchers coding qualitative interviews, journalists attributing quotes, devs piping speaker turns into a meeting-notes or analytics tool. The labels are yours to rename inline (Speaker 2 becomes "Dr. Okafor"), and every turn exports as structured JSON – each segment with its speaker, start and end timestamps, and per-speaker talk-time – or as a clean speaker-labeled transcript. We never train on your audio, and credits never expire.

Clean paragraphs. No more um's and ah's.

Pepys hands back logical paragraphs with the filler stripped out, punctuated and readable, instead of the raw, one-line-per-segment dump most transcribers leave you with. Here is a raw sample, before cleanup:

reel-voiceover.mp4 (raw)

um so yeah everyone keeps telling you to like lead with your best line right but uh honestly if you give away the whole answer in the first second you know there's basically no reason for anyone to keep watching so the hook isn't kind of the smartest thing you say it's like a loop you open that they need to close and um that's the part that actually keeps people around

  • Who-said-what turns with start/end timestamps and talk-time per speaker

  • Rename generic labels to real names inline – no re-running anything

  • Structured JSON export – segments, speaker labels, timestamps, and talk-time for your pipeline

  • 99+ languages, auto-detected · we never train on your audio · credits never expire

Works with the platforms you live in.

Paste a link from YouTube, TikTok, Instagram, Facebook, Spotify, or Apple Podcasts – or drop in any audio or video file. We transcribe it once, then you export it however your workflow needs.

  • YouTube
  • TikTok
  • Instagram
  • Facebook
  • Spotify
  • Apple Podcasts
  • or any file

Export to any format

  • TXT
  • Markdown
  • DOCX
  • PDF
  • SRT
  • VTT
  • JSON

Timestamps, speaker labels, and subtitle timing carry through to every export.
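
As an illustration of how speaker labels and timing can carry into a subtitle file, the sketch below formats segments as SRT cues. The segment fields (`speaker`, `start`, `end`, `text`) and the "Speaker: text" cue layout are assumptions for this example; the actual Pepys SRT output may differ.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Build numbered SRT cues, prefixing each cue's text with its speaker."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(cues)  # a blank line separates consecutive cues

# Hypothetical segments, for illustration only.
segments = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.2, "text": "Thanks for joining."},
    {"speaker": "Speaker 2", "start": 4.2, "end": 9.8, "text": "Glad to be here."},
]
srt = to_srt(segments)
```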

Speaker diarization – questions, answered

What is speaker diarization?

Speaker diarization is the process of partitioning a recording by who is speaking – segmenting the audio into turns and labeling each one (Speaker 1, Speaker 2…). It answers "who spoke when," separate from "what was said." Pepys does both: it diarizes and transcribes in one pass.

How is this different from plain transcription?

A plain transcript gives you the words. Diarization adds the speaker structure on top: turn boundaries, a label per voice, and talk-time totals. So a multi-speaker recording reads as an attributed back-and-forth instead of one continuous block.

How accurate is the speaker labeling?

Turn boundaries and labels are most reliable on clean audio with distinct voices. Heavy crosstalk, near-identical voices, or background noise can blur a turn or two – so labels start generic (Speaker 1, 2…) and you rename and correct any turn inline before exporting.

Does it tell me each speaker's name?

Diarization separates voices; it doesn't recognize identities. Speakers come out as Speaker 1, Speaker 2, and so on. You assign each real name once, inline, and the rename applies across every turn for that voice.
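
If you post-process an export yourself, the same one-rename-applies-everywhere behavior can be sketched in a few lines of Python. The segment fields and the `rename_speakers` helper are hypothetical, used here only to show the idea.

```python
def rename_speakers(segments, names):
    """Return new segments with generic labels swapped for real names.

    Labels that are missing from `names` are left unchanged, and the
    input list is not modified.
    """
    return [
        {**seg, "speaker": names.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]

# Hypothetical segments, for illustration only.
segments = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.2, "text": "Thanks for joining."},
    {"speaker": "Speaker 2", "start": 4.2, "end": 9.8, "text": "Glad to be here."},
    {"speaker": "Speaker 1", "start": 9.8, "end": 15.0, "text": "Let's begin."},
]
renamed = rename_speakers(segments, {"Speaker 2": "Dr. Okafor"})
```

Keeping the mapping separate from the segments means a wrong name is fixed in one place rather than turn by turn.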

Can I export the speaker data as structured JSON?

Yes. The JSON export gives you every segment with its speaker label, start and end timestamps, and per-speaker talk-time, in a Whisper-compatible shape you can feed straight into a script, analytics tool, or research workflow. For human-readable and subtitle use, export TXT, Markdown, DOCX, PDF, SRT, or VTT instead.
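
As a rough sketch of consuming such an export in Python: the field names below are assumptions modeled on Whisper's `segments` list (`start`, `end`, `text`), with an assumed `speaker` key per segment and an assumed top-level `talk_time` object. Check a real export for the exact keys before relying on them.

```python
import json

# Hypothetical export payload; the key names are assumptions, not a documented schema.
raw = """
{
  "segments": [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.0, "text": "What changed this year?"},
    {"speaker": "Speaker 2", "start": 4.0, "end": 10.5, "text": "Mostly the sample size."}
  ],
  "talk_time": {"Speaker 1": 4.0, "Speaker 2": 6.5}
}
"""

def fmt(seconds):
    """Format seconds as MM:SS, dropping fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def attributed_lines(export):
    """Turn each segment into a '[MM:SS] Speaker: text' line."""
    return [
        f"[{fmt(seg['start'])}] {seg['speaker']}: {seg['text']}"
        for seg in export["segments"]
    ]

export = json.loads(raw)
lines = attributed_lines(export)
```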


Speaker diarization – free to start

Pay as you go – credits never expire, nothing to cancel. Or start free with 60 minutes, no card.