How to transcribe multiple speakers

A working guide for researchers, journalists, and anyone turning a roomful of voices into an accurate, correctly attributed transcript.

The short answer

To transcribe multiple speakers, record each person on a separate channel where you can, then upload the audio for an AI first pass that labels every speaker turn and adds timestamps. Read the draft against the audio and fix the overlaps by hand, because overlapping speech is where automatic speaker labeling fails most. Keep the labels through export so who-said-what survives into your notes.

Why is transcribing multiple speakers so much harder?

More voices mean more error, and the error climbs fast. On the DIHARD III benchmark, median diarization error rate (DER) stays below 10% on the easier, cleaner domains but rises to roughly 35–45% on the hardest multi-party, overlap-heavy audio. That hardest tier covers meeting speech, web video, and restaurant recordings. The organizers are explicit that the difficulty is driven by the number of speakers and how much they overlap, not by the topic being discussed.

The specific thing that breaks is overlap. When two people talk at once, the tool has to detect that both are speaking and split the words between them – and that's the weak point. In one study, missed detections and false alarms from overlapped speech were the main error source, about twice as high as speaker confusion. So the transcript isn't usually wrong about who said a clean sentence; it's wrong at the exact moments people interrupt each other.
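To make the DER figures concrete, here is a minimal sketch, assuming the commonly used definition of DER: missed speech plus false-alarm speech plus speaker confusion, divided by the total reference speech time. The function name and the sample durations are illustrative and are not taken from the benchmark.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Return DER as a fraction of total reference speech time.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical recording with 600 s of reference speech; the numbers are made up.
der = diarization_error_rate(
    missed=60.0,        # speech the tool failed to detect
    false_alarm=40.0,   # non-speech the tool labeled as speech
    confusion=50.0,     # speech attributed to the wrong speaker
    total_speech=600.0,
)
print(f"DER: {der:.1%}")  # prints "DER: 25.0%"
```

In this made-up example the detection errors (missed plus false alarm) are twice the size of the confusion term, which mirrors the pattern described above.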

That reframes the job. A two-person interview is close to a solved problem for modern tools. A four-person panel with crosstalk is not. The fix is better-separated audio plus a targeted human pass over the overlaps.

How do you separate voices to transcribe multiple speakers?

The single biggest upgrade to speaker labeling happens before you press record: give each voice its own channel. When speakers are already separated in the audio, the tool transcribes one clean stream per person and stitches the turns together.
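The stitching step can be pictured with a short sketch. It assumes each speaker's track has already been transcribed into (start, end, text) segments; the `Segment` class, the `stitch_tracks` function, and the sample data are hypothetical and do not describe any particular tool's internals.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float
    text: str


def stitch_tracks(tracks: dict[str, list[tuple[float, float, str]]]) -> list[Segment]:
    """Merge per-speaker segment lists into one transcript ordered by start time."""
    merged = [
        Segment(speaker, start, end, text)
        for speaker, segments in tracks.items()
        for start, end, text in segments
    ]
    return sorted(merged, key=lambda seg: (seg.start, seg.end))


# Hypothetical per-track output: one list of (start, end, text) per channel.
tracks = {
    "Ana": [(0.0, 3.2, "Thanks for joining."), (7.5, 9.0, "Go ahead.")],
    "Ben": [(3.4, 7.1, "Happy to be here.")],
}

for seg in stitch_tracks(tracks):
    print(f"[{seg.start:.1f}s] {seg.speaker}: {seg.text}")
```

Because every segment already carries its speaker, attribution never depends on the tool telling voices apart; the only work left is ordering by time.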

For remote calls, record per participant. Zoom's local recording can save a separate audio file for each participant, and several other recorders do the same – isolated tracks that make diarization (the who-said-what labeling) far more reliable. In a physical room, a lav mic clipped to each speaker beats one recorder in the middle of the table. Seat people so they're not talking over one another, and go around the table for names at the top.

If all you have is a single mixed file, that's still workable – you'll just correct more speaker turns by hand, concentrated around the overlaps. Know that going in, and budget extra cleanup time for a busy multi-party recording.

Should you let AI label first, then fix by hand?

Yes – for many voices it's the only sane order of operations. Transcribing by hand can take about six hours of work for a single hour of audio, and that cost balloons when you're also deciding who said each line. An AI first pass replaces those hours with minutes of processing plus a focused cleanup, so you spend your attention where the machine is weak instead of retyping the parts it gets right.

Upload the file to a multi-speaker transcription tool and you'll get a speaker-labeled, timestamped draft to work from. Then read it against the audio and target the known failure points: the overlaps and crosstalk, plus proper nouns, jargon, and fast numbers. These are exactly the load-bearing spots for an attributable quote.
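One way to build a review list is to flag places where two speakers' segments intersect in time. This is a minimal sketch, assuming the draft can be exported as segments with start and end times and that overlapping speech shows up as intersecting time ranges, which not every tool provides. The field names and the sample draft are hypothetical.

```python
def find_overlaps(segments):
    """Return (earlier, later) pairs of segments from different speakers
    whose time ranges intersect."""
    ordered = sorted(segments, key=lambda s: s["start"])
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                break  # sorted by start, so nothing later can overlap `first`
            if second["speaker"] != first["speaker"]:
                overlaps.append((first, second))
    return overlaps


# Hypothetical draft segments; times are in seconds.
draft = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.0, "text": "So the budget is"},
    {"speaker": "Speaker 2", "start": 3.5, "end": 6.0, "text": "Sorry, which budget?"},
    {"speaker": "Speaker 3", "start": 6.0, "end": 8.0, "text": "The Q3 one."},
]

for first, second in find_overlaps(draft):
    print(f"Check {second['start']:.1f}s to {first['end']:.1f}s: "
          f"{first['speaker']} and {second['speaker']} overlap")
```

If the draft reports only one speaker at a time, this check finds nothing, and the overlaps still have to be located by listening.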

Don't clean the whole thing to publication quality. Fix the speaker turns and the lines you'll actually quote or code; leave the rest searchable. Mark anything genuinely unclear as [inaudible] with its timestamp rather than guessing at a name or a word.

How do you keep who-said-what through editing and export?

Lock down two things before you edit: the labels and the verbatim style. Rename "Speaker 1/2/3" to real names or role labels once, near the top, and the attribution carries down the whole transcript. Then pick a verbatim style and apply it consistently, because in research that choice is an act of representation that shapes the analysis, not just how a line reads. Naturalized (strict verbatim) keeps every stammer and false start; denaturalized (clean) tidies grammar and drops filler.
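If you work on an exported draft outside the tool, the rename-once idea amounts to a single label mapping applied to every segment. The sketch below assumes segments are plain dictionaries with a "speaker" field; `rename_speakers` and the sample names are hypothetical.

```python
def rename_speakers(segments, names):
    """Return new segments with generic labels replaced by real names or roles.

    Labels missing from `names` are left unchanged, and the input is not modified.
    """
    return [
        {**seg, "speaker": names.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]


# Hypothetical draft and mapping.
draft = [
    {"speaker": "Speaker 1", "start": 0.0, "text": "Welcome, everyone."},
    {"speaker": "Speaker 2", "start": 2.8, "text": "Thanks for having us."},
    {"speaker": "Speaker 3", "start": 5.1, "text": "Glad to be here."},
]
names = {"Speaker 1": "Moderator", "Speaker 2": "Participant A"}

named = rename_speakers(draft, names)
for seg in named:
    print(f"{seg['speaker']}: {seg['text']}")
```

Leaving unmapped labels untouched makes a forgotten speaker easy to spot, since "Speaker 3" stands out in the renamed transcript.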

Preserve the labels when you leave the tool. Exporting to DOCX with the speaker names intact means the who-said-what survives into your coding software or your draft, instead of collapsing into an anonymous block of text. Keep timestamps too, so any quote is one click from the audio for a fact-check.
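For readers assembling their own plain-text export, here is a minimal sketch of a layout that keeps both the speaker label and the timestamp on every line. The segment fields, function names, and sample lines are hypothetical, and the format is one reasonable choice rather than any tool's actual export.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def to_text(segments) -> str:
    """Render segments as '[HH:MM:SS] Speaker: text', one turn per line."""
    lines = [
        f"[{format_timestamp(seg['start'])}] {seg['speaker']}: {seg['text']}"
        for seg in segments
    ]
    return "\n".join(lines) + "\n"


# Hypothetical cleaned-up transcript, including an unclear passage.
transcript = [
    {"speaker": "Moderator", "start": 12.0, "text": "What changed after the pilot?"},
    {"speaker": "Participant A", "start": 3725.4, "text": "We cut it to [inaudible] weeks."},
]

print(to_text(transcript), end="")
```

Keeping the timestamp at the start of each line means an [inaudible] mark always sits next to the time needed to find it in the audio.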

For a larger group where you're tracking themes across many voices – a focus group, for instance – consistent labels are what let you attribute a pattern to specific participants later, rather than a vague "someone said."

What are the consent rules for recording several people?

Get everyone's consent, and capture it in the recording. US law splits on this: federal law and most states allow one-party consent, but roughly a dozen states require all-party consent, and the rule can differ for in-person versus phone conversations. With multiple people on the recording, all-party consent is the safe default – assume the stricter rule applies.

Practically, that means a clear yes from each person before the substance starts, ideally on the record so it's timestamped in the audio itself. For a group call, ask at the top and let everyone answer. We can't give legal advice, and rules vary by country, so when in doubt, get explicit agreement before you record.

Then mind where the audio and transcript live, especially for sensitive or off-the-record material. Use a tool that doesn't train on your files and lets you delete them after processing. Pepys never trains on your audio or transcripts, and you can auto-delete files once they're transcribed.

The steps, in order

01. Separate each speaker at the source
Record per participant where you can – Zoom's per-participant files or a lav mic on each person – so voices stay separable. Go around the room for names and consent up top.

02. Upload for an AI first pass
Drop the file in for a speaker-labeled, timestamped draft in minutes, instead of hours spent typing and deciding who said each line by hand.

03. Fix the overlaps first
Read the draft against the audio and target the crosstalk, where automatic labeling fails most. Also check proper nouns, jargon, and numbers; mark unclear spots [inaudible] with timestamps.

04. Name the speakers and set a style
Rename Speaker 1/2/3 to real names or roles once, near the top, so attribution carries down. Pick one verbatim style, strict or clean, and apply it consistently.

05. Export with labels and timestamps intact
Export to DOCX or TXT with speaker names preserved so who-said-what survives into your coding tool or draft. Keep timestamps for fact-checking, then store or delete the audio.

Tips from people who do this a lot

Per-speaker recording is the biggest lever on multi-speaker accuracy – far more than any setting in the transcription tool. Separated tracks mean the machine never has to guess during crosstalk.
Seat people so they don't naturally talk over each other, and gently ask a group to speak one at a time. Every avoided overlap is a speaker turn you won't fix by hand.
Rename speakers once at the top, not line by line. Fixing the label near the first turn carries the correct attribution through the whole transcript.
Budget extra cleanup time for a single mixed file with three or more voices – the overlaps are where you'll spend it, so read those passages closely.
Keep an un-redacted master with real names in a secure place, and do any anonymization in a copy, so you never lose the original attribution if you need to verify a quote.

How to transcribe multiple speakers – questions, answered

How many speakers can be transcribed at once?

There's no hard cap, but accuracy drops as voices and overlap increase. On research benchmarks, diarization error climbs from under 10% on clean audio to 35–45% on the hardest multi-party recordings. Separating each speaker onto their own channel is what keeps a larger group accurate.

Why does the transcript mix up who said what?

Almost always because of overlapping speech. When two people talk at once, the tool has to detect both and split the words, and that's its weakest point: in one study, overlapped speech was the main error source, ahead of speaker confusion. Recording each speaker on a separate channel fixes most of it.

How do I get accurate speaker labels for a group?

Record each person on their own channel where you can, using per-participant recording or separate lav mics, so the tool isn't guessing during crosstalk. With a single mixed file you'll still get labels, but expect to correct more turns by hand around the overlaps.

Do I need consent from everyone I record?

Get consent from each person and capture it in the recording. Most US states allow one-party consent, but roughly a dozen require all-party consent, and rules differ for phone calls and by country. With several people recorded, treat all-party consent as the safe default.

Will my audio be kept or used to train AI?

Not with Pepys. We never train AI on your audio or transcripts, and you can auto-delete files after they're processed – which matters for group recordings with sensitive or off-the-record participants who need that assurance.

References

1. Ryant et al. (2021), The Third DIHARD Diarization Challenge – DER by domain – Interspeech 2021 / arXiv
2. Bredin & Laurent (2021), End-to-end speaker segmentation for overlap-aware resegmentation – Interspeech 2021 / arXiv
3. Haberl et al. (2023), Take the aTrain – transcription time cost, citing Bell et al. (2018) – arXiv / University of Graz
4. Starting a computer recording – separate audio file per participant – Zoom Support
5. Introduction to the Reporter's Recording Guide (state-by-state consent laws) – Reporters Committee for Freedom of the Press
6. Oliver, Serovich & Mason (2005), Constraints and Opportunities with Interview Transcription – Social Forces (Oxford University Press)
