Why is transcribing multiple speakers so much harder?
More voices means more error, and it climbs fast. On the DIHARD III benchmark, median diarization error rate (DER) stays below 10% on the easier, cleaner domains but rises to roughly 35–45% on the hardest multi-party, overlap-heavy audio. That hardest tier is meeting speech, web video, and restaurant recordings. The organizers are explicit that the difficulty is driven by the number of speakers and how much they overlap, not by the topic being discussed.
The specific thing that breaks is overlap. When two people talk at once, the tool has to detect that both are speaking and split the words between them – and that's the weak point. In one study, missed detections and false alarms from overlapped speech were the main error source, about twice as high as speaker confusion. So the transcript isn't usually wrong about who said a clean sentence; it's wrong at the exact moments people interrupt each other.
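For reference, DER is conventionally computed as the sum of three error durations divided by the total duration of reference speech. The symbol names below are our own shorthand, not notation taken from the benchmark.

```latex
\[
\mathrm{DER} =
  \frac{T_{\text{missed}} + T_{\text{false alarm}} + T_{\text{confusion}}}
       {T_{\text{reference speech}}}
\]
```

When overlapped speech is scored, a second voice the tool fails to detect during crosstalk adds to the missed-speech term, which is why overlap-heavy audio pushes the total up even when speaker confusion stays low.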
That reframes the job. A two-person interview is close to a solved problem for modern tools. A four-person panel with crosstalk is not. The fix is better-separated audio plus a targeted human pass over the overlaps.
How do you separate voices to transcribe multiple speakers?
The single biggest upgrade to speaker labeling happens before you press record: give each voice its own channel. When speakers are already separated in the audio, the tool transcribes one clean stream per person and stitches the turns together.
For remote calls, record per participant. Zoom's local recording can save a separate audio file for each participant, and several other recorders do the same – isolated tracks that make diarization (the who-said-what labeling) far more reliable. In a physical room, a lav mic clipped to each speaker beats one recorder in the middle of the table. Seat people so they're not talking over one another, and go around the table for names at the top.
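The stitching step can be sketched in a few lines. This is a minimal illustration, not how any particular tool does it. It assumes each participant's track has already been transcribed into timestamped segments and that all tracks share the same clock (they start at the same instant). The `Segment` type, the function names, and the `max_gap` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the start of the recording
    end: float
    speaker: str
    text: str


def stitch_tracks(tracks: dict[str, list[tuple[float, float, str]]]) -> list[Segment]:
    """Flatten per-speaker tracks into one timeline ordered by start time."""
    segments = [
        Segment(start, end, speaker, text)
        for speaker, track in tracks.items()
        for start, end, text in track
    ]
    return sorted(segments, key=lambda s: (s.start, s.end))


def merge_turns(segments: list[Segment], max_gap: float = 1.0) -> list[Segment]:
    """Join consecutive segments from the same speaker into a single turn."""
    turns: list[Segment] = []
    for seg in segments:
        last = turns[-1] if turns else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Same speaker resumed after a short pause: extend the current turn.
            turns[-1] = Segment(
                last.start, max(last.end, seg.end), last.speaker,
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(seg)
    return turns


# Hypothetical per-participant transcripts: (start, end, text) per segment.
tracks = {
    "Ana": [(0.0, 2.5, "Thanks for joining."), (2.8, 4.0, "Let's start.")],
    "Ben": [(4.5, 6.0, "Happy to be here.")],
}
turns = merge_turns(stitch_tracks(tracks))
for turn in turns:
    print(f"[{turn.start:.1f}s] {turn.speaker}: {turn.text}")
```

Because each track already belongs to one person, no speaker has to be guessed here. The only work is ordering and grouping, which is why separate channels make labeling so much more reliable.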
If all you have is a single mixed file, that's still workable – you'll just correct more speaker turns by hand, concentrated around the overlaps. Know that going in and budget the cleanup time for a busy multi-party recording.
Should you let AI label first, then fix by hand?
Yes – for many voices it's the only sane order of operations. Transcribing by hand can take up to about six hours of work for a single hour of audio, and that cost balloons when you're also deciding who said each line. An AI first pass turns that hour of audio into minutes of processing plus a focused cleanup, so you spend your attention where the machine is weak instead of retyping the parts it gets right.
Upload the file to a multi-speaker transcription tool and you'll get a speaker-labeled, timestamped draft to work from. Then read it against the audio and target the known failure points: the overlaps and crosstalk, plus proper nouns, jargon, and fast numbers. These are exactly the spots an attributable quote depends on.
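If your tool exports segment timings, you can build the list of overlaps to review instead of hunting for them by ear. The sketch below assumes the draft is available as `(start, end, speaker)` tuples in seconds. That shape, the `find_overlaps` name, and the `min_overlap` threshold are assumptions for illustration, not any specific tool's export format.

```python
from itertools import combinations


def find_overlaps(segments, min_overlap=0.2):
    """Return (start, end, speaker_a, speaker_b) for each stretch where
    two different speakers talk at once for at least `min_overlap` seconds."""
    flagged = []
    for (s1, e1, spk1), (s2, e2, spk2) in combinations(sorted(segments), 2):
        if spk1 == spk2:
            continue
        start, end = max(s1, s2), min(e1, e2)
        if end - start >= min_overlap:
            flagged.append((start, end, spk1, spk2))
    return sorted(flagged)


# Hypothetical diarized draft: (start, end, speaker) in seconds.
draft = [
    (0.0, 4.0, "Speaker 1"),
    (3.5, 6.0, "Speaker 2"),
    (6.1, 9.0, "Speaker 1"),
    (8.0, 8.1, "Speaker 3"),
]

for start, end, a, b in find_overlaps(draft):
    print(f"Review {start:.1f}s to {end:.1f}s: {a} and {b} overlap")
```

The pairwise comparison is quadratic in the number of segments, which is fine for a single interview or panel. The threshold filters out brief backchannels ("mm-hm") so the review list stays focused on real interruptions.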
Don't clean the whole thing to publication quality. Fix the speaker turns and the lines you'll actually quote or code; leave the rest searchable. Mark anything genuinely unclear as [inaudible] with its timestamp rather than guessing at a name or a word.
How do you keep who-said-what through editing and export?
Lock down two things before you edit: the labels and the verbatim style. Rename "Speaker 1/2/3" to real names or role labels once, near the top, and the attribution carries down the whole transcript. Then pick a verbatim style and apply it consistently, because in research that choice is an act of representation that shapes the analysis, not just how a line reads. Naturalized (strict verbatim) keeps every stammer and false start; denaturalized (clean) tidies grammar and drops filler.
Preserve the labels when you leave the tool. Exporting to DOCX with the speaker names intact means the who-said-what survives into your coding software or your draft, instead of collapsing into an anonymous block of text. Keep timestamps too, so any quote is one click from the audio for a fact-check.
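If you ever post-process a transcript yourself, the same two rules apply: map the generic labels to names once, and carry the timestamp on every line. The sketch below renders plain text rather than DOCX and assumes the draft is a list of `(start_seconds, label, text)` tuples. The data shape and function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(turns, names):
    """Apply the label-to-name mapping and keep a timestamp on every line."""
    lines = []
    for start, label, text in turns:
        speaker = names.get(label, label)  # unmapped labels are left as-is
        lines.append(f"[{format_timestamp(start)}] {speaker}: {text}")
    return "\n".join(lines)


# Hypothetical draft with generic labels from the first pass.
draft = [
    (0.0, "Speaker 1", "Can everyone confirm they agree to be recorded?"),
    (4.2, "Speaker 2", "Yes, that's fine."),
    (3725.0, "Speaker 1", "[inaudible] before we wrap up."),
]
names = {"Speaker 1": "Moderator", "Speaker 2": "Participant A"}

print(render_transcript(draft, names))
```

Keeping the mapping in one place means a mislabeled speaker is a one-line fix, and a label with no mapping stays visible as "Speaker 3" instead of silently disappearing.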
For a larger group where you're tracking themes across many voices – a focus group, for instance – consistent labels are what let you attribute a pattern to specific participants later, rather than a vague "someone said."
What are the consent rules for recording several people?
Get everyone's consent, and capture it in the recording. US law splits on this: federal law and most states allow one-party consent, but roughly a dozen states require all-party consent, and the rule can differ for in-person versus phone conversations. With multiple people on the recording, all-party consent is the safe default – assume the stricter rule applies.
Practically, that means a clear yes from each person before the substance starts, ideally on the record so it's timestamped in the audio itself. For a group call, ask at the top and let everyone answer. We can't give legal advice, and rules vary by country, so when in doubt, get explicit agreement before you record.
Then mind where the audio and transcript live, especially for sensitive or off-the-record material. Use a tool that doesn't train on your files and lets you delete them after processing. Pepys never trains on your audio or transcripts, and you can auto-delete files once they're transcribed.