What is speaker diarization?

A plain explanation for researchers and journalists – what 'who spoke when' means, how it works, how accurate it is, and where the privacy lines fall.

The short answer

Speaker diarization is the task of labeling a recording by speaker – 'who spoke when' – without knowing anyone's real identity or even how many people are talking. It splits audio into speech segments, groups them by voice, and tags each with a generic label like Speaker 1. It runs separately from speech recognition, which turns speech into words.

What is speaker diarization?

Speaker diarization labels an audio or video recording with classes that correspond to each speaker – in the field's own shorthand, 'who spoke when' (Park et al., 2022). It doesn't give you the words. It gives you the turns: this stretch is Speaker 1, that stretch is Speaker 2. The task is established enough to carry a standards-body name – NIST's Rich Transcription evaluations call it 'Who Spoke When'.

Diarization began as a front end for automatic speech recognition (ASR), then gained its own value as a standalone task (Park et al., 2022). In a finished transcript you read the two together: ASR supplies the words, diarization supplies the speaker turns. For an interview or a panel, the diarization is what makes a line attributable to a source, which is why it sits at the center of any workflow that pulls attributable quotes from a two-speaker recording.
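To make the 'words plus turns' idea concrete, here is a minimal sketch of how the two outputs could be combined. It assumes the ASR step gives word-level timestamps and the diarization step gives time-stamped turns. The Word and Turn types and the attribute_words function are hypothetical names for illustration, not any particular product's API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attribute_words(words, turns):
    """Give each word to the turn it overlaps most, then join runs by speaker."""
    lines = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w.start, w.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            speaker = "Unknown"  # word falls outside every diarized turn
        else:
            speaker = best.speaker
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(w.text)
        else:
            lines.append((speaker, [w.text]))
    return [(speaker, " ".join(texts)) for speaker, texts in lines]


words = [
    Word("Thanks", 0.0, 0.4), Word("for", 0.4, 0.6), Word("coming", 0.6, 1.0),
    Word("Happy", 1.2, 1.6), Word("to", 1.6, 1.7), Word("be", 1.7, 1.9),
    Word("here", 1.9, 2.3),
]
turns = [Turn("Speaker 1", 0.0, 1.1), Turn("Speaker 2", 1.1, 2.4)]

for speaker, text in attribute_words(words, turns):
    print(f"{speaker}: {text}")
```

Words that straddle a turn boundary go to whichever turn covers more of them, which is one reasonable choice among several. Those boundary words are worth checking against the audio.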

How does speaker diarization work?

Most systems run a modular pipeline (Park et al., 2022). First, voice activity detection separates speech from non-speech. The speech is cut into short segments, and each segment is turned into an embedding vector – a numeric fingerprint of the voice in that clip. A clustering stage then groups those vectors and labels each group as a speaker.
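The stages above can be sketched in a few lines. This is a toy illustration under loud assumptions: the energy threshold stands in for a real voice activity detector, the two-dimensional vectors stand in for learned voice embeddings, and the greedy cosine-similarity grouping is a simplified substitute for the clustering methods production systems use. All function names and the 0.8 threshold are made up for the example.

```python
import math


def detect_speech(frame_energies, threshold):
    """Toy voice activity detection: a frame is speech if its energy exceeds the threshold."""
    return [energy > threshold for energy in frame_energies]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_segments(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar existing cluster, or open a new one.

    Each cluster keeps the running sum of its members, which points in the
    same direction as their mean, so cosine similarity is unaffected.
    """
    cluster_sums = []
    labels = []
    for emb in embeddings:
        scores = [cosine(emb, c) for c in cluster_sums]
        if scores and max(scores) >= threshold:
            k = scores.index(max(scores))
            cluster_sums[k] = [c + e for c, e in zip(cluster_sums[k], emb)]
        else:
            k = len(cluster_sums)
            cluster_sums.append(list(emb))
        labels.append(f"Speaker {k + 1}")
    return labels


# Hypothetical per-frame energies: silence, speech, speech, silence.
print(detect_speech([0.01, 0.4, 0.5, 0.02], threshold=0.1))

# Hypothetical embeddings for five speech segments, in time order.
segment_embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (1.0, 0.0), (0.2, 0.9)]
print(cluster_segments(segment_embeddings))
```

Note that the number of speakers is never passed in. It falls out of how many clusters the threshold produces, which is why a badly chosen threshold over- or under-splits speakers.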

A newer approach collapses that whole chain into one model. End-to-end neural diarization (EEND) performs every step inside a single neural network instead of stitching separate stages together (Park et al., 2022). Both routes target the same output: time-stamped speaker turns you can lay over the transcript.
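Whichever route produces them, the turns are just intervals with a label. As a rough sketch, assume a model hands back a per-frame, per-speaker on/off grid, which is how end-to-end systems are commonly described. The frames_to_turns function and the 0.5-second frame length below are assumptions made for the example.

```python
def frames_to_turns(activity, frame_seconds):
    """Convert a frame-by-speaker activity grid into (label, start, end) turns.

    activity[t][s] is truthy when speaker s is active in frame t. Two
    speakers can be active in the same frame, so turns may overlap.
    """
    turns = []
    if not activity:
        return turns
    for s in range(len(activity[0])):
        start = None
        for t, row in enumerate(activity):
            if row[s] and start is None:
                start = t  # speaker s starts talking
            elif not row[s] and start is not None:
                turns.append((f"Speaker {s + 1}", start * frame_seconds, t * frame_seconds))
                start = None
        if start is not None:  # still talking when the recording ends
            turns.append((f"Speaker {s + 1}", start * frame_seconds, len(activity) * frame_seconds))
    return sorted(turns, key=lambda turn: turn[1])


grid = [
    [1, 0],
    [1, 0],
    [1, 1],  # both speakers active: overlapping speech
    [0, 1],
    [0, 1],
]
print(frames_to_turns(grid, frame_seconds=0.5))
# [('Speaker 1', 0.0, 1.5), ('Speaker 2', 1.0, 2.5)]
```

Because each speaker gets its own column, this representation can express two people talking at once. A pipeline that assigns exactly one label per segment cannot.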

You don't build any of this yourself. If you just want the labeled turns on your own file, run diarization on the recording and read the result against the audio.

How accurate is speaker diarization?

Accuracy is scored as Diarization Error Rate (DER) – the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total speech time in the reference (Ryant et al., DIHARD III). Lower is better, and because false alarms count against you, DER can exceed 100%. On clean, one- or two-party audio, DIHARD III reported median DER below 10%. On the hardest domains – meeting speech, web videos, and restaurant audio – median DER ran from about 35% to 45%.
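A small worked example shows how the three error types add up. This is a simplified frame-level sketch, not a replacement for a real scoring tool: it assumes the system's labels have already been matched to the reference speakers, and it ignores any collar. The function name and the ten-frame recording are invented for illustration.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER, assuming hypothesis labels are already mapped to reference labels.

    reference and hypothesis are equal-length lists holding one set of
    active speakers per frame.
    """
    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        correct = len(ref & hyp)
        missed += max(0, len(ref) - len(hyp))          # speech the system did not detect
        false_alarm += max(0, len(hyp) - len(ref))     # speech the system invented
        confusion += min(len(ref), len(hyp)) - correct  # speech given to the wrong speaker
        total += len(ref)                               # reference speaker-frames
    if total == 0:
        raise ValueError("reference contains no speech")
    return {
        "missed": missed,
        "false_alarm": false_alarm,
        "confusion": confusion,
        "total": total,
        "der": (missed + false_alarm + confusion) / total,
    }


# Ten frames: A speaks, A and B overlap for two frames, B speaks, then silence.
reference = [{"A"}] * 4 + [{"A", "B"}] * 2 + [{"B"}] * 3 + [set()]
# The system misses B during the overlap, confuses one frame, and labels the silence as B.
hypothesis = [{"A"}] * 4 + [{"A"}] * 2 + [{"A"}] + [{"B"}] * 2 + [{"B"}]

result = diarization_error_rate(reference, hypothesis)
print(result)  # missed=2, false_alarm=1, confusion=1, total=11
print(f"DER = {result['der']:.1%}")  # DER = 36.4%
```

Half of the error here comes from the two overlapped frames, even though the system never mislabeled anyone in them. Real scoring tools also search for the best mapping between system labels and reference speakers before counting confusion.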

The thing that breaks diarization is people talking over each other. Even with a forgiveness collar (a short tolerance window around each reference speaker boundary that is left out of scoring), missed detection and false alarms from overlapping speech are the main error source, 'twice as high as speaker confusion' (Bredin & Laurent, 2021). That goes a long way toward explaining why crosstalk-heavy recordings transcribe badly.

So the practical lever is the audio, not the software. Record speakers apart where you can, then correct the overlaps by hand – the same discipline behind transcribing multiple speakers well.

Is speaker diarization the same as speaker recognition?

No. Diarization requires no prior knowledge of the speakers – not their real identity, and not even how many are in the room (Park et al., 2022). It answers 'who spoke when' with anonymous labels. Speaker recognition, verification, and identification do the opposite: they match a voice against a known, enrolled identity.

That distinction matters in practice. Diarization hands you Speaker 1 and Speaker 2; you map those to real names yourself, from context or from the intros you recorded. Nothing in the diarization step knows who the people are, and nothing enrolls their voices for later matching.
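The mapping step is deliberately manual, and it can be as simple as a lookup table you fill in after listening to the intros. A minimal sketch, with hypothetical names and a made-up rename_speakers helper:

```python
def rename_speakers(turns, names):
    """Swap anonymous labels for names you supply; unmapped labels are left as they are."""
    return [(names.get(label, label), start, end) for label, start, end in turns]


turns = [
    ("Speaker 1", 0.0, 12.4),
    ("Speaker 2", 12.4, 30.1),
    ("Speaker 3", 30.1, 33.0),
]
# Filled in by a person from context; nothing here inspects the voices.
names = {"Speaker 1": "Interviewer", "Speaker 2": "Dr. Example"}

print(rename_speakers(turns, names))
# [('Interviewer', 0.0, 12.4), ('Dr. Example', 12.4, 30.1), ('Speaker 3', 30.1, 33.0)]
```

Leaving unmapped labels untouched keeps an unidentified voice visibly anonymous instead of silently merging it into a named speaker.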

Does speaker diarization raise privacy concerns?

Diarization itself identifies no one, but the voice data around it can. Under GDPR, biometric data means personal data resulting from specific technical processing of a person's physical, physiological or behavioural characteristics that allows or confirms their unique identification (Art. 4(14)). Article 9 then makes biometric data used to uniquely identify someone a special category whose processing is prohibited absent an exception. Illinois' BIPA names a voiceprint a biometric identifier outright.

The legal line tracks the technical one. Diarization's anonymous 'Speaker 1' is not a voiceprint matched to a named person. The regulated territory starts when a workflow enrolls voices to recognize identities – speaker recognition, not diarization.

How you represent those turns still isn't neutral. Transcription is 'a powerful act of representation' whose choices can shape research findings (Oliver, Serovich & Mason, 2005). Keep the turns tied to timestamped output so every attribution stays auditable against the audio.

Tips from people who do this a lot

A separated recording beats every software setting – overlapping speech drives most of the error, so isolate voices at capture.
Don't trust the speaker count blindly. Diarization infers how many people are talking; verify it and fix over- or under-splitting.
Anonymous labels are not voiceprints. Diarization matches no identities, so it isn't speaker recognition and doesn't enroll anyone's voice.
Expect the worst accuracy on meetings, web video, and noisy rooms – those are the domains where measured error rates climb to 35–45%.
Spot-check the overlaps first. Crosstalk is where missed detection and false alarms cluster, so that's where hand-correction pays off most.
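If your diarization output includes overlapping turns, a short script can list the crosstalk regions so you know where to listen first. This is a sketch that assumes turns arrive as (label, start, end) tuples in seconds. The find_overlaps function is a hypothetical helper, and not every tool reports overlaps at all.

```python
def find_overlaps(turns):
    """Return (start, end, speakers) for each pair of different-speaker turns that overlap."""
    overlaps = []
    ordered = sorted(turns, key=lambda turn: turn[1])
    for i, (spk_a, start_a, end_a) in enumerate(ordered):
        for spk_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start time, so no later turn can overlap this one
            if spk_a != spk_b:
                overlaps.append((start_b, min(end_a, end_b), (spk_a, spk_b)))
    return overlaps


turns = [
    ("Speaker 1", 0.0, 5.0),
    ("Speaker 2", 4.2, 9.0),
    ("Speaker 1", 9.5, 12.0),
    ("Speaker 2", 11.0, 13.0),
]
for start, end, speakers in find_overlaps(turns):
    print(f"{start:.1f}s to {end:.1f}s: {' and '.join(speakers)}")
```

Each reported region is a timestamp range to replay against the audio, which keeps the hand-correction focused on the places where errors are most likely.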

What is speaker diarization – questions, answered

What does speaker diarization mean?

It means labeling a recording by speaker – 'who spoke when' – so you know which voice owns each stretch of audio. It assigns anonymous labels like Speaker 1 and Speaker 2, and it runs separately from speech recognition, which is the step that turns speech into written words.

What is Diarization Error Rate?

DER is the standard accuracy metric: the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total speech time in the reference. In the DIHARD III challenge, median DER fell below 10% on clean one- or two-party audio but rose to roughly 35–45% on meetings, web video, and restaurant recordings.

Is speaker diarization the same as speaker recognition?

No. Diarization needs no prior knowledge of the speakers – not their identity, not even how many are talking – and outputs anonymous labels. Speaker recognition and verification instead match a voice to a known, enrolled identity. Diarization tells you who spoke when, never who the person is.

Why does diarization get speakers wrong?

Overlapping speech is the main reason. When two people talk at once, systems miss speech or raise false alarms, and even with a forgiveness collar those errors run about twice as high as speaker confusion. Recording speakers on separate channels is the most effective fix.

Is a voiceprint the same as diarization?

No. A voiceprint is a biometric identifier – named as such under Illinois BIPA and treated as special-category data under GDPR when used to uniquely identify someone. Diarization uses anonymous 'Speaker 1' labels and matches no identities, so it isn't voice biometrics on its own.

References

1. Park et al. (2022), A Review of Speaker Diarization: Recent Advances with Deep Learning – Computer Speech & Language (Elsevier)
2. Rich Transcription Evaluation – 'Who Spoke When' speaker diarization – National Institute of Standards and Technology (NIST)
3. Ryant et al., The Third DIHARD Diarization Challenge (DER by domain) – DIHARD III organizers / arXiv
4. Bredin & Laurent (2021), End-to-end speaker segmentation for overlap-aware resegmentation – ISCA Interspeech / arXiv
5. GDPR Article 4(14) – definition of biometric data – Regulation (EU) 2016/679 (European Parliament & Council)
6. GDPR Article 9(1) – special categories of personal data – Regulation (EU) 2016/679 (European Parliament & Council)
7. Biometric Information Privacy Act, 740 ILCS 14/10 (voiceprint definition) – Illinois General Assembly
8. Oliver, Serovich & Mason (2005), Constraints and Opportunities with Interview Transcription – Social Forces (Oxford University Press)
