What is speaker diarization?
Speaker diarization labels an audio or video recording by speaker identity – in the field's own shorthand, 'who spoke when' (Park et al., 2022). It doesn't give you the words. It gives you the turns: this stretch is Speaker 1, that stretch is Speaker 2. The task is settled enough to carry a standards-body name – NIST's Rich Transcription evaluations call it 'Who Spoke When'.
Diarization began as a front end for automatic speech recognition (ASR), then gained its own value as a standalone task (Park et al., 2022). In a finished transcript you read the two together: ASR supplies the words, diarization supplies the speaker turns. For an interview or a panel, the diarization is what makes a line attributable to a source, which is why it sits at the center of the two-speaker attributable-quote workflow.
How does speaker diarization work?
Most systems run a modular pipeline (Park et al., 2022). First, voice activity detection separates speech from non-speech. The speech is cut into short segments, and each segment is turned into an embedding vector – a numeric fingerprint of the voice in that clip. A clustering stage then groups those vectors and labels each group as a speaker.
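To make the clustering stage concrete, here is a minimal sketch. It assumes voice activity detection and embedding extraction have already run, and it uses made-up 2-D vectors where real embeddings are much longer. The greedy threshold clustering and the names `cluster_segments` and `threshold` are illustrative stand-ins, not what any particular system does; production pipelines generally use more robust clustering methods.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_segments(embeddings, threshold=0.8):
    """Greedy clustering (an illustrative simplification).

    Each embedding joins the most similar existing cluster if the cosine
    similarity to that cluster's centroid is at least `threshold`;
    otherwise it opens a new cluster. Returns one anonymous label per segment.
    """
    centroids = []  # running mean vector per cluster
    counts = []     # number of segments per cluster
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
        labels.append(f"Speaker {best + 1}")
    return labels


# Hypothetical segments (start, end in seconds) and toy embeddings.
segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.5), (4.5, 6.0)]
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.95, 0.05)]

labels = cluster_segments(embeddings)
for (start, end), label in zip(segments, labels):
    print(f"{start:>4.1f}-{end:<4.1f} {label}")
```

Note that the labels come out as anonymous cluster numbers: nothing in this step knows or needs the speakers' names.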
A newer approach collapses that whole chain into one model. End-to-end neural diarization (EEND) performs every step inside a single neural network instead of stitching separate stages together (Park et al., 2022). Both routes target the same output: time-stamped speaker turns you can lay over the transcript.
You don't have to build any of this yourself. If you just want the labeled turns on your own file, run diarization on the recording and check the result against the audio.
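Laying time-stamped turns over a transcript can be sketched in a few lines. This is a minimal example under stated assumptions: the ASR output is taken to be a list of `(start, end, word)` tuples and the diarization output a list of `(start, end, label)` tuples, both hypothetical formats chosen for illustration. Each word is assigned to the turn that contains its midpoint, which is one simple rule among several possible ones.

```python
def speaker_at(turns, t):
    """Return the label of the first turn containing time t, else 'Unknown'."""
    for start, end, speaker in turns:
        if start <= t < end:
            return speaker
    return "Unknown"


def attribute_words(words, turns):
    """Group consecutive words by the speaker active at each word's midpoint."""
    lines = []  # list of (speaker, [words...])
    for start, end, text in words:
        speaker = speaker_at(turns, (start + end) / 2)
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


# Hypothetical diarization turns and ASR word timings (seconds).
turns = [(0.0, 2.0, "Speaker 1"), (2.0, 4.0, "Speaker 2")]
words = [
    (0.1, 0.5, "How"), (0.6, 1.0, "are"), (1.1, 1.6, "you?"),
    (2.2, 2.7, "Fine,"), (2.8, 3.4, "thanks."),
]

transcript = attribute_words(words, turns)
for line in transcript:
    print(line)
```

Words that fall outside every turn are marked `Unknown` rather than guessed, so they stand out when you check the transcript against the audio.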
How accurate is speaker diarization?
Accuracy is scored as Diarization Error Rate (DER) – the sum of missed speech, false-alarm speech, and speaker-confusion time, expressed as a fraction of the total reference speech time (Ryant et al., DIHARD III). Lower is better. On clean, one- or two-party audio, DIHARD III reported median DER below 10%. On the hardest domains – meeting speech, web videos, and restaurant audio – median DER ran from about 35% to 45%.
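As a worked example of the arithmetic, DER is conventionally the three error durations added together and divided by the total reference speech time. The sketch below uses made-up durations, not figures from any evaluation, and the function name is hypothetical.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical five-minute recording: 300 s of reference speech.
der = diarization_error_rate(
    missed=12.0,       # speech the system failed to detect
    false_alarm=6.0,   # non-speech the system labeled as speech
    confusion=6.0,     # speech attributed to the wrong speaker
    total_speech=300.0,
)
print(f"DER = {der:.1%}")
```

Because false alarms count toward the numerator but not the denominator, DER can exceed 100% on very poor output.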
The thing that breaks diarization is people talking over each other. Even with a forgiveness collar (a short window around each turn boundary that is excluded from scoring), missed detection and false alarms from overlapping speech are the main error source, 'twice as high as speaker confusion' (Bredin & Laurent, 2021). That goes a long way toward explaining why crosstalk-heavy recordings produce bad transcripts.
So the practical lever is the audio, not the software. Record speakers apart where you can, then correct the overlaps by hand – the same discipline behind transcribing multiple speakers well.
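If your diarization output can contain overlapping turns, a short script can list the stretches worth checking by hand. This is a sketch, assuming turns arrive as `(start, end, label)` tuples in seconds; that format and the name `find_overlaps` are illustrative. Many tools emit non-overlapping turns only, in which case crosstalk will not show up this way.

```python
def find_overlaps(turns):
    """Return (start, end, speaker_a, speaker_b) for every stretch where
    two turns from different speakers overlap in time."""
    ordered = sorted(turns)  # sort by start time
    overlaps = []
    for i, (s1, e1, sp1) in enumerate(ordered):
        for s2, e2, sp2 in ordered[i + 1:]:
            if s2 >= e1:
                break  # later turns start even later, so no more overlap
            if sp1 != sp2:
                overlaps.append((s2, min(e1, e2), sp1, sp2))
    return overlaps


# Hypothetical turns: Speaker 2 cuts in one second before Speaker 1 finishes.
review_turns = [
    (0.0, 5.0, "Speaker 1"),
    (4.0, 7.0, "Speaker 2"),
    (7.0, 9.0, "Speaker 1"),
]

overlaps = find_overlaps(review_turns)
for start, end, a, b in overlaps:
    print(f"Check {start:.1f}-{end:.1f}s: {a} and {b} overlap")
```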
Is speaker diarization the same as speaker recognition?
No. Diarization requires no prior knowledge of the speakers – not their real identity, and not even how many are in the room (Park et al., 2022). It answers 'who spoke when' with anonymous labels. Speaker recognition, verification, and identification do the opposite: they match a voice against a known, enrolled identity.
That distinction matters in practice. Diarization hands you Speaker 1 and Speaker 2; you map those to real names yourself, from context or from the intros you recorded. Nothing in the diarization step knows who the people are, and nothing enrolls their voices for later matching.
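The mapping step is deliberately manual, and in code it amounts to a lookup table you fill in yourself. A minimal sketch, with hypothetical names and a `(start, end, label)` turn format assumed for illustration:

```python
def apply_names(turns, names):
    """Replace anonymous labels with real names where a mapping exists.

    Labels missing from `names` are left as-is, so unmapped speakers
    remain visibly anonymous instead of being guessed.
    """
    return [(start, end, names.get(label, label)) for start, end, label in turns]


# Hypothetical turns and a mapping you worked out from the recorded intros.
anonymous_turns = [
    (0.0, 4.2, "Speaker 1"),
    (4.2, 9.8, "Speaker 2"),
    (9.8, 12.0, "Speaker 3"),
]
names = {"Speaker 1": "Interviewer", "Speaker 2": "Dr. Example"}

named_turns = apply_names(anonymous_turns, names)
for start, end, who in named_turns:
    print(f"{start:>5.1f}-{end:<5.1f} {who}")
```

The mapping lives outside the diarization output, which keeps the human judgment about who is who separate from the machine's anonymous labels.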
Does speaker diarization raise privacy concerns?
Diarization itself identifies no one, but the voice data around it can. Under GDPR, biometric data means personal data resulting from specific technical processing of a person's physical, physiological, or behavioural characteristics that allows or confirms their unique identification (Art. 4(14)). Article 9 then makes biometric data processed to uniquely identify someone a special category whose processing is prohibited absent an exception. Illinois' Biometric Information Privacy Act (BIPA) lists a voiceprint as a biometric identifier outright.
The legal line tracks the technical one. Diarization's anonymous 'Speaker 1' is not a voiceprint matched to a named person. The regulated territory starts when a workflow enrolls voices to recognize identities – speaker recognition, not diarization.
How you represent those turns still isn't neutral. Transcription is 'a powerful act of representation' whose choices can shape research findings (Oliver, Serovich & Mason, 2005). Keep the turns tied to timestamped output so every attribution stays auditable against the audio.