Speaker diarization explained: why 'who said what' is the most underrated AI feature
A transcript without speaker attribution is a wall of accurate text. Speaker diarization assigns each sentence to the voice that spoke it — turning a recording into a conversation you can analyse, reference, and act on. Here's how it works and where it matters most.
- Speaker diarization partitions a recording by voice — producing 'Speaker 1: ...' / 'Speaker 2: ...' attribution rather than an undifferentiated text block.
- Clinical documentation, qualitative research, group project meetings, and legal compliance all depend on knowing who said what — not just what was said.
- Cloud diarization has historically had more compute available; on-device diarization on the Apple Neural Engine now handles 2–4 speakers accurately in real time without an internet connection.
- On-device diarization sends nothing about your speakers' voices to a server — essential for clinical, therapeutic, research, and legal contexts.
You receive a transcript from last week's team meeting. It's 4,000 words. Accurate words — every sentence is there. But it reads like a monologue: no indication of who asked the question that changed the direction of the discussion, no indication of who committed to the deadline that now matters, no indication of whether the objection that got overruled came from the lead developer or the project manager. To find out, you'll need to re-listen to the recording.
This is the problem that speaker diarization solves. And it is why speaker diarization — "who said what" attribution in a transcript — is the feature that most separates a useful note from a wall of text.
What speaker diarization is
Diarization (from the Latin diarium, a daily record) refers to the process of partitioning an audio stream by speaker identity. A diarization system takes a recording and produces output that identifies distinct voices and assigns each segment of speech to the voice that produced it.
The result: instead of "The team discussed the deadline," you get "Speaker 1 (confirmed as the project lead): We need to hit the 15th. / Speaker 2 (confirmed as engineering): That's not realistic given the current scope. / Speaker 1: Then we cut scope. What's removable?"
The conversation is the same. The attribution changes what you can do with it.
A diarized transcript is queryable: what did the consultant say about this patient's management? What did the registrar add? What did the patient report about their symptoms? With attribution, you can isolate and review any participant's contribution independently.
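As a rough illustration of what "queryable" means here, the sketch below groups attributed turns by speaker so that one participant's contribution can be reviewed on its own. The `(speaker, text)` tuple layout and the `group_by_speaker` helper are assumptions made for this example, not the export format of any particular app.

```python
from collections import defaultdict

def group_by_speaker(turns):
    """Collect every utterance under the speaker label that produced it."""
    by_speaker = defaultdict(list)
    for speaker, text in turns:
        by_speaker[speaker].append(text)
    return dict(by_speaker)

# Hypothetical attributed turns, reusing the deadline exchange above.
turns = [
    ("Speaker 1", "We need to hit the 15th."),
    ("Speaker 2", "That's not realistic given the current scope."),
    ("Speaker 1", "Then we cut scope. What's removable?"),
]

by_speaker = group_by_speaker(turns)
print(by_speaker["Speaker 2"])
```

The same grouping would let you pull out only the patient's statements, or only the consultant's, from a longer clinical transcript.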
How it works
Speaker diarization uses a combination of voice activity detection and voice embedding comparison. The system first identifies when speech is occurring versus background noise (voice activity detection). It then generates a mathematical representation — an embedding — of the acoustic characteristics of each detected voice: fundamental frequency, formant patterns, spectral envelope, speaking cadence. These embeddings are compared across the recording to cluster speech segments that came from the same voice.
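The clustering step can be sketched in a few lines. This is a simplified illustration, assuming each speech segment has already been turned into a non-zero embedding vector by some upstream model. The greedy cosine-similarity approach, the `cluster_segments` name, the 0.8 threshold, and the toy three-dimensional vectors are all assumptions for the example; production systems typically use learned embeddings with hundreds of dimensions and more robust clustering.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cluster_segments(embeddings, threshold=0.8):
    """Greedily assign each segment embedding to a speaker index.

    A segment joins the most similar existing speaker if the similarity
    reaches the threshold; otherwise it starts a new speaker.
    """
    centroids = []  # running mean embedding per speaker
    counts = []     # number of segments merged into each centroid
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine_similarity(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
    return labels

# Toy embeddings: segments 0 and 2 resemble each other, as do 1 and 3.
segments = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.85, 0.15, 0.05],
    [0.05, 0.95, 0.0],
]
labels = cluster_segments(segments)
print([f"Speaker {label + 1}" for label in labels])
```

The threshold is the main design lever in a sketch like this: set it too low and similar voices merge into one speaker, set it too high and one person is split across several labels.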
The result is speaker-labelled segments: "Speaker 1 from 00:00:12 to 00:00:45, Speaker 2 from 00:00:45 to 00:01:20, Speaker 1 from 00:01:20 to 00:01:52." Combined with the transcript, this produces the attributed output that makes a multi-speaker recording analytically useful.
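Combining speaker segments with a transcript might look like the sketch below. It assumes the transcription step supplies a start and end time for each word, and it assigns each word to whichever speaker segment contains the word's midpoint. The `SpeakerSegment` and `Word` types, the `attribute_words` helper, and the sample words are hypothetical; the segment boundaries reuse the timestamps quoted above, converted to seconds.

```python
from dataclasses import dataclass

@dataclass
class SpeakerSegment:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float

@dataclass
class Word:
    text: str
    start: float
    end: float

def attribute_words(words, segments):
    """Group timestamped words into speaker turns by word midpoint."""
    turns = []
    for word in words:
        mid = (word.start + word.end) / 2
        speaker = next(
            (s.speaker for s in segments if s.start <= mid < s.end),
            "Unknown",  # word falls outside every labelled segment
        )
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(word.text)
        else:
            turns.append((speaker, [word.text]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in turns]

segments = [
    SpeakerSegment("Speaker 1", 12.0, 45.0),
    SpeakerSegment("Speaker 2", 45.0, 80.0),
    SpeakerSegment("Speaker 1", 80.0, 112.0),
]
words = [
    Word("We", 12.0, 12.4), Word("agree.", 12.4, 13.0),
    Word("Do", 45.2, 45.4), Word("we?", 45.4, 45.9),
    Word("Yes.", 80.5, 81.0),
]
transcript = attribute_words(words, segments)
print("\n".join(transcript))
```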
Cloud diarization systems have historically had more compute available for this process, running larger neural models on powerful servers. On-device diarization was computationally limited on older iPhone hardware. The Apple Neural Engine changed this: first shipped in the A11 Bionic, it reached 16 cores in the A14 Bionic (iPhone 12) and has been expanded further in the A15, A16, A17, and M-series chips. The Neural Engine's matrix multiplication throughput is sufficient to run voice embedding and clustering models in real time on the device.
Kuulo's diarization runs entirely on-device, in real time, with no internet connection. The voice profiles are computed locally. The attributed transcript is produced locally. Nothing about your speakers' voices is transmitted anywhere.
Where diarization matters most
Clinical environments
A ward round typically involves a consultant, one or more registrars, a junior doctor, a nurse, and the patient. The consultant's clinical decisions, the registrar's assessment, the nurse's observations, and the patient's reported history are four distinct data streams that a clinical note must capture — and ideally attribute.
Without diarization, a ward round transcript is unusable for clinical documentation: you have a record of what was said, but not who was responsible for each clinical statement. The consultant's management decision and the patient's symptom report are equally unattributed.
With diarization, you can isolate the consultant's teaching, extract the patient's history in their own words, and attribute the management plan to the clinical decision-maker who made it. The resulting SOAP note is more clinically accurate because its subjective, objective, assessment, and plan sections stay distinct, and each section is populated from the speech of the appropriate participant.
Research interviews
Qualitative research transcripts require speaker attribution for analysis. The interviewer's questions structure the data collection; the participant's responses are the data. A transcript that conflates the two requires significant preparation work before analysis can begin — every utterance must be manually labelled with the speaker who produced it.
For a PhD researcher conducting 30 interviews of 90 minutes each, manual speaker attribution is a significant additional time cost on top of the transcription work. Diarization automates this: the interviewer and participant voices are separated by voice profile, and the transcript arrives pre-labelled.
This is particularly valuable in semi-structured interviews where probing follow-up questions are extensive. The distinction between "how does that make you feel?" (interviewer, not data) and "it makes me feel completely powerless" (participant, primary data) matters for qualitative analysis. Diarization maintains this distinction automatically.
Group work and team meetings
When a group project decision is attributed to "someone," there is a question later about who committed to what. When a meeting's action items are unattributed in the notes, nobody is accountable.
Diarization solves the accountability problem at the point of capture rather than requiring a follow-up "who said they'd do X?" conversation. The transcript shows: "Speaker 2 (identified as the marketing lead): I'll have the campaign draft to you by Thursday." That commitment is documented, attributed, and searchable.
For student group projects, committee meetings, and team retrospectives — all contexts where attribution affects outcomes — diarized notes produce a significantly different record from unattributed transcripts.
Legal and compliance contexts
"Who authorized the transaction?" and "Who gave the instruction to proceed?" are questions that compliance functions ask after things go wrong. A recorded meeting that produces an attributed transcript answers these questions. An unattributed transcript requires a recording re-listen to establish who said what.
For teams operating in regulated contexts — financial services, healthcare, legal practice — diarized records create an attribution layer that is valuable for both compliance documentation and dispute resolution.
On-device diarization vs. cloud diarization
Most cloud meeting tools include diarization: Otter, Fireflies, Granola, and tl;dv all attribute speakers in their transcripts. The standard against which on-device systems are compared is cloud diarization, which has historically had access to more compute and larger models.
The current picture on iPhone:
Speaker count. On-device diarization handles 2–4 speakers per recording reliably. Cloud systems generally handle larger speaker counts more accurately. For a one-on-one interview, a two-person consultation, or a small meeting, on-device diarization is accurate and reliable. For a large conference call with 12 participants, cloud systems have an advantage.
Voice similarity. Distinguishing speakers with similar voices — same sex, similar age, similar accent — is harder than distinguishing clearly different voices. Both cloud and on-device systems perform better when voice profiles are more distinct.
Background noise. Noisy environments (busy wards, open-plan offices, outdoor locations) affect diarization accuracy for both cloud and on-device systems. On-device systems, running at the limits of available compute, may be more affected by challenging acoustic conditions than cloud systems running large server-side models.
Privacy. On-device diarization sends nothing about your speakers' voices to a server. Cloud diarization requires voice audio to be transmitted for processing. For clinical, legal, therapeutic, and research contexts, this distinction is significant.
Offline operation. On-device works without internet. Cloud requires connectivity at transcription time.
For the majority of use cases (two-person interviews, small clinical encounters, team meetings of up to four people), on-device diarization on current iPhone hardware is accurate and reliable. For large multi-party calls with hard-to-distinguish voices, where the content is permitted to be processed in the cloud, cloud diarization has practical advantages.
How to get the most from on-device diarization
Consistent microphone position. The diarization model builds voice profiles from the audio it receives. Placing the phone in a consistent location relative to all speakers — face-up on a table in the centre of the room — gives each voice similar acoustic representation and improves attribution accuracy.
A brief voice enrollment. In a two-person recording, a sentence of context before the main conversation begins ("This is Dr. Smith speaking with the patient, Mr. Jones, about his presenting complaint") gives the diarization model clear baseline audio for the first speaker's voice profile; a short reply from the other participant does the same for theirs. The system then applies these profiles across the full recording.
Speaker identification at the start. When you review the attributed transcript, you'll see "Speaker 1" and "Speaker 2" rather than names. A brief note at the start of the recording — or a label applied during transcript review — maps the speaker numbers to the participants. "Speaker 1 = Consultant, Speaker 2 = Registrar, Speaker 3 = Patient" takes 10 seconds to add and transforms the review experience.
Limit simultaneous speakers. If multiple people speak at the same time — as happens in lively group discussions — diarization accuracy decreases. Facilitating clear turn-taking in a meeting produces better attributed transcripts from both on-device and cloud systems.
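The speaker-number mapping described in the tips above is simple to apply to an exported transcript. The sketch below assumes each transcript line has the form "Speaker N: text"; the `apply_speaker_names` helper and the sample lines are hypothetical and do not represent a feature or export format of any specific app.

```python
def apply_speaker_names(lines, names):
    """Replace generic speaker labels with participant names or roles.

    Labels with no entry in `names` are left unchanged.
    """
    relabelled = []
    for line in lines:
        label, sep, text = line.partition(": ")
        relabelled.append(f"{names.get(label, label)}{sep}{text}")
    return relabelled

names = {
    "Speaker 1": "Consultant",
    "Speaker 2": "Registrar",
    "Speaker 3": "Patient",
}
lines = [
    "Speaker 1: How has the pain been overnight?",
    "Speaker 3: Worse when I lie flat.",
    "Speaker 4: I'll update the chart.",
]
named = apply_speaker_names(lines, names)
print("\n".join(named))
```

Leaving unmapped labels untouched, rather than raising an error, keeps the transcript readable when an unexpected extra voice appears in the recording.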
Speaker attribution is the feature that turns a transcript into a record of who did and said what. For clinical documentation, research analysis, legal compliance, and group accountability, it is not optional — it is the functional core of what makes a multi-speaker transcript worth having. The fact that it now runs on-device, without any voice data leaving the device, is what makes it appropriate for the contexts where it matters most.
Frequently asked questions
What is speaker diarization?
Speaker diarization is the process of partitioning an audio recording by speaker identity — determining which person said each segment of speech and labelling the transcript accordingly. The result is an attributed transcript showing 'Speaker 1: [text]' / 'Speaker 2: [text]' rather than a continuous block of unattributed text.
Which apps can identify different speakers in a recording?
Cloud tools including Otter, Fireflies, and tl;dv include speaker diarization in their transcripts. Kuulo is the only app that performs speaker diarization entirely on-device, without internet — relevant for clinical, legal, and research content that cannot be sent to cloud servers.
Does speaker diarization work offline?
Most transcription apps require cloud processing for diarization. Kuulo runs speaker diarization entirely on the Neural Engine of an iPhone — it works offline, with no audio transmitted for processing.
How accurate is on-device speaker diarization?
On-device diarization on current Apple Silicon iPhones handles 2–4 distinct speakers accurately in normal acoustic conditions. It performs best when speakers have clearly distinct voices and take clear turns. Large groups with overlapping speech and similar voices are more challenging; cloud systems have an accuracy advantage for those scenarios.