Why removing names isn't anonymization
Deleting names is the easy 20%. The hard part is that a person can still be singled out by the details around the name. Just a 5-digit ZIP code, gender, and date of birth uniquely identify roughly 87% of the US population (Sweeney, 2000), a figure later recomputed at about 63% on 2000 census data (Golle, 2006). Either way, three ordinary facts, no name required, can be enough.
In qualitative research this leak has a name: deductive disclosure. It happens when the traits of an individual or group make them identifiable in a research report (Kaiser, 2009), even after the obvious identifiers are gone. A quote about 'the only woman on the board of a 12-person Boise startup' names no one and still points at exactly one person.
This guide assumes you already have a clean, corrected transcript. If you're not there yet, start with how to transcribe an interview, and finish both the AI first pass and the manual cleanup before you come back. Anonymization is a separate, later pass, and you run it on a copy, never on the original.
Is your transcript anonymized or only pseudonymized?
The distinction is legal, not cosmetic. Under GDPR, pseudonymization means processing personal data so that it can no longer be attributed to a specific person without additional information that is kept separately (GDPR Article 4(5)). Swapping names for 'P07' while keeping a code key is pseudonymization, and pseudonymized data is still personal data (GDPR Recital 26).
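To make the 'P07' pattern concrete, here is a minimal Python sketch. The function name, the sample sentence, and the alias list are all hypothetical, and a real pass would still need a human read-through for nicknames and misspellings.

```python
import re

def pseudonymize(text, aliases_by_code):
    """Replace every known alias with its participant code."""
    pairs = [
        (alias, code)
        for code, aliases in aliases_by_code.items()
        for alias in aliases
    ]
    # Longest alias first, so "Ana Ruiz" is handled before the bare "Ana".
    pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    for alias, code in pairs:
        text = re.sub(r"\b" + re.escape(alias) + r"\b", code, text)
    return text

# The code key: store this separately from the transcript it unlocks.
code_key = {"P07": ["Ana Ruiz", "Ana"]}

transcript = "Ana Ruiz said the audit ran late. Ana blamed the vendor."
pseudonymized = pseudonymize(transcript, code_key)
print(pseudonymized)  # P07 said the audit ran late. P07 blamed the vendor.
```

As long as code_key exists anywhere, the output is pseudonymized, not anonymous.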
True anonymization is a higher bar. Recital 26 says you judge it by the means 'reasonably likely to be used' to re-identify someone, weighing cost, time, and available technology. So a code key locked in a drawer doesn't make a transcript anonymous. It makes it pseudonymized, and still regulated as personal data.
For most researchers and journalists, pseudonymization plus careful generalization is the realistic target. Full, irreversible anonymization is hard to guarantee once you keep rich quotes. Know which one you've actually achieved before you promise a source confidentiality, because the two carry different obligations.
Which identifiers do you remove to anonymize an interview transcript?
Start from an established checklist. HIPAA's Safe Harbor method lists 18 specific identifiers to remove (45 CFR 164.514(b)(2)), from names and geographic units smaller than a state through to any other unique identifying code. It's written for health data, but the list is a solid baseline for any transcript's direct identifiers.
Sort what you find into two buckets. Direct identifiers name a person outright: full names, addresses, phone numbers, email addresses, employer, ID numbers. Quasi-identifiers don't name anyone on their own, but they combine to pinpoint a person: job title, age, city, dates, employer size, a rare condition. The revised Common Rule treats information as identifiable when the subject's identity 'may readily be ascertained' (45 CFR 46.102), and those combinations are a common way that happens.
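Some direct identifiers have a regular shape, so a script can flag them for review. The sketch below is an assumption-heavy starting point: the two patterns are hypothetical and deliberately simple, they only cover email addresses and US-style phone numbers, and they will not find names, employers, or quasi-identifiers.

```python
import re

# Hypothetical, simplified patterns; real data will contain variants these miss.
DIRECT_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def flag_direct_identifiers(text):
    """Return (label, matched_text) pairs for a human to review."""
    hits = []
    for label, pattern in DIRECT_PATTERNS.items():
        hits.extend((label, match.group(0)) for match in pattern.finditer(text))
    return hits

sample = "Reach me at ana.ruiz@example.com or 208-555-0143 after the board meeting."
hits = flag_direct_identifiers(sample)
for label, value in hits:
    print(label, value)
```

Treat the output as a to-do list, not a verdict. An empty list means the patterns found nothing, not that the transcript is clean.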
If the transcript is analysis data rather than a published quote, de-identification sits inside a wider coding workflow; qualitative research transcription covers how the two fit together. Either way, cut the direct identifiers first, then decide, detail by detail, what to do with the quasi-identifiers.
Generalize the quasi-identifiers (the step people skip)
Removing direct identifiers leaves the risk that matters most. The EU's guidance on anonymization names three re-identification threats: singling out, linkability, and inference (Article 29 Working Party, WP216, 2014). Generalization is the standard defense, and it means diluting a detail by changing its scale.
In practice you widen, not delete. WP216 gives the pattern: a region rather than a city, a month or year rather than an exact date, an age band rather than a birth date. 'CFO of a 12-person Boise startup' becomes 'a finance lead at a small firm in the Mountain West.' You keep the meaning and lose the fingerprint.
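The widening pattern can be sketched in a few lines of Python. Everything named here is illustrative: the lookup table, the field names, the ten-year band width, and the sample speaker are assumptions, not values taken from WP216.

```python
from datetime import date

# Hypothetical lookup; a real project would maintain its own city-to-region table.
CITY_TO_REGION = {"Boise": "the Mountain West"}

def age_band(age, width=10):
    """Map an exact age to a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize(record):
    """Widen each quasi-identifier instead of deleting it."""
    return {
        # Unknown cities fall back to a placeholder rather than leaking through.
        "region": CITY_TO_REGION.get(record["city"], "[region withheld]"),
        "interview_year": record["interview_date"].year,
        "age_band": age_band(record["age"]),
    }

speaker = {"city": "Boise", "interview_date": date(2024, 3, 14), "age": 47}
generalized = generalize(speaker)
print(generalized)
# {'region': 'the Mountain West', 'interview_year': 2024, 'age_band': '40-49'}
```

The fallback for unknown cities is a design choice: failing closed keeps an unmapped place name out of the output.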
Apply the same generalization everywhere, then check that two generalized details don't still combine to single someone out. One over-specific line can undo the whole pass. That's why a find-and-replace on the name is never anonymization on its own.
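One rough way to run that combination check is to count how many participants share each set of generalized details, which is the idea behind k-anonymity. The source guidance does not prescribe this code. The sketch assumes you keep a small table of generalized attributes per participant, and the field names and sample rows are hypothetical.

```python
from collections import Counter

def small_groups(records, keys, k=2):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}

participants = [
    {"role": "finance lead", "region": "Mountain West", "age_band": "40-49"},
    {"role": "finance lead", "region": "Mountain West", "age_band": "40-49"},
    {"role": "board member", "region": "Mountain West", "age_band": "50-59"},
]

risky = small_groups(participants, ["role", "region", "age_band"])
print(risky)  # {('board member', 'Mountain West', '50-59'): 1}
```

A combination that appears only once is still a fingerprint within your sample, so widen it further or drop a detail. The check only compares participants with each other. It says nothing about how rare a combination is in the outside world, so it supplements a careful read rather than replacing one.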
What about direct quotes and the master copy?
Direct quotes are where anonymization and attribution collide. Rich, verbatim detail is what makes a quote worth using and exactly what makes a speaker identifiable (Kaiser, 2009). When a quote you've pulled carries identifying specifics, generalize inside brackets: '[a colleague]' for a named person, '[a large hospital]' for the employer, so the line still reads as theirs.
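If you handle many quotes, the bracket substitution can be scripted once you have decided the replacements by hand. This is a minimal sketch: the function name, the sample quote, and the names in it are invented for illustration, and it uses plain string replacement, so it only catches exact spellings.

```python
def bracket_generalize(quote, replacements):
    """Swap each identifying phrase for a bracketed generic description."""
    # Longest phrase first, so a short phrase cannot split a longer one.
    for specific in sorted(replacements, key=len, reverse=True):
        quote = quote.replace(specific, f"[{replacements[specific]}]")
    return quote

# Hypothetical quote and names, invented for illustration.
quote = "Dana told me Riverbend General would never sign off on it."
replacements = {
    "Dana": "a colleague",
    "Riverbend General": "a large hospital",
}

safe_quote = bracket_generalize(quote, replacements)
print(safe_quote)
# [a colleague] told me [a large hospital] would never sign off on it.
```

The brackets show the reader that you changed the wording, which keeps the quote honest.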
Never redact your only file. Build a clean working copy with an interview transcript formatter, run the redaction pass on that copy, and keep the un-redacted master in access-controlled storage. You'll want the original if you ever need to verify a quote or restore the true attribution.
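The copy-first habit is easy to script. The sketch below is one possible approach, not a feature of any particular tool: the function name and the '.redacted' naming scheme are assumptions, and the read-only flag only guards against accidental edits. It is not access control, so the master still belongs in restricted storage.

```python
import shutil
import stat
import tempfile
from pathlib import Path

def make_working_copy(master, workdir):
    """Copy the master transcript into workdir, then mark the master read-only."""
    workdir.mkdir(parents=True, exist_ok=True)
    working = workdir / f"{master.stem}.redacted{master.suffix}"
    shutil.copy2(master, working)  # copy first, so the copy stays writable
    master.chmod(stat.S_IRUSR)     # owner read-only: blocks accidental edits
    return working

# Demo in a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "interview_07.txt"
    master.write_text("Original, un-redacted transcript.", encoding="utf-8")
    working = make_working_copy(master, Path(tmp) / "working")
    working_name = working.name
    contents_match = (
        working.read_text(encoding="utf-8") == master.read_text(encoding="utf-8")
    )
    # Restore write permission so the temporary directory can be cleaned up.
    master.chmod(stat.S_IRUSR | stat.S_IWUSR)

print(working_name)  # interview_07.redacted.txt
```

Run every redaction step against the path the function returns, never against the master.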
Where the audio and transcript live matters for sensitive sources. Pepys never trains on your audio or text. Source media auto-deletes 30 days after upload by default, and an unclaimed anonymous job is purged in about 12 hours. The transcript itself stays until you delete it. So an off-the-record recording isn't sitting on a server indefinitely.