How to anonymize an interview transcript

A working guide for researchers, journalists, and anyone who promised a source confidentiality – what to strip, what to generalize, and where the un-redacted master should live.

The short answer

To anonymize an interview transcript, work on a copy, not the master. First strip direct identifiers – names, addresses, phone numbers, employers, ID numbers. Then generalize the quasi-identifiers that still single someone out: a city becomes a region, an exact date becomes a month, a rare job title becomes a category. Removing the name alone isn't anonymization – combined details re-identify people.

Why removing names isn't anonymization

Deleting names is the easy part. The hard part is that a person can still be singled out by the details around the name. Just a 5-digit ZIP code, gender, and date of birth uniquely identify roughly 87% of the US population (Sweeney, 2000), a figure later recomputed at about 63% on 2000 census data (Golle, 2006). Either way, three ordinary facts, no name required, can be enough.

In qualitative research this leak has a name: deductive disclosure. It happens when the traits of an individual or group make them identifiable in a research report (Kaiser, 2009), even after the obvious identifiers are gone. A quote about 'the only woman on the board of a 12-person Boise startup' names no one and still points at exactly one person.
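To make the singling-out risk concrete, here is a minimal Python sketch that counts how many records are unique on a set of quasi-identifiers. The records, field names, and the `singled_out` helper are all made up for illustration; they are not taken from any cited study.

```python
from collections import Counter

# Hypothetical participant records: no names, only quasi-identifiers.
records = [
    {"zip": "83702", "gender": "F", "dob": "1984-03-12"},
    {"zip": "83702", "gender": "F", "dob": "1984-03-12"},
    {"zip": "83702", "gender": "M", "dob": "1979-11-02"},
    {"zip": "83616", "gender": "F", "dob": "1990-07-30"},
]

def singled_out(rows, keys):
    """Return the rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if combos[tuple(row[k] for k in keys)] == 1]

unique_rows = singled_out(records, ["zip", "gender", "dob"])
print(len(unique_rows), "of", len(records), "records are unique on ZIP + gender + date of birth")
```

Any row this returns points at exactly one person in the dataset, even though no name appears anywhere in it.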

This guide assumes you already have a clean, corrected transcript. If you're not there yet, start with how to transcribe an interview, and finish the AI first pass and the manual cleanup before you begin. Anonymization is a separate, later pass, and you run it on a copy, never the original.

Is your transcript anonymized or only pseudonymized?

The distinction is legal, not cosmetic. Under GDPR, pseudonymization means processing personal data so that it can no longer be attributed to a specific person without additional information that is kept separately (GDPR Article 4(5)). Swapping names for 'P07' while keeping a code key is pseudonymization, and pseudonymized data is still personal data (GDPR Recital 26).

True anonymization is a higher bar. Recital 26 says you judge it by the means 'reasonably likely to be used' to re-identify someone, weighing cost, time, and available technology. So a code key locked in a drawer doesn't make a transcript anonymous. It makes it pseudonymized, and still regulated as personal data.

For most researchers and journalists, pseudonymization plus careful generalization is the realistic target. Full, irreversible anonymization is hard to guarantee once you keep rich quotes. Know which one you've actually achieved before you promise a source confidentiality, because the two carry different obligations.
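As a rough illustration of what pseudonymization looks like in practice, the Python sketch below swaps a known list of names for stable codes and returns the code key as a separate object. The function name, the 'P01' code format, and the sample names are assumptions made for this example; the output is pseudonymized, not anonymized, precisely because the key exists.

```python
import json
import re

def pseudonymize(text, names):
    """Replace each name with a stable code (P01, P02, ...) and return the key."""
    key = {name: f"P{i:02d}" for i, name in enumerate(names, start=1)}
    # Longest names first so a full name is replaced before a shorter name it contains.
    for name in sorted(key, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(name)}\b", key[name], text)
    return text, key

# Hypothetical transcript line and speaker names.
transcript = "Maria Lopez said Dev Patel approved it. Maria Lopez disagreed."
coded, code_key = pseudonymize(transcript, ["Maria Lopez", "Dev Patel"])

# Serialize the key so it can be stored apart from the coded transcript.
key_as_json = json.dumps(code_key, indent=2)
print(coded)
```

Because the same name always maps to the same code, the coded transcript stays readable and codable, and anyone holding `key_as_json` can reverse it.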

Which identifiers do you remove to anonymize an interview transcript?

Start from an established checklist. HIPAA's Safe Harbor method lists 18 specific identifiers to remove (45 CFR 164.514(b)(2)), from names and geographic units smaller than a state through to any other unique identifying code. It's written for health data, but the list is a solid baseline for any transcript's direct identifiers.

Sort what you find into two buckets. Direct identifiers name a person outright: full names, addresses, phone numbers, email, employer, ID numbers. Quasi-identifiers don't, but they combine to pinpoint: job title, age, city, dates, employer size, a rare condition. The revised Common Rule flags information whose subject's identity 'may readily be ascertained' (45 CFR 46.102), which is exactly those combinations.

If the transcript is analysis data rather than a published quote, de-identification sits inside a wider coding workflow; the guide to qualitative research transcription covers how the two fit together. Either way, cut the direct identifiers first, then decide, detail by detail, what to do with the quasi-identifiers.
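A first pass over direct identifiers can be partly automated. The Python sketch below replaces a hand-built list of names and employers with role labels, then masks email addresses and US-style phone numbers. The regular expressions are deliberately rough, and the labels, names, and sample line are hypothetical; treat it as a starting point that still needs a manual read, not a complete scrubber.

```python
import re

# Rough illustrative patterns; they will miss unusual formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
US_PHONE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def strip_direct_identifiers(text, labels):
    """Swap known names and employers for role labels, then mask emails and phones."""
    for term in sorted(labels, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(term)}\b", labels[term], text, flags=re.IGNORECASE)
    text = EMAIL.sub("[email]", text)
    text = US_PHONE.sub("[phone]", text)
    return text

# Hypothetical mapping from identifiers to consistent role labels.
labels = {"Dana Whitfield": "[Manager]", "Acme Health": "[employer]"}
line = "Dana Whitfield at Acme Health said to call 208-555-0142 or mail dana@acmehealth.example."
cleaned = strip_direct_identifiers(line, labels)
print(cleaned)
```

Using a label rather than an empty string keeps the sentence readable, and the same mapping applied to every file keeps the labels consistent.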

Generalize the quasi-identifiers (the step people skip)

Removing direct identifiers leaves the risk that matters most. The EU's guidance on anonymization names three re-identification threats: singling out, linkability, and inference (Article 29 Working Party, WP216, 2014). Generalization is the standard defense, and it means diluting a detail by changing its scale.

In practice you widen, not delete. WP216 gives the pattern: a region rather than a city, a month or year rather than an exact date, an age band rather than a birth date. 'CFO of a 12-person Boise startup' becomes 'a finance lead at a small firm in the Mountain West.' You keep the meaning and lose the fingerprint.

Apply the same generalization everywhere, then check that the generalized details, taken together, don't still single someone out. One over-specific line can undo the whole pass. That's why a find-and-replace on the name is never anonymization on its own.
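The widening pattern described above can be sketched as a few small Python helpers. The city-to-region lookup, the ten-year band width, and the function names are illustrative assumptions; pick scales that suit your own dataset.

```python
from datetime import date

# Hypothetical lookup; extend it with the places that occur in your transcripts.
CITY_TO_REGION = {"Boise": "the Mountain West", "Salt Lake City": "the Mountain West"}

def generalize_city(city):
    """Widen a city to its region; fall back to a generic label if unknown."""
    return CITY_TO_REGION.get(city, "[region]")

def generalize_date(d):
    """Keep only the year and month of an exact date."""
    return f"{d.year}-{d.month:02d}"

def age_band(age, width=10):
    """Map an exact age to a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

print(generalize_city("Boise"), generalize_date(date(2021, 3, 14)), age_band(47))
```

Routing every occurrence through the same helper is one way to keep the generalization consistent across a transcript; whether the widened values are wide enough still depends on how small the group is.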

What about direct quotes and the master copy?

Direct quotes are where anonymization and attribution collide. Rich, verbatim detail is what makes a quote worth using and exactly what makes a speaker identifiable (Kaiser, 2009). When a quote you've pulled carries identifying specifics, generalize inside brackets: '[a colleague]' for a named person, '[a large hospital]' for the employer, so the line still reads as theirs.

Never redact your only file. Build a clean working copy with an interview transcript formatter, run the redaction pass on that copy, and keep the un-redacted master in access-controlled storage. You'll want the original if you ever need to verify a quote or restore the true attribution.
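One way to enforce the copy-first habit is to script it. This Python sketch copies the master into a working folder and then removes write permission from the master. The function name and the '.redacted' file-name convention are assumptions for illustration, and a read-only flag only guards against accidental edits; it is not a substitute for access-controlled storage.

```python
import os
import shutil
import stat
from pathlib import Path

def make_working_copy(master, workdir):
    """Copy the master into workdir and mark the master read-only."""
    master = Path(master)
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    copy_path = workdir / f"{master.stem}.redacted{master.suffix}"
    shutil.copy2(master, copy_path)
    # Remove write permission from the master to guard against accidental edits.
    os.chmod(master, stat.S_IRUSR)
    # copy2 also copies permission bits, so make sure the copy stays writable.
    os.chmod(copy_path, stat.S_IRUSR | stat.S_IWUSR)
    return copy_path
```

Calling `make_working_copy("interview.txt", "work")` would leave `interview.txt` untouched and return the path `work/interview.redacted.txt`, which is the file you then edit.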

Where the audio and transcript live matters for sensitive sources. Pepys never trains on your audio or text. Source media auto-deletes 30 days after upload by default, and an unclaimed anonymous job is purged in about 12 hours. The transcript itself stays until you delete it. So an off-the-record recording isn't sitting on a server indefinitely.

The steps, in order

  1. Work on a copy, never the master

    Duplicate the corrected transcript and make every change on the copy. Keep the original, un-redacted file in access-controlled storage so you can still verify a quote or its attribution later.

  2. Remove the direct identifiers

    Strip everything that names a person outright: full names, addresses, phone numbers, emails, employers, and ID numbers. Replace each with a consistent role label or code, not a blank gap.

  3. Find the quasi-identifiers

    Scan for details that combine to single someone out: job title, exact age or birth date, city, employer size, rare conditions, or unusual events. A ZIP, gender, and birth date alone can identify most people.

  4. Generalize, don't just delete

    Widen each quasi-identifier by one scale: a region instead of a city, a month or year instead of a date, an age band instead of an age. Preserve the meaning while removing the fingerprint.

  5. Check for residual re-identification

    Read the whole transcript and test whether any two remaining details still point at one person. Fix over-specific quotes by generalizing inside brackets before you share or publish.

  6. Store any code key separately

    If you mapped names to codes, keep that key in a separate, protected location. A transcript with a reachable key is pseudonymized, not anonymized, and still counts as personal data.
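Part of the residual check can be scripted. The Python sketch below scans a redacted transcript for any term from a list of known sensitive strings (for example, the names in your code key) and reports the line where each one survives. The function name and sample text are hypothetical, and the scan only catches literal leftovers; judging whether generalized details still combine to identify someone remains a manual read.

```python
import re

def find_leaks(redacted_text, sensitive_terms):
    """Return (line_number, term) pairs for sensitive terms still present."""
    leaks = []
    for number, line in enumerate(redacted_text.splitlines(), start=1):
        for term in sensitive_terms:
            if re.search(rf"\b{re.escape(term)}\b", line, flags=re.IGNORECASE):
                leaks.append((number, term))
    return leaks

# Hypothetical redacted text in which a surname and a city slipped through.
redacted = "[Manager] joined in 2019.\nLater, Whitfield moved to the Boise office."
leaks = find_leaks(redacted, ["Dana", "Whitfield", "Boise"])
print(leaks)  # [(2, 'Whitfield'), (2, 'Boise')]
```

An empty result means none of the listed terms remain; it does not mean the transcript is anonymous.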

Tips from people who do this a lot

  • Removing the name is the smallest part of the job. A 5-digit ZIP, gender, and date of birth alone uniquely identify most of the US population, so hunt the quasi-identifiers hardest.

  • Replace, don't blank. Swapping a name for '[Manager]' keeps the sentence readable and the analysis codable; a black bar or empty space breaks the flow and hides who's speaking.

  • Be consistent: the same person should get the same pseudonym or role label everywhere, or a careful reader will stitch the fragments back together.

  • Watch the small-dataset problem – in a 12-person team or a rare specialty, even 'a senior manager' can name someone. Generalize to the group, not the role.

  • A code key is not anonymization. If you can reverse the mapping, so can anyone who reaches the key, so store it apart from the transcript and treat the file as still-personal data.

How to anonymize an interview transcript: questions, answered

Does removing names anonymize a transcript?

No. Names are direct identifiers, but a person can still be singled out by combined details. A 5-digit ZIP, gender, and date of birth uniquely identify roughly 87% of Americans (Sweeney, 2000; revised to ~63% by Golle, 2006). Real de-identification means generalizing quasi-identifiers, not just deleting the name.

What's the difference between anonymized and pseudonymized?

Pseudonymization swaps identifiers for codes but keeps a separate key that can reverse it, so under GDPR it's still personal data. Anonymization means the person can't be re-identified by any means reasonably likely to be used, weighing cost, time, and available technology. A code key kept in a drawer makes a transcript pseudonymized, not anonymous.

Which identifiers should I remove from a transcript?

Start with direct identifiers: names, addresses, phone numbers, emails, employers, and ID numbers. HIPAA's Safe Harbor method lists 18 such identifiers as a baseline. Then handle quasi-identifiers, like job title, city, exact dates, and employer size, that combine to re-identify someone.

How do I anonymize a direct quote without ruining it?

Generalize inside brackets instead of deleting. Replace a named person with '[a colleague]', an employer with '[a large hospital]', and an exact location with a region. The quote keeps the speaker's words and meaning while losing the specifics that point at one person.

Is it enough to keep a separate key linking names to codes?

That's pseudonymization, not anonymization. GDPR treats data as still identifiable if additional information could re-attribute it, judged by means reasonably likely to be used. Keep any key in separate, protected storage, and don't assume a coded transcript is safe to publish.

References

  1. Sweeney 2000, Data Privacy Working Paper 3 (Carnegie Mellon). Source: Carnegie Mellon University Data Privacy Lab.
  2. Golle 2006, 'Revisiting the Uniqueness of Simple Demographics in the US Population' (WPES'06, ACM). Source: ACM Workshop on Privacy in the Electronic Society (peer-reviewed).
  3. Kaiser 2009, Qualitative Health Research 19(11):1632–1641. Source: PMC / U.S. National Institutes of Health (peer-reviewed article).
  4. GDPR Article 4(5) (Regulation (EU) 2016/679). Source: gdpr-info.eu (EU regulation text; canonical = EUR-Lex).
  5. GDPR Recital 26 (Regulation (EU) 2016/679). Source: gdpr-info.eu (EU regulation text; canonical = EUR-Lex).
  6. 45 CFR § 164.514(b)(2) – HIPAA Safe Harbor. Source: Cornell Law School Legal Information Institute (official CFR text).
  7. 45 CFR § 46.102(e)(5) – Common Rule definitions. Source: Cornell Law School Legal Information Institute (official CFR text).
  8. Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymisation Techniques (WP216). Source: European Commission / Article 29 Working Party (EU standards body).
