How to transcribe user interviews

A working guide for UX and qualitative researchers: how to go from a recorded session to speaker-labeled, tagged, quotable text you can drop into your repository.

The short answer

To transcribe a user interview, record each speaker on a separate track, then upload the file to get a speaker-labeled, timestamped draft in minutes instead of hours of typing. Clean the moderator and participant labels, tag the transcript for themes, and export to DOCX or JSON so quotes and highlight clips drop into your research repository, each tied to the exact second it was said.

Record with diarization in mind

Speaker separation starts at the microphone, not in the software. For a moderated session, record each person on their own track where the platform allows it. A separate file for the moderator and the participant makes automatic speaker labeling far cleaner, because the tool isn't guessing who's talking when the two of you overlap. If you only have one mixed file, that's fine – you'll just fix more speaker turns by hand. For the recording-quality basics like mic placement, room noise, and remote per-channel capture, start with the interview transcription hub and come back here for the research-specific steps.
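If you do end up with one transcript per track, the two have to be interleaved back into a single conversation. Here is a minimal Python sketch of that step. It assumes both tracks started recording at the same instant and that each track arrives as a list of (start seconds, text) pairs. The `merge_tracks` helper and the sample lines are made up for illustration and are not any tool's actual format.

```python
# Made-up sample data: one list of (start_seconds, text) per recorded track.
tracks = {
    "Moderator": [
        (0.0, "Today is session three of the checkout study."),
        (9.5, "What would you do first?"),
    ],
    "Participant": [
        (6.1, "Thanks for having me."),
        (12.0, "I would look for the cart."),
    ],
}

def merge_tracks(tracks):
    """Flatten per-track segments into one list of turns, ordered by start time.

    Assumes every track shares the same clock (hypothetical helper).
    """
    merged = [
        {"speaker": label, "start": start, "text": text}
        for label, segments in tracks.items()
        for start, text in segments
    ]
    return sorted(merged, key=lambda turn: turn["start"])

conversation = merge_tracks(tracks)
for turn in conversation:
    print(f'{turn["start"]:>6.1f}s  {turn["speaker"]}: {turn["text"]}')
```

Because the speaker label comes from the track itself, nothing has to guess who is talking, which is the whole benefit of recording separately.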

You'll do this a lot, so a repeatable setup pays off. Most usability findings surface after only a handful of participants: Jakob Nielsen's long-standing advice is to test with no more than five users and run as many small studies as you can afford. That means a steady stream of short recordings across rounds, not one marathon file. Nail the capture once and every session after it is easier.

Open every session the same way. Say the date, the study name, and who's in the room into the recording before the first question, and confirm consent to record while you're at it. That dates the consent, anchors which voice is the moderator, and saves you re-listening later just to work out who 'Speaker 1' is.

Transcribe the user interview, then clean the labels your quotes hang on

Typing a session by hand is the slowest part of the job. Manual transcription of a one-hour interview can run up to six hours of work, most of a day per participant. An AI first pass turns that into a few minutes of processing plus a focused cleanup. It returns the session already split into moderator and participant turns, with timestamps you can cite.

Fix the labels before you tag anything. Diarization is usually accurate, but it can swap a speaker during crosstalk or split one person into two labels across a long session. Rename a speaker once so the change flows through the whole transcript, then skim the turns where you and the participant talk over each other. Getting who-said-what right now saves you from mis-attributing a quote later.
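If you are working with an exported draft rather than inside an editor, the rename-once idea can be sketched in a few lines of Python. This assumes the draft is a list of turns with `speaker`, `start`, and `text` fields. Those field names, the `relabel` helper, and the sample turns are assumptions for illustration, not a real export schema.

```python
# Made-up sample turns; the field names are hypothetical.
turns = [
    {"speaker": "Speaker 1", "start": 0.0, "text": "Today is session three of the checkout study."},
    {"speaker": "Speaker 2", "start": 6.4, "text": "Happy to be here."},
    {"speaker": "Speaker 3", "start": 912.0, "text": "I could not find the save button."},
]

# One mapping, applied everywhere. "Speaker 3" is the participant,
# split into a second label partway through the session.
label_map = {
    "Speaker 1": "Moderator",
    "Speaker 2": "Participant",
    "Speaker 3": "Participant",
}

def relabel(turns, label_map):
    """Return new turns with speakers renamed; labels not in the map are kept."""
    return [
        {**turn, "speaker": label_map.get(turn["speaker"], turn["speaker"])}
        for turn in turns
    ]

cleaned = relabel(turns, label_map)
for turn in cleaned:
    print(f'{turn["start"]:>7.1f}s  {turn["speaker"]}: {turn["text"]}')
```

The mapping fixes names and merges split labels. It cannot fix a turn attributed to the wrong person during crosstalk, so the manual skim of overlapping turns still matters.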

Word-level timestamps are what make a highlight clip possible. Because every word carries the moment it was said, you can find the point where a participant froze on a confusing screen, jump straight back to it in the recording, and cut a short clip for stakeholders. A quote with a timestamp is evidence. A quote without one is a memory.
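One way to turn a timestamped quote into a clip is to pad the quote's time range slightly and hand it to ffmpeg. The Python sketch below only builds and prints the command. The file names, the two-second padding, and both helper functions are assumptions for illustration. The `-i`, `-ss`, and `-to` options are standard ffmpeg options.

```python
def clip_window(start_s, end_s, pad_s=2.0):
    """Pad a quote's time range so the clip keeps a little context on both sides."""
    return max(0.0, start_s - pad_s), end_s + pad_s

def ffmpeg_clip_command(source, out, start_s, end_s):
    """Build (but do not run) an ffmpeg command that cuts start_s..end_s from source."""
    return ["ffmpeg", "-i", source, "-ss", f"{start_s:.2f}", "-to", f"{end_s:.2f}", out]

# Hypothetical quote spanning 754.2s to 768.9s in a hypothetical recording.
start, end = clip_window(754.2, 768.9)
cmd = ffmpeg_clip_command("session_03.mp4", "p3_checkout_freeze.mp4", start, end)
print(" ".join(cmd))
```

Placing `-ss` and `-to` after the input makes ffmpeg re-encode and cut at the requested times, which is slower than a stream copy but avoids clips that start on the wrong frame.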

Tag the transcript, then cluster the tags into themes

Tagging comes before themes, not the other way around. If your analysis is thematic, the common reference is Braun and Clarke's 2006 paper, which presents thematic analysis as an accessible, flexible approach to analysing qualitative data. You read each transcript, tag every relevant line with a short code, and only then look for patterns across the codes. Coding first keeps you from bending the data toward the story you expected.

Clustering the codes is where themes appear. Affinity diagramming is the standard move: organizing related observations and findings into distinct clusters until a theme names itself. The technique is also called the KJ method, after Jiro Kawakita. Do it on sticky notes or in a whiteboard tool. The point is grouping by natural similarity, not by the question you happened to ask.
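The order of operations, code first and cluster second, can be sketched as two separate data structures in Python. The codes, quotes, and cluster names below are made-up sample data, and the grouping itself is still a human judgment. The script only records that judgment and tallies the result.

```python
from collections import defaultdict

# Step 1: codes attached to individual lines (made-up sample data).
coded_lines = [
    {"participant": "P1", "start": 312.5, "code": "cannot find save", "quote": "Where do I save this?"},
    {"participant": "P2", "start": 120.0, "code": "unsure if saved", "quote": "Did that go through?"},
    {"participant": "P3", "start": 845.1, "code": "price surprise", "quote": "I did not expect that fee."},
]

# Step 2: only after coding, a person groups related codes into clusters.
clusters = {
    "cannot find save": "Saving feels uncertain",
    "unsure if saved": "Saving feels uncertain",
    "price surprise": "Costs appear late",
}

# Collect every coded line under the cluster its code belongs to.
themes = defaultdict(list)
for line in coded_lines:
    themes[clusters[line["code"]]].append(line)

for theme, lines in themes.items():
    participants = sorted({line["participant"] for line in lines})
    print(f"{theme}: {len(lines)} lines from {participants}")
```

Keeping the code-to-cluster mapping separate from the coded lines means you can regroup the sticky notes without retagging the transcripts.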

Search is how you confirm a theme is real. Once a candidate theme emerges, a searchable transcript lets you find every mention of the idea across sessions in seconds, instead of scrubbing a dozen recordings one by one. That's the difference between 'a few people said this' and a counted, quotable pattern you can defend.
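Counting mentions across sessions can be as simple as a keyword search over the transcripts. Here is a minimal Python sketch, assuming each session is a list of (start seconds, speaker, text) turns. The session data, the `find_mentions` helper, and the search pattern are all made up for illustration.

```python
import re

# Made-up sample: session id -> list of (start_seconds, speaker, text).
sessions = {
    "P1": [(312.5, "Participant", "Where do I save this?"),
           (400.0, "Moderator", "What did you expect?")],
    "P2": [(120.0, "Participant", "I am not sure it saved."),
           (180.0, "Participant", "Saving should be automatic.")],
    "P3": [(845.1, "Participant", "I did not expect that fee.")],
}

def find_mentions(sessions, pattern, speaker="Participant"):
    """Return (session, start, text) for every matching turn by the given speaker."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [
        (session_id, start, text)
        for session_id, turns in sessions.items()
        for start, who, text in turns
        if who == speaker and rx.search(text)
    ]

hits = find_mentions(sessions, r"\bsav(e|ed|ing)\b")
sessions_with_hits = sorted({session_id for session_id, _, _ in hits})
print(f"{len(hits)} mentions across {len(sessions_with_hits)} of {len(sessions)} sessions")
```

Filtering to the participant's turns keeps the moderator's own wording from inflating the count. A keyword match still misses paraphrases, so treat the number as a floor and read the hits in context.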

Export into a repository your team can search

A finding no one can find isn't a finding. A research repository is a central place where research artifacts are stored so others can access them and search them by keyword to find insights quickly. Export each transcript in a format your repository ingests, tag it with the study and participant group, and the next researcher finds it without asking you.

Match the export to the destination. For a written appendix or a repository that takes documents, export a clean speaker-labeled DOCX. For import into NVivo, Atlas.ti, or a structured repository, JSON carries the speakers and timestamps as data you can query. SRT or VTT keep timed captions if you're archiving the video alongside the transcript. Pepys exports TXT, DOCX, PDF, SRT, VTT, and JSON.
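To show what "data you can query" means in practice, here is a minimal Python sketch that pulls one speaker's quotes out of a JSON transcript. The JSON shape (a `segments` list with `speaker`, `start`, `end`, and `text`) is a hypothetical example, not the documented schema of any tool, so adjust the keys to whatever your export actually contains.

```python
import json

# Hypothetical export shape with made-up content; a real export will differ.
raw = """
{
  "study": "checkout-round-2",
  "segments": [
    {"speaker": "Moderator", "start": 10.0, "end": 14.2, "text": "Walk me through what you see."},
    {"speaker": "Participant", "start": 14.9, "end": 21.3, "text": "I am looking for a way to save."}
  ]
}
"""

transcript = json.loads(raw)

def quotes_by(transcript, speaker):
    """Return (start, end, text) for each segment spoken by `speaker`."""
    return [
        (seg["start"], seg["end"], seg["text"])
        for seg in transcript["segments"]
        if seg["speaker"] == speaker
    ]

participant_quotes = quotes_by(transcript, "Participant")
for start, end, text in participant_quotes:
    print(f"{start:.1f}-{end:.1f}s: {text}")
```

The same filter is awkward to do on a DOCX, which is why a structured format suits analysis and a document format suits the write-up.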

Keep the timestamps in whatever you export. A transcript stripped of its timing is far weaker in a repository: a colleague can read the quote but can't jump to the audio to hear tone, hesitation, or context. The timestamped version is the one that earns trust when someone reuses your data a year later.
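If your timestamps are stored as raw seconds, a small formatter keeps them readable when a quote is pasted into a report. The Python sketch below renders seconds in the HH:MM:SS.mmm style that WebVTT uses and attaches it to a quote. The `cite` format and the sample quote are assumptions for illustration.

```python
def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm, the timestamp style WebVTT uses."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def cite(participant, start_s, text):
    """Render a quote with its source and time so a reader can jump to the audio."""
    return f'"{text}" ({participant}, {vtt_timestamp(start_s)})'

# Made-up quote from a hypothetical third participant.
print(cite("P3", 845.1, "I did not expect that fee."))
```

Working in whole milliseconds before splitting into hours, minutes, and seconds avoids rounding surprises such as a timestamp that shows 60 seconds.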

Consent, confidentiality, and where the recordings live

Get consent to record, and capture it in the recording itself. Recording law varies by jurisdiction – some places need only one party's consent, others require everyone's – so ask for a clear yes before the substance starts. Under the GDPR, an identifiable participant on a recording is a data subject, and where consent is your legal basis it must be freely given, specific, informed, and unambiguous. This is general information, not legal advice.

Confidentiality is a promise you keep in the transcript. When a participant needs protecting, work on a copy and strip the details that re-identify them, from names and employers to a rare job title, while keeping the un-redacted master somewhere access-controlled. Removing the name alone rarely does it, because combined details still point to a person.
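A first redaction pass over the working copy can be sketched as a list of known identifying details and their neutral replacements. The name, employer, and job title below are invented, and the `redact` helper is hypothetical. This is a starting point under those assumptions, not a complete anonymization method.

```python
import re

# Invented identifying details and their neutral replacements.
replacements = {
    "Dana Whitfield": "[PARTICIPANT]",
    "Northgate Credit Union": "[EMPLOYER]",
    "head of treasury operations": "[JOB TITLE]",
}

def redact(text, replacements):
    """Replace each listed detail, ignoring case, longest detail first."""
    for detail in sorted(replacements, key=len, reverse=True):
        text = re.sub(re.escape(detail), replacements[detail], text, flags=re.IGNORECASE)
    return text

# The master stays untouched; only the working copy is redacted.
master = "I'm Dana Whitfield, Head of Treasury Operations at Northgate Credit Union."
working_copy = redact(master, replacements)
print(working_copy)
```

A list only catches what you put on it. Indirect details, such as a product only one company sells, still need a human read-through before the copy is shared.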

For IRB-governed or privileged material, mind where the audio lives. Use a tool that doesn't train models on your files and doesn't hold the recording forever. Pepys never trains on your audio, and by default the source recording is deleted 30 days after upload, while your transcript and every export are kept for as long as you need them.

The steps, in order

1. Record each speaker on a separate track. Capture the moderator and participant on their own channels where the platform allows it, cut background noise, and state the date, study name, and who's in the room before the first question.
2. Upload for an AI first pass. Drop the recording in or paste a link and get a speaker-labeled, timestamped draft in minutes instead of up to six hours of manual typing.
3. Clean the speaker labels. Rename the moderator and participant once so it updates throughout, and fix the turns where you and the participant overlap before you tag anything.
4. Tag lines, then cluster into themes. Assign a short code to each relevant line, then group related codes into clusters with affinity diagramming until the recurring themes surface.
5. Export into your research repository. Export a speaker-labeled DOCX for write-ups or JSON for NVivo and Atlas.ti, keeping timestamps so every quote links back to the exact second it was said.

Tips from people who do this a lot

Record the moderator and participant on separate tracks. It's the single biggest upgrade to speaker labeling, more than any setting in the transcription tool.
Fix the diarization labels before you tag. A mis-attributed quote caught at export is far more expensive than a rename at the start of analysis.
Build a highlight index as you read: note the timestamp of each moment worth showing stakeholders, then cut the clips from those points instead of re-scrubbing.
Tag first, theme second. Code every relevant line before you name a single theme, or you'll bend the data toward the story you expected to find.
Keep timestamps in the exported file. A repository transcript that has lost its timing can be read but not verified against the audio.

How to transcribe user interviews – questions, answered

How do you transcribe a user interview?

Record each speaker on a separate track, then upload the file to get a speaker-labeled, timestamped draft in minutes. Clean the moderator and participant labels, tag the lines for themes, and export to DOCX or JSON so quotes drop into your repository already attributable. That's far faster than typing, which can run up to six hours per audio hour.

How do I separate the moderator from the participant?

Record each on their own channel where you can, so diarization isn't guessing during crosstalk. Speaker labeling then returns clear moderator and participant turns rather than one block. Rename a speaker once and it updates throughout. With a single mixed file you'll still get labels, but expect to fix more turns around overlapping speech.

How do timestamps help with highlight clips?

Word-level timestamps tie every word to the moment it was said, so you can find the point where a participant hit a snag and jump straight to it to cut a short clip. A timestamped quote also links back to the audio for anyone who reuses the data later, which keeps the evidence trustworthy.

How should I tag a transcript for themes?

If your analysis is thematic, assign a short code to each relevant line first, then cluster related codes with affinity diagramming until themes emerge. Braun and Clarke's method is the common reference. Search the transcript to confirm a theme recurs across sessions rather than trusting a single memorable quote.

Is it private enough for identifiable participants?

Capture consent in the recording itself, and use a tool that doesn't train on your files. Pepys never trains on your audio, and by default the source recording is deleted 30 days after upload while your transcript and exports are kept. For confidentiality, anonymize a copy and keep the un-redacted master access-controlled.

References

1. Haberl et al. (2023), 'Take the aTrain', arXiv:2310.11967, citing Bell et al. (2018) – arXiv / University of Graz
2. Jakob Nielsen, NNG, 'Why You Only Need to Test with 5 Users' (2000) – Nielsen Norman Group
3. Krause & Pernice, NNG, 'Affinity Diagramming' (2024) – Nielsen Norman Group
4. Interaction Design Foundation, 'Affinity Diagrams' – Interaction Design Foundation
5. Maria Rosala, NNG, 'Research Repositories' (2024) – Nielsen Norman Group
6. RCFP, 'Introduction to the Reporter's Recording Guide' – Reporters Committee for Freedom of the Press
7. GDPR Recital 26 (Regulation (EU) 2016/679) – gdpr-info.eu (reproduces official EU Regulation 2016/679)
8. GDPR consent standard (Art. 7 / Recital 32) – gdpr-info.eu (reproduces official EU Regulation 2016/679)
9. Braun & Clarke (2006), 'Using thematic analysis in psychology', Qualitative Research in Psychology 3(2):77-101 – Taylor & Francis / Qualitative Research in Psychology
