Get consent before you press record
Before you record a call, make sure you're allowed to. Under U.S. federal law, it's lawful to record a conversation you're a party to, or where one party has given prior consent – the one-party-consent baseline (18 U.S.C. Sec. 2511). That's the floor, not the ceiling.
State law can be stricter. About a dozen states require all-party consent (11, by the Reporters Committee's count), meaning everyone on the line has to agree. That list includes California, Florida, Illinois, Maryland, Massachusetts, Pennsylvania, and Washington. A phone interview often crosses state lines, and the safe default is to assume the stricter state's law applies. When you're not sure, ask and get a clear yes on the recording itself.
Say the date, the participants, and the consent out loud at the top of the call, so your permission is timestamped in the audio. For the full picture on what's legal where (and how consent interacts with all-party rules), work through the legal walk-through on recording and transcribing. This guide keeps consent practical; that one goes deep. We can't give legal advice, so treat a clear yes as the safe default.
Phone audio is narrowband by design
Phone calls throw away audio that speech recognition depends on. Traditional telephone lines are band-limited to 300–3400 Hz and sampled at 8 kHz (so-called narrowband), which strips out higher-frequency detail useful for telling many speech sounds apart (Liu, Fu & Narayanan, 2009). That's the baseline every phone transcript inherits.
The 8 kHz PCM standard behind that is ITU-T G.711, first published in 1972; the edition still in force dates from 1988. Mobile and VoIP calls add lossy compression on top: G.729 codes speech at just 8 kbit/s to save bandwidth, discarding detail it judges inaudible. The recording never had the fidelity of an in-person mic – the codec decided that before you pressed record.
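The numbers in this section are easy to check by hand. The sketch below is only back-of-envelope arithmetic, assuming G.711's one 8-bit companded sample per 8 kHz tick; the function names are made up for illustration.

```python
# Back-of-envelope numbers for the codecs named above.
SAMPLE_RATE_HZ = 8_000        # narrowband telephone sampling rate
G711_BITS_PER_SAMPLE = 8      # G.711 stores one companded 8-bit value per sample
G729_BITRATE_BPS = 8_000      # G.729 speech bitrate (8 kbit/s)


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bitrate of a single-channel stream of fixed-size samples."""
    return sample_rate_hz * bits_per_sample


g711_bps = pcm_bitrate_bps(SAMPLE_RATE_HZ, G711_BITS_PER_SAMPLE)
print(nyquist_hz(SAMPLE_RATE_HZ))    # 4000.0
print(g711_bps)                      # 64000
print(g711_bps / G729_BITRATE_BPS)   # 8.0
```

So an 8 kHz stream cannot carry anything above 4 kHz, and a G.729 call runs on one eighth of the bits of a G.711 call.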
VoIP adds one more hazard: dropped packets. Packet loss leaves gaps in the audio stream when data fails to arrive, and it's a growing problem as calls move to internet telephony. A half-second dropout mid-sentence is a word your transcriber will never recover. Knowing the audio's ceiling is set this low tells you where to spend your editing time later.
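If you want a rough map of where dropouts sit before you start editing, one option is to scan the decoded samples for stretches of flat silence. This is a minimal sketch, not a packet-loss detector: it assumes the gap shows up as near-zero samples, which is only one symptom (some phone systems fill lost packets with synthesized audio instead). The function name, thresholds, and toy signal are all illustrative.

```python
def find_dropouts(samples, sample_rate_hz, min_gap_s=0.2, silence_level=0):
    """Return (start_s, end_s) spans where |sample| <= silence_level
    for at least min_gap_s seconds."""
    gaps = []
    run_start = None
    for i, value in enumerate(samples):
        if abs(value) <= silence_level:
            if run_start is None:
                run_start = i          # a silent run begins here
        elif run_start is not None:
            if (i - run_start) / sample_rate_hz >= min_gap_s:
                gaps.append((run_start / sample_rate_hz, i / sample_rate_hz))
            run_start = None
    # Close a silent run that lasts until the end of the audio.
    if run_start is not None:
        if (len(samples) - run_start) / sample_rate_hz >= min_gap_s:
            gaps.append((run_start / sample_rate_hz, len(samples) / sample_rate_hz))
    return gaps


# Toy signal at 8 kHz: 1 s of "speech", 0.5 s of flat zeros, 1 s of "speech".
rate = 8_000
signal = [100] * rate + [0] * (rate // 2) + [100] * rate
print(find_dropouts(signal, rate))  # [(1.0, 1.5)]
```

Ordinary pauses will trip this too, so treat each span as a place to listen, not as proof of a lost packet.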
Capture the call as cleanly as the line allows
You can't add fidelity a phone line stripped out, but you can avoid making the recording worse. The cleanest capture records each side separately. If the interview runs over a computer, a per-track recording gives you the source on one channel and yourself on another, which keeps the two voices from bleeding together.
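Some recorders save the two sides as the left and right channels of a single stereo file rather than as two files. In that case the voices can be pulled apart afterward. The sketch below uses only Python's standard `wave` module and assumes an uncompressed 2-channel PCM WAV; the function name is illustrative, and which side lands on which channel depends on your recorder.

```python
import wave


def split_stereo(path, left_path, right_path):
    """Write each channel of a 2-channel PCM WAV file to its own mono file."""
    with wave.open(path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a 2-channel recording")
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Each stereo frame is interleaved: one left sample, then one right sample.
    frame_size = 2 * width
    starts = range(0, len(frames), frame_size)
    left = b"".join(frames[i:i + width] for i in starts)
    right = b"".join(frames[i + width:i + frame_size] for i in starts)

    for out_path, data in ((left_path, left), (right_path, right)):
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(data)
```

A call such as `split_stereo("call.wav", "me.wav", "source.wav")` (hypothetical file names) leaves the original untouched and writes two mono files you can transcribe separately.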
When you only have the handset, use a dedicated call-recording app rather than a speakerphone in a noisy room. Speakerphone pulls in the fan, the keyboard, and the room echo on top of an already-thin signal. A wired or app-based recording of the call itself skips the room and keeps the compressed-but-clean line audio.
Record a 10-second test and play it back before the real conversation starts. A dropped connection or a muted mic caught now costs you seconds; caught after the interview, it costs you the whole thing. Note the file's start time so you can line up timestamps against your notes afterward.
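Lining up the recording against your notes is simple clock arithmetic once you have the start time. A minimal sketch, with a made-up start time and illustrative helper names:

```python
from datetime import datetime, timedelta


def offset_label(seconds: float) -> str:
    """Format an offset into the recording as MM:SS (H:MM:SS past an hour)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def wall_clock(recording_start: datetime, seconds: float) -> datetime:
    """Clock time at which a given offset in the recording happened."""
    return recording_start + timedelta(seconds=seconds)


start = datetime(2024, 3, 5, 14, 30, 0)   # hypothetical start time you noted
print(offset_label(724))                             # 12:04
print(wall_clock(start, 724).strftime("%H:%M:%S"))   # 14:42:04
```

A note you scribbled at 14:42 then points you to roughly 12:04 in the file, and the reverse.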
One mixed line makes who-said-what harder
If both voices land on a single mixed track, labeling who said what gets harder – especially when you talk over each other. Overlapping speech is a leading source of speaker-diarization error, and in strict scoring it counts directly against the error rate (DIHARD II, 2020). Separate channels sidestep most of that, because each voice is already isolated.
On one mono line, the software has to infer each speaker turn from voice characteristics alone, and a phone call's narrow band gives it less to work with. That's reasoning, not a benchmark: the fewer acoustic cues in the signal, the more turns you'll fix by hand. Automatic speaker labeling is still the tool to lean on for a mixed track; let it make the first pass, then correct the boundaries where you cut each other off.
Expect the most errors around crosstalk and quick back-and-forth. When you and the source overlap, the transcript may drop a turn or merge two people into one – so those are the spots to check against the audio. A short question you asked can get folded into the source's answer, which changes who owns the quote.
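If you have rough start and end times for each person's turns (from separate channels, say, or from a labeled draft), you can list the overlaps up front instead of hunting for them. This is a sketch under stated assumptions: the turn lists are hypothetical, each is sorted by start time, and one speaker's own turns don't overlap each other.

```python
def find_overlaps(turns_a, turns_b):
    """Return (start_s, end_s) spans where a turn in A overlaps a turn in B.

    Each input is a list of (start_s, end_s) tuples sorted by start time.
    """
    overlaps = []
    i = j = 0
    while i < len(turns_a) and j < len(turns_b):
        start = max(turns_a[i][0], turns_b[j][0])
        end = min(turns_a[i][1], turns_b[j][1])
        if start < end:
            overlaps.append((start, end))
        # Advance whichever turn finishes first.
        if turns_a[i][1] <= turns_b[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


reporter = [(0.0, 4.0), (10.0, 11.5)]
source = [(3.5, 10.5), (12.0, 20.0)]
print(find_overlaps(reporter, source))  # [(3.5, 4.0), (10.0, 10.5)]
```

Each span it returns is a stretch of crosstalk to replay before you trust who owns the words around it.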
Transcribe the phone interview with an AI first pass
Once you have the recording, don't type it out. Manual transcription can take up to six hours of work for a single hour of audio (Bell et al., via Haberl et al., 2023) – most of a working day. An AI first pass built for reporting turns that into a few minutes of processing plus a focused cleanup.
From there, the workflow is the same as any interview transcript: read the draft against the audio and fix the spots AI struggles with. On a phone call that means names, jargon, numbers said fast, and every overlapped turn. Bracket anything you genuinely can't hear as [inaudible] with its timestamp rather than guessing at a quote.
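If your transcription tool exports per-word timing and a confidence score (not every tool does, and the format below is an assumption), a few lines can pre-mark the shakiest stretches in that bracketed style. The function name, the `(text, start_s, confidence)` tuples, and the 0.5 threshold are all hypothetical.

```python
def mark_inaudible(words, min_confidence=0.5):
    """Replace each run of low-confidence words with one [inaudible MM:SS] marker.

    `words` is a list of (text, start_s, confidence) tuples in spoken order.
    """
    out = []
    in_gap = False
    for text, start_s, confidence in words:
        if confidence < min_confidence:
            if not in_gap:
                # Stamp the marker with the time the uncertain run begins.
                minutes, secs = divmod(int(start_s), 60)
                out.append(f"[inaudible {minutes:02d}:{secs:02d}]")
                in_gap = True
        else:
            out.append(text)
            in_gap = False
    return " ".join(out)


draft = [
    ("We", 722.0, 0.95),
    ("spent", 722.4, 0.91),
    ("forty", 724.1, 0.30),
    ("million", 724.5, 0.22),
    ("last", 725.0, 0.88),
    ("year", 725.3, 0.93),
]
print(mark_inaudible(draft))  # We spent [inaudible 12:04] last year
```

Treat the markers as prompts to go back and listen: a word the software doubted may be perfectly clear to your ear, and a confident word can still be wrong.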
Clean only the lines you'll publish, to one consistent verbatim style, and keep the timestamps so every quote stays re-checkable. A phone transcript carries more uncertain patches than a studio recording, so a quote you can trace back to 12:04 in the audio is worth more than a paragraph of tidy but unverifiable text.