Transcription engines resample everything to 16 kHz
Before an AI model reads a single word, it standardizes your audio. Whisper, the engine behind much of today's automatic transcription, resamples all incoming audio to 16,000 Hz before it analyzes anything; the official implementation hard-codes that 16 kHz sample rate. So whether you feed it a 96 kHz studio master or a phone memo, the model works from the same 16 kHz picture of your sound.
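If you want to hear what the engine will hear, you can do the same conversion yourself. The sketch below only builds an ffmpeg command line; it is not Whisper's own loader, it assumes ffmpeg is installed, and the function name and file names are placeholders invented for illustration.

```python
def to_16k_command(src: str, dst: str, sample_rate: int = 16_000) -> list[str]:
    """Build an ffmpeg command that resamples `src` to a 16-bit PCM WAV file."""
    return [
        "ffmpeg",
        "-i", src,                # input file in any format ffmpeg can decode
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM, the usual WAV payload
        dst,
    ]


# Hypothetical file names, used only to show the resulting command.
cmd = to_16k_command("interview.m4a", "interview_16k.wav")
print(" ".join(cmd))
```

To run the conversion, pass the list to subprocess.run(cmd, check=True). Listening to the result is a quick way to confirm that nothing you care about lives above 8 kHz.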
That 16 kHz target isn't arbitrary. By the Nyquist–Shannon sampling theorem, a signal is fully captured when the sample rate is more than twice its highest frequency, so 16 kHz audio represents content up to 8 kHz. Human speech, including the high hiss of consonants, sits almost entirely below that line. Recording at a higher sample rate won't sharpen the transcript, because the engine throws the extra range away before it reads a word.
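The arithmetic is simple enough to check directly. This is a minimal sketch, assuming the engine resamples to 16 kHz as described above; the function names are invented for illustration.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def usable_bandwidth_hz(source_rate_hz: int, engine_rate_hz: int = 16_000) -> float:
    """Bandwidth that survives: the narrower of the source and the engine limit."""
    return min(nyquist_hz(source_rate_hz), nyquist_hz(engine_rate_hz))


for rate in (96_000, 44_100, 16_000, 8_000):
    print(f"{rate:>6} Hz source -> {usable_bandwidth_hz(rate):>5.0f} Hz reaches the model")
```

Every source at or above 16 kHz ends up with the same 8 kHz of bandwidth. Only the 8 kHz source comes out narrower, at 4 kHz.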
Format sets the floor for transcript quality, and this guide stays on that axis: sample rate, bitrate, lossless versus lossy, and channels. Raising accuracy above that floor is a separate lever (mostly cutting room noise and verifying the draft) that we cover elsewhere.
The best audio format for transcription is lossless, but good lossy is close
Lossless and lossy describe what a codec discards. WAV stores every sample uncompressed. FLAC compresses losslessly, so decoding returns exactly the original data at a smaller file size. MP3 and M4A are lossy: they drop detail the ear is least likely to notice. For transcription, that dropped detail rarely matters until the bitrate gets low.
That floor sits low, though. In a controlled study, MP3 compression barely dented recognition accuracy down to about 24 kbps (Pollák & Behúnek, 2011). On a clean channel, their system scored 95.2% on uncompressed WAV, 93.2% at 160 kbps, and 89.3% at 24 kbps, then fell to 21.0% at 8 kbps. A 128 kbps or higher MP3 costs you almost nothing. Squeeze the file harder and you start stripping the signal the model needs.
M4A files use AAC, the same perceptual-coding family as MP3, so the same rule applies: fine at a healthy bitrate, risky when squeezed hard. If you choose at capture, record WAV or FLAC and you never think about bitrate again. If your device only outputs compressed audio, keep it at 128 kbps or above and you'll be fine.
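For a sense of what each choice costs in storage, here is a rough size estimate for an hour of audio. It is a sketch that ignores container headers and metadata, and it leaves FLAC out because its size depends on the content. The function names are invented for illustration.

```python
def pcm_megabytes(seconds: float, sample_rate_hz: int, bits: int, channels: int) -> float:
    """Size of uncompressed PCM audio (the payload of a WAV file), in megabytes."""
    return seconds * sample_rate_hz * (bits / 8) * channels / 1_000_000


def lossy_megabytes(seconds: float, kbps: float) -> float:
    """Size of a constant-bitrate MP3 or AAC stream, in megabytes."""
    return seconds * kbps * 1000 / 8 / 1_000_000


HOUR = 3600
print(f"WAV 44.1 kHz stereo 16-bit: {pcm_megabytes(HOUR, 44_100, 16, 2):.1f} MB")
print(f"WAV 16 kHz mono 16-bit:     {pcm_megabytes(HOUR, 16_000, 16, 1):.1f} MB")
print(f"MP3 at 128 kbps:            {lossy_megabytes(HOUR, 128):.1f} MB")
```

An hour of CD-style WAV comes to about 635 MB, a 16 kHz mono WAV to about 115 MB, and a 128 kbps MP3 to under 58 MB.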
Phone-quality audio strips the detail models rely on
Not every recording that ends up at 16 kHz carries the same information, and telephone audio is the cautionary case. Telephone voice is sampled at just 8,000 Hz under ITU-T G.711, and the channel is band-limited to roughly 300–3400 Hz under G.712. That passband was built for intelligible conversation between people, and it discards detail that machine transcription needs. Upsampling the result to 16 kHz later cannot restore what the channel removed.
The trouble is what falls outside that band. Fricative consonants like /s/, /f/, and /sh/ carry significant energy between 3 and 8 kHz, and the Speech Intelligibility Index weights bands up to about 8 kHz. A 300–3400 Hz phone channel cuts most of that away. Strip the high-frequency detail that separates 'sip' from 'ship,' and the model has fewer cues to work with.
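A back-of-the-envelope overlap calculation shows how little of that region survives. This sketch compares frequency spans only, not acoustic energy, and it takes the 3 to 8 kHz fricative range and the 300–3400 Hz passband from the text above; the names are invented for illustration.

```python
def overlap_hz(band_a: tuple[int, int], band_b: tuple[int, int]) -> int:
    """Width in Hz of the region two frequency bands share (0 if they don't meet)."""
    low = max(band_a[0], band_b[0])
    high = min(band_a[1], band_b[1])
    return max(0, high - low)


FRICATIVE_BAND = (3_000, 8_000)  # range cited above for /s/, /f/, /sh/ energy
PHONE_CHANNEL = (300, 3_400)     # narrowband telephone passband
WIDEBAND_16K = (0, 8_000)        # what a 16 kHz recording can represent


def kept_fraction(channel: tuple[int, int], band: tuple[int, int] = FRICATIVE_BAND) -> float:
    """Share of `band`, by width, that falls inside `channel`."""
    return overlap_hz(channel, band) / (band[1] - band[0])


print(f"Phone channel keeps {kept_fraction(PHONE_CHANNEL):.0%} of the fricative band")
print(f"16 kHz audio keeps  {kept_fraction(WIDEBAND_16K):.0%} of the fricative band")
```

By width, the phone channel keeps only the 3000–3400 Hz sliver, 8% of the range, while a 16 kHz recording keeps all of it.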
In practice, record at 16 kHz or higher and keep the full voice band intact. Any modern phone or recorder app clears this easily. The risk is a downstream step (a VoIP leg, a compressed voicemail, a re-export) that quietly narrows the band before the file ever reaches you.
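A quick header check catches the most obvious cases. The sketch below uses Python's standard wave module, so it reads WAV files only; the helper names and the synthetic test files are invented for illustration.

```python
import io
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # the floor recommended in this guide


def describe_wav(source) -> dict:
    """Read format details from a WAV file path or file-like object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "meets_16k_floor": rate >= MIN_SAMPLE_RATE_HZ,
        }


def silent_wav(sample_rate_hz: int, seconds: int = 1) -> io.BytesIO:
    """Create an in-memory mono 16-bit WAV of silence, used here as a stand-in file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * sample_rate_hz * seconds)
    buffer.seek(0)
    return buffer


print(describe_wav(silent_wav(8_000)))   # phone-rate audio fails the check
print(describe_wav(silent_wav(48_000)))  # a typical recorder setting passes
```

The header only reports the current sample rate. A phone call that was later re-exported at 48 kHz will pass this check even though its content still stops near 3.4 kHz, so treat a pass as necessary, not sufficient.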
Record one channel per speaker for cleaner speaker labels
Record one channel per speaker. Whether the file is nominally mono or stereo matters far less than whether each voice has its own track. When two people share one mixed track, the tool has to guess who spoke when. Give each speaker a dedicated channel and that guesswork disappears, because the channel itself marks who is talking. That's the gap between hoping for good speaker labels and building them in at capture.
Per-track recording bakes that in. Zoom can record a separate audio file for each participant, and ASR vendors report that multichannel recordings transcribe more accurately because speakers are pre-separated at capture. With one voice per channel, the model never has to pull overlapping speakers apart from a single mixed signal.
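If your recorder puts each speaker on one side of a stereo file, the channels can be pulled apart before transcription. This is a minimal sketch using only Python's standard library. It assumes an uncompressed PCM WAV whose samples are interleaved frame by frame, and the function name is invented for illustration.

```python
import io
import struct
import wave


def split_channels(source) -> list[bytes]:
    """Return one raw PCM byte string per channel of an interleaved WAV file."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        width = wav.getsampwidth()  # bytes per sample
        frames = wav.readframes(wav.getnframes())

    frame_size = width * channels  # one sample from every channel
    tracks = []
    for ch in range(channels):
        start = ch * width  # offset of this channel inside each frame
        tracks.append(b"".join(
            frames[i + start : i + start + width]
            for i in range(0, len(frames), frame_size)
        ))
    return tracks


# Synthetic two-channel file: the left channel holds 1s, the right holds 2s.
demo = io.BytesIO()
with wave.open(demo, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)
    wav.setframerate(16_000)
    wav.writeframes(struct.pack("<hh", 1, 2) * 4)
demo.seek(0)

left, right = split_channels(demo)
print(struct.unpack("<4h", left), struct.unpack("<4h", right))
```

Each returned byte string can be written back out as its own mono WAV and transcribed separately, so the speaker label comes from the channel index instead of a guess. For hour-long files a dedicated audio tool will be much faster than this pure-Python loop.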
For a two-person remote interview, per-track recording is the single biggest upgrade you can make. For an in-person conversation, a separate lav mic per speaker does the same job. The full interview-capture workflow, which covers close-miking each speaker and stating names up front, builds directly on this idea.
Noise and mic distance matter more than the codec
Once the format is sensible, the codec stops mattering and the room takes over. Recognition accuracy holds steady while the signal-to-noise ratio stays at 5 dB or above, then degrades sharply below it. In one evaluation, a model scored a word error rate (WER) of 0.12 on clean speech and 0.79 once noise and network distortion piled on. Background noise wrecks a transcript; the file type rarely does.
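Word error rate is the word-level edit distance between the transcript and a trusted reference, divided by the number of reference words. The sketch below is the textbook calculation with simple lowercase, whitespace-split tokenization, not the scoring script of any particular evaluation, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the ref words seen so far into hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from transcript
            insertion = current[j - 1] + 1  # extra word in transcript
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("take a sip of the water", "take a ship of water")
print(f"WER = {wer:.2f}")
```

The example has one substitution and one deletion against six reference words, a WER of about 0.33. On that scale, 0.12 means roughly one word in eight is wrong, and 0.79 means most of them are.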
Two things protect that signal-to-noise ratio: put the mic close to the speaker, and record somewhere quiet. A lav clipped near the mouth captures a far stronger signal than a laptop across the room. Bit depth barely moves the needle here. 16-bit audio already spans about 96 dB of dynamic range, far more than any speech recording uses. So there's little reason to reach for 24-bit for transcription alone.
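The 96 dB figure follows from the number of amplitude steps a sample can take. This short check uses the standard 20·log10 ratio for linear PCM; the function name is invented for illustration.

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear PCM with `bits` bits per sample."""
    return 20 * math.log10(2 ** bits)


for bits in (16, 24):
    print(f"{bits}-bit: {dynamic_range_db(bits):.1f} dB")
```

That works out to about 6 dB per bit: roughly 96 dB at 16-bit and 144 dB at 24-bit.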
Format sets a sensible floor, but most of your accuracy gains come from the recording itself: a close mic, a quiet room, and one channel per speaker. Get those right, and a modest MP3 will out-transcribe a pristine WAV captured across a noisy café.