The best audio format for transcription

A plain guide to sample rate, bitrate, lossless versus lossy, and channels, and which of them actually changes your transcript.

The short answer

For transcription, record lossless WAV or FLAC when you can, at 16 kHz or higher, 16-bit, with one channel per speaker. Modern engines like Whisper resample everything to 16 kHz anyway, so a clean 128 kbps MP3 transcribes almost as well. What actually moves accuracy is mic distance and low background noise, not the file format.

Transcription engines resample everything to 16 kHz

Before an AI model reads a single word, it standardizes your audio. Whisper, the engine behind much of today's automatic transcription, resamples all incoming audio to 16,000 Hz before it analyzes anything, and the official implementation hard-codes that 16 kHz sample rate. So whether you feed it a 96 kHz studio master or a phone memo, the model works from the same 16 kHz picture of your sound.
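If you want to hear what the engine hears, you can make the same conversion yourself. The sketch below only builds an ffmpeg command line; it assumes ffmpeg is installed, and the function name and file names are made up for illustration. It is not how any particular transcription service prepares audio.

```python
import shlex

def build_resample_command(src, dst, sample_rate=16000, channels=None):
    """Return an ffmpeg argument list that writes 16-bit PCM WAV at sample_rate.

    The function name and defaults are illustrative assumptions.
    """
    cmd = ["ffmpeg", "-i", src, "-ar", str(sample_rate)]
    if channels is not None:
        # Only force a channel count when asked, so per-speaker channels survive.
        cmd += ["-ac", str(channels)]
    cmd += ["-c:a", "pcm_s16le", dst]
    return cmd

# Hypothetical file names; print the command instead of running it.
print(shlex.join(build_resample_command("interview.m4a", "interview_16k.wav")))
```

Leaving the channel count alone is deliberate: mixing to mono would merge speakers you recorded on separate channels.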

That 16 kHz target isn't arbitrary. By the Nyquist–Shannon sampling theorem, a signal is fully captured when you sample at more than twice its highest frequency, so 16 kHz faithfully represents content up to roughly 8 kHz. Human speech, including the high hiss of consonants, sits almost entirely below that line. Recording at a higher sample rate won't sharpen the transcript. The engine throws the extra range away before it reads a word.
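The arithmetic behind that limit is short enough to check by hand. This is a minimal sketch with hypothetical function names; it shows the Nyquist limit for a given sample rate, and where a pure tone would land if it were sampled with no anti-alias filter. Real resamplers low-pass the audio first, so content above the limit is removed, not folded back.

```python
def nyquist_limit(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

def alias_frequency(tone_hz, sample_rate_hz):
    """Apparent frequency of a pure tone sampled with no anti-alias filter."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(nyquist_limit(16000))           # 8000.0
print(alias_frequency(6000, 16000))   # 6000, below the limit, so unchanged
print(alias_frequency(10000, 16000))  # 6000, above the limit, so it folds back
```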

Format sets the floor for transcript quality, and this guide stays on that axis: sample rate, bitrate, lossless versus lossy, and channels. Raising accuracy above that floor is a separate lever, mostly cutting room noise and verifying the draft, that we cover elsewhere.

The best audio format for transcription is lossless, but good lossy is close

Lossless and lossy describe what a codec discards. WAV stores every sample uncompressed. FLAC compresses losslessly, so decoding returns exactly the original data at a smaller file size. MP3 and M4A are lossy: they drop detail the ear is least likely to notice. For transcription, that dropped detail rarely matters until the bitrate gets low.

That floor sits low, though. In a controlled study, MP3 compression barely dented recognition accuracy down to about 24 kbps (Pollák & Behúnek, 2011). On a clean channel, their system scored 95.2% on uncompressed WAV, 93.2% at 160 kbps, and 89.3% at 24 kbps, then fell to 21.0% at 8 kbps. A 128 kbps or higher MP3 costs you almost nothing. Squeeze the file harder and you start stripping the signal the model needs.
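To see how flat the curve is before it falls off, you can turn those figures into a drop from the WAV baseline. The numbers below are the ones quoted above from Pollák & Behúnek (2011); the dictionary and function names are just for illustration.

```python
# Clean-channel accuracy (%) as quoted above from Pollák & Behúnek (2011).
ACCURACY_BY_FORMAT = {
    "WAV (uncompressed)": 95.2,
    "MP3 160 kbps": 93.2,
    "MP3 24 kbps": 89.3,
    "MP3 8 kbps": 21.0,
}

def drop_from_baseline(accuracies, baseline="WAV (uncompressed)"):
    """Percentage points lost relative to the baseline format."""
    base = accuracies[baseline]
    return {name: round(base - value, 1) for name, value in accuracies.items()}

for name, drop in drop_from_baseline(ACCURACY_BY_FORMAT).items():
    print(f"{name}: {drop} points below WAV")
```

From 160 kbps down to 24 kbps the loss stays within about six points; the last step to 8 kbps costs more than seventy.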

M4A files use AAC, the same perceptual-coding family as MP3, so the same rule applies: fine at a healthy bitrate, risky when squeezed hard. If you choose at capture, record WAV or FLAC and you never think about bitrate again. If your device only outputs compressed audio, keep it at 128 kbps or above and you'll be fine.

Phone-quality audio strips the detail models rely on

Not all recordings that reach 16 kHz are equal, and telephone audio is the cautionary case. Telephone voice is sampled at just 8,000 Hz under ITU-T G.711, and the channel is band-limited to roughly 300–3400 Hz under G.712. That passband was built for intelligible conversation. Machine transcription needs the detail it discards.

The trouble is what falls outside that band. Fricative consonants like /s/, /f/, and /sh/ carry significant energy between 3 and 8 kHz, and the Speech Intelligibility Index weights bands up to about 8 kHz. A 300–3400 Hz phone channel cuts most of that away. Strip the high-frequency detail that separates 'sip' from 'ship,' and the model has fewer cues to work with.
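A rough way to size that loss is to ask how much of the 3 to 8 kHz range fits inside the phone passband. The sketch below uses the band edges quoted above and a hypothetical helper name. It measures bandwidth on a plain Hz scale, not speech energy or perceptual weight, so treat the result as an illustration only.

```python
PHONE_PASSBAND_HZ = (300, 3400)    # G.712 passband quoted above
FRICATIVE_BAND_HZ = (3000, 8000)   # "between 3 and 8 kHz" from the text

def overlap_fraction(band, passband):
    """Share of `band` (by width in Hz) that falls inside `passband`."""
    low = max(band[0], passband[0])
    high = min(band[1], passband[1])
    kept = max(0, high - low)
    return kept / (band[1] - band[0])

kept = overlap_fraction(FRICATIVE_BAND_HZ, PHONE_PASSBAND_HZ)
print(f"{kept:.0%} of the 3-8 kHz range survives the phone channel")  # 8%
```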

In practice, record at 16 kHz or higher and keep the full voice band intact. Any modern phone or recorder app clears this easily. The risk is a downstream step, a VoIP leg, a compressed voicemail, a re-export, that quietly narrows the band before the file ever reaches you.
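A quick header check catches the obvious cases. This is a minimal sketch using Python's standard wave module; the helper names and the in-memory demo file are assumptions for illustration. It reads only what the header declares, so a phone call that was later upsampled to 16 kHz would still pass while carrying nothing above about 3.4 kHz.

```python
import io
import wave

MIN_SAMPLE_RATE_HZ = 16000

def describe_wav(source):
    """Read header fields from a WAV path or binary file-like object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,
            "meets_16k": rate >= MIN_SAMPLE_RATE_HZ,
        }

def make_silent_wav(sample_rate_hz, channels=1, seconds=1):
    """Build a short silent 16-bit WAV in memory, only for this demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * sample_rate_hz * channels * seconds)
    buffer.seek(0)
    return buffer

print(describe_wav(make_silent_wav(8000)))               # phone-rate file
print(describe_wav(make_silent_wav(48000, channels=2)))  # two-channel recorder file
```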

Record one channel per speaker for cleaner speaker labels

Record one channel per speaker. Mono or stereo barely matters. When two people share one mixed track, the tool has to guess who spoke when. Give each speaker a dedicated channel and that guesswork disappears, because the channel itself marks who is talking. That's the gap between hoping for good speaker labels and building them in at capture.

Per-track recording bakes that in. Zoom can record a separate audio file for each participant, and ASR vendors report that multichannel recordings transcribe more accurately because speakers are pre-separated at capture. With one voice per channel, the model never has to pull overlapping speakers apart from a single mixed signal.
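If your recorder puts each speaker on one side of a stereo file, separating them is mechanical. The sketch below assumes the samples are already decoded to integers and interleaved frame by frame (left, right, left, right); the function and variable names are hypothetical.

```python
def split_channels(interleaved_samples, channels=2):
    """Split interleaved PCM samples into one list per channel."""
    if len(interleaved_samples) % channels != 0:
        raise ValueError("sample count is not a multiple of the channel count")
    return [list(interleaved_samples[i::channels]) for i in range(channels)]

# Frames are stored as L0, R0, L1, R1, ...
stereo = [100, -7, 101, -8, 102, -9]
host, guest = split_channels(stereo)
print(host)   # [100, 101, 102]
print(guest)  # [-7, -8, -9]
```

Each resulting track holds one voice, so it can be transcribed and labeled on its own.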

For a two-person remote interview, per-track recording is the single biggest upgrade you can make. For an in-person conversation, a separate lav mic per speaker does the same job. The full interview-capture workflow, micing each speaker close and stating names up front, builds directly on this idea.

Noise and mic distance matter more than the codec

Once the format is sensible, the codec stops mattering and the room takes over. Recognition accuracy holds steady while the signal-to-noise ratio stays at 5 dB or above, then degrades sharply below it. In one evaluation, a model scored a word error rate (WER) of 0.12 on clean speech and 0.79 once noise and network distortion piled on. Background noise wrecks a transcript; the file type rarely does.
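Both measures in that paragraph are simple to compute. The sketch below uses the standard definitions: signal-to-noise ratio in decibels from RMS amplitudes, and word error rate as word-level edit distance divided by the reference length. The function names and sample sentences are illustrative, and real evaluations usually normalize case and punctuation first.

```python
import math

def snr_db(signal_rms, noise_rms):
    """Signal-to-noise ratio in dB from RMS amplitudes."""
    return 20 * math.log10(signal_rms / noise_rms)

def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Word-level Levenshtein distance, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

print(snr_db(1.0, 0.1))  # speech at ten times the noise amplitude: 20.0
print(word_error_rate("she sells sea shells", "she sells see shells"))  # 0.25
```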

Two things protect that signal-to-noise ratio: put the mic close to the speaker, and record somewhere quiet. A lav clipped near the mouth captures a far stronger signal than a laptop across the room. Bit depth barely moves the needle here. 16-bit audio already spans about 96 dB of dynamic range, far more than any speech recording uses. So there's little reason to reach for 24-bit for transcription alone.
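The 96 dB figure comes straight from the number of levels a sample can take. Here is the standard calculation as a short sketch (the function name is made up); it gives the theoretical range of linear PCM and ignores dither and converter noise.

```python
import math

def dynamic_range_db(bit_depth):
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)

for bits in (16, 24):
    print(f"{bits}-bit: {dynamic_range_db(bits):.1f} dB")
```

Each extra bit adds about 6 dB, so 24-bit buys headroom that a speech recording in an ordinary room does not use.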

Format sets a sensible floor, but most of your accuracy gains come from the recording itself: a close mic, a quiet room, and one channel per speaker. Get those right, and a modest MP3 will out-transcribe a pristine WAV captured across a noisy café.

Tips from people who do this a lot

Record WAV or FLAC when your device allows it. You never have to think about bitrate again, and the file decodes back to the exact original.
If you can only get MP3 or M4A, keep it at 128 kbps or higher. A controlled study showed accuracy barely moved above about 24 kbps but collapsed to 21% at 8 kbps.
Don't bother recording above 16 kHz for transcription alone. Engines like Whisper resample down to 16 kHz, so the extra range is discarded before analysis.
Watch for hidden downsampling. A VoIP call, a voicemail, or a re-compressed export can quietly narrow a good recording to phone-quality band.
One channel per speaker beats any format upgrade for who-said-what. Use per-track recording remotely, or a separate mic per person in the room.

Best audio format for transcription – questions, answered

What's the best audio format for transcription?

Lossless WAV or FLAC at 16 kHz or higher, 16-bit, with one channel per speaker. But a clean 128 kbps MP3 transcribes almost as well, because engines like Whisper resample all audio to 16 kHz before analysis. Format sets a floor; a close mic and a quiet room do the real work.

Does MP3 hurt transcription accuracy?

Only at very low bitrate. In a controlled 2011 study, MP3 recognition accuracy held near uncompressed WAV down to about 24 kbps (95.2% versus 89.3%), then collapsed to 21% at 8 kbps. At 128 kbps or higher, the loss versus lossless is negligible for most recordings.

What sample rate should I record at for transcription?

16 kHz is the practical minimum and the target most engines use. By the Nyquist theorem, 16 kHz captures everything up to 8 kHz, which covers human speech. Recording higher won't improve the transcript, since the model resamples to 16 kHz anyway, but it does no harm.

Should I record in mono or stereo?

Neither matters as much as channels per speaker. A single mixed track forces the tool to guess who spoke. Recording each speaker on a separate channel, via per-track Zoom recording or individual mics, gives cleaner speaker labels because voices are separated at capture, not untangled afterward.

Why does phone audio transcribe worse?

Telephone channels sample at 8 kHz and band-limit voice to roughly 300–3400 Hz under ITU-T standards. That cuts high-frequency detail: fricatives like /s/ and /f/ carry energy between 3 and 8 kHz. With that detail gone, a model has fewer cues to tell similar words apart.

References

1. Radford et al. (2022), Robust Speech Recognition via Large-Scale Weak Supervision – all audio re-sampled to 16 kHz – OpenAI / arXiv
2. OpenAI Whisper source (whisper/audio.py) – SAMPLE_RATE = 16000 – OpenAI
3. Shannon (1949), Communication in the Presence of Noise – Nyquist–Shannon sampling theorem – Proceedings of the IRE
4. ITU-T Recommendation G.711 – PCM of voice frequencies (8 kHz sampling) – ITU-T
5. ITU-T Recommendation G.712 – PCM channel passband (roughly 300–3400 Hz) – ITU-T
6. IETF RFC 9639 – Free Lossless Audio Codec (FLAC) – IETF
7. Pollák & Behúnek (2011), Accuracy of MP3 Speech Recognition Under Real-World Conditions – SCITEPRESS / SIGMAP 2011
8. Wiseman et al. (2025), The Speech Intelligibility Index: Tutorial and Applications – American Journal of Audiology
9. Monson et al. (2014), The Perceptual Significance of High-Frequency Energy in the Human Voice – Frontiers in Psychology
10. Kumalija & Nakamoto (2022), Performance evaluation of ASR on noise-network distorted speech – Frontiers in Signal Processing
11. Starting a computer recording – separate audio file per participant – Zoom Support
12. Using multichannel and speaker diarization – AssemblyAI
13. Audio bit depth – dynamic range of 16-bit PCM – Wikipedia / Analog Devices (Walt Kester)
