
Can ChatGPT transcribe audio?

What ChatGPT actually does with speech, and why a diarized, exportable transcript still needs a tool made for recordings.

The short answer

Partly. ChatGPT's consumer app can't take a direct audio upload – OpenAI's own docs list audio as an unsupported file type – so you can't drop in an MP3 and get a transcript. Voice Mode transcribes your live spoken conversation, and the API can transcribe short clips, but neither hands you a long, speaker-labeled, timestamped transcript file. For that, use a dedicated transcription tool.

Can ChatGPT transcribe audio you upload?

Not directly, at least not in the consumer app. OpenAI's own help docs list audio and video as unsupported upload types, so dropping an MP3, WAV, or M4A into a chat won't produce a transcript. The general per-file upload ceiling is 512MB, but that covers documents, spreadsheets, slides, and images. Audio isn't on the list.

The '25MB' figure you'll see in dozens of blog posts is real, but it belongs to the API, not the chat window. OpenAI's developer speech-to-text guide caps API audio uploads at 25MB and returns plain text by default. Even then, the current model rejects a single request longer than 1500 seconds (25 minutes), so a full interview has to be split up first.
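As a rough illustration of those two API limits, the sketch below checks a recording against them and plans chunk boundaries for anything too long. The constants come from the figures above. The function names, and the reading of 25MB as 25 × 1024 × 1024 bytes, are assumptions made for illustration, and nothing here calls OpenAI's API.

```python
import math

# Limits quoted above. Treating "25MB" as 25 * 1024 * 1024 bytes is an assumption.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
MAX_REQUEST_SECONDS = 1500  # 25 minutes


def fits_single_request(size_bytes, duration_s):
    """Return True if the file is within both the size cap and the duration cap."""
    return size_bytes <= MAX_UPLOAD_BYTES and duration_s <= MAX_REQUEST_SECONDS


def plan_chunks(duration_s, max_chunk_s=MAX_REQUEST_SECONDS):
    """Split [0, duration_s] into consecutive (start, end) windows of at most max_chunk_s."""
    if duration_s <= 0:
        return []
    count = math.ceil(duration_s / max_chunk_s)
    return [
        (i * max_chunk_s, min((i + 1) * max_chunk_s, duration_s))
        for i in range(count)
    ]
```

For a one-hour recording, plan_chunks(3600) yields three windows: 0 to 1500, 1500 to 3000, and 3000 to 3600 seconds. Each chunk still has to fit under the size cap on its own, and a real splitter would probably cut on pauses rather than fixed boundaries so that words are not sliced in half.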

So the honest answer splits in two. Through code, you can transcribe short clips. Through the app most people actually use, you can't upload audio at all. Turning a real recording into a speaker-labeled, exportable transcript is a job a general chat assistant simply isn't set up to do.

What about Voice Mode – doesn't it transcribe?

Voice Mode transcribes you, not your recordings. OpenAI describes Advanced Voice Mode as real-time, fluid conversation that even handles interruptions. You speak, it replies. When you exit, a transcription of that live exchange is added to your text chat, so there is a written record of what you both said.

That's a conversation transcript, not a transcription service. Voice Mode isn't built to ingest a played-back or uploaded recording and hand you back a faithful file. Hold your phone up to a lecture and you might capture fragments as chat text, but ChatGPT's turn-taking shapes them into an unlabeled, untimed blob. The room's actual dialogue gets lost.

Voice Mode won't replace a transcription tool. For a quick spoken note to yourself, it's fine. For an interview you need to quote accurately, it isn't. It's optimized for dialogue, not documentation, and the difference shows the moment you need an exact line.

What ChatGPT won't give you: speaker labels, timestamps, or a file

Even where OpenAI's models do transcribe, three things go missing by default: speaker labels, timestamps, and a downloadable file. Diarization, knowing who said what, isn't in the standard output. OpenAI shipped a separate model, gpt-4o-transcribe-diarize, specifically to add speaker segments, and it's API-only, absent from the chat app.

Timestamps tell the same story. They only appear when you pass a specific parameter, and per OpenAI's docs that parameter works only on the older whisper-1 model. Ask for a transcript in the chat and you get a wall of plain text: no who-said-what, no line you can jump back to in the audio to verify.

And there's no transcript export. Whatever text lands in the chat is copyable, but ChatGPT has no built-in way to hand you a TXT, DOCX, SRT, or VTT file. If you need captions as a timed SRT file, or a formatted document to code and cite from, you're rebuilding it by hand from pasted text.
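To make concrete what a timed SRT file involves, here is a minimal sketch that renders timestamped segments as SRT text. The segment shape (dicts with 'start', 'end', 'text', and an optional 'speaker') is a hypothetical structure chosen for illustration, not the output format of ChatGPT or any particular tool.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues, prefixing the speaker label when present."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if seg.get("speaker"):
            text = f"{seg['speaker']}: {text}"
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

The formatting itself is simple. The hard part is the input: without start and end times for each line, which plain chat text doesn't carry, there is nothing to feed a function like this.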

How accurate is OpenAI's transcription?

OpenAI's current speech-to-text is strong, but not flawless. gpt-4o-transcribe superseded the older Whisper model and posts a lower word error rate, and when the new models launched in March 2025, TechCrunch reported that they hallucinate less. That newer model is the engine behind OpenAI transcription today; raw Whisper no longer runs under the hood.

The caveat matters, because AI transcription can invent text nobody spoke. A peer-reviewed audit ran 13,140 audio segments through the original Whisper. It found roughly 1% held entirely fabricated phrases, and 38% of those hallucinations carried explicit harms (Koenecke et al., 2024). Newer models hallucinate less, but 'less' isn't 'never', which is why you read a transcript against the audio before quoting.

For scale, professional human transcribers hit about 5.9% word error rate on conversational speech (Xiong et al., 2016), a bar good ASR now approaches on clean audio. The real gap isn't raw accuracy anymore. It's everything around the words: who spoke when, and a file you can cite from.
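For readers who want to see what a word error rate figure measures, the sketch below computes it as word-level edit distance divided by the number of reference words. The lowercasing and whitespace splitting are simplifying assumptions. Published benchmarks apply their own text normalization rules, so numbers from this function are not directly comparable with the figures cited above.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a six-word sentence gives a WER of 1/6, or about 16.7%. A rate near 5.9% therefore means roughly one error every 17 words.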

So when do you actually need a dedicated tool?

One caution before you paste a sensitive transcript into ChatGPT: consumer chats may train the model. OpenAI's data-usage FAQ says consumer content may be used to train its models unless you opt out, while the API and business tiers are excluded by default. For IRB, privileged, or off-the-record audio, that default cuts the wrong way.

So, can ChatGPT transcribe audio? For a short clip through the API, or a quick spoken note in Voice Mode, yes. For a long interview, lecture, or deposition that has to be diarized, timestamped, and exported as a citable file, no. That's a different job, and doing it by hand instead can take up to six times the length of the audio (Haberl et al., 2023).

The faster path is to upload your recording to a purpose-built tool, get a speaker-labeled draft back in minutes, then clean only the quotes you'll publish. That's the full interview-transcription workflow, and it's where a transcription tool earns its keep over a chat window.

Tips from people who do this a lot

The 25MB limit you see everywhere is the API's, not the app's. Don't plan a workflow around uploading audio into the ChatGPT chat window, because it won't accept the file at all.
Voice Mode's chat transcript is shaped by turn-taking, not fidelity. Don't hold your phone up to an interview and expect a faithful record. What you get back is ChatGPT's version of the exchange, with the room's real dialogue smoothed away.
If you go the API route, the current model caps a single request at 25 minutes (1500 seconds). A one-hour interview has to be chunked, re-offset, and stitched back together to transcribe fully.
For sensitive audio, turn off training first: in consumer ChatGPT that's Settings, then Data Controls. Better yet, use a route excluded from training by default, like the API or a tool that never trains on your files.
Whatever the model, read the transcript against the audio before you quote. Hallucinated lines are rare but real, and they read as fluent, confident sentences that were never actually spoken.
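The "chunked, re-offset, and stitched" step from the API tip above can be sketched as follows. This assumes each chunk has already been transcribed into segments whose times are relative to the start of that chunk. The function name and the segment shape (dicts with 'start', 'end', and 'text') are hypothetical, and no API call is made here.

```python
def stitch_chunks(chunks):
    """Merge per-chunk segments into one timeline.

    `chunks` is a list of (chunk_start_s, segments) pairs. Each segment's
    'start' and 'end' are relative to the beginning of its own chunk.
    """
    merged = []
    # Process chunks in timeline order, whatever order they arrived in.
    for chunk_start, segments in sorted(chunks, key=lambda pair: pair[0]):
        for seg in segments:
            merged.append({
                **seg,
                # Shift chunk-relative times onto the full recording's clock.
                "start": seg["start"] + chunk_start,
                "end": seg["end"] + chunk_start,
            })
    return merged
```

A segment that starts 2 seconds into the chunk beginning at 1500 seconds ends up at 1502 seconds on the full timeline. Sentences cut in half at a chunk boundary would still need manual repair, which this sketch does not attempt.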


Can ChatGPT transcribe audio – questions, answered

Can you upload an MP3 to ChatGPT to transcribe it?

Not in the consumer app. OpenAI's help documentation lists audio and video as unsupported file types, so an MP3, WAV, or M4A won't upload into a chat for transcription. The 25MB audio limit people cite belongs to OpenAI's developer API, not the ChatGPT app you type into.

Does ChatGPT Voice Mode create a transcript?

It transcribes your live spoken conversation, but only while you're speaking to it. After you exit Voice Mode, a transcription of that exchange lands in your text chat. It won't take a played-back or uploaded recording and return a clean, speaker-labeled file. That's outside what a live dialogue feature is meant to do.

Can ChatGPT add speaker labels and timestamps?

No, not in the chat app. Speaker diarization comes from a separate API-only model, and timestamps require a parameter that works only on OpenAI's older whisper-1 model. In a normal chat you get plain text, with no who-said-what and no line-level timing to check against the audio.

Is it safe to transcribe sensitive audio in ChatGPT?

Be careful. OpenAI's data-usage FAQ says consumer ChatGPT content may be used to train its models unless you opt out, while the API and business tiers are excluded by default. For IRB, privileged, or off-the-record material, use a tool that doesn't train on your files.

What's the best way to transcribe a long recording, then?

Upload it to a dedicated transcription tool. You'll get a diarized, timestamped draft in minutes and can export it as TXT, DOCX, SRT, or VTT. That beats a chat window, which can't take the file, label speakers, timestamp lines, or hand you a document to cite from.

References

1. What types of files are supported? – OpenAI Help Center
2. File Uploads FAQ (512MB per-file limit) – OpenAI Help Center
3. Speech to text guide (25MB API limit; timestamp_granularities) – OpenAI Developer Docs
4. gpt-4o-transcribe audio length limits (1500-second cap) – OpenAI Developer Community
5. Voice Mode FAQ – OpenAI Help Center
6. GPT-4o Transcribe Diarize model (speaker labels, API-only) – OpenAI Developer Docs
7. GPT-4o Transcribe model (supersedes Whisper) – OpenAI Developer Docs
8. OpenAI upgrades its transcription and voice-generating AI models – TechCrunch (Kyle Wiggers, 2025)
9. Koenecke et al. (2024), Careless Whisper: Speech-to-Text Hallucination Harms – ACM FAccT 2024 (peer-reviewed)
10. Xiong et al. (2016), Achieving Human Parity in Conversational Speech Recognition – Microsoft Research (arXiv:1610.05256)
11. Haberl et al. (2023), Take the aTrain (manual transcription time cost) – arXiv / University of Graz
12. Data Usage for Consumer Services FAQ – OpenAI Help Center
