Why a long recording won't upload in one piece
The problem is a per-request cap, not your file being too big in any general sense. OpenAI's Speech-to-Text API limits uploads to 25 MB per file, and its own docs say a longer file has to be broken into chunks of 25 MB or less, or re-compressed. File size equals bitrate × duration, so at a typical bitrate near 128 kbps, 25 MB holds only about 26 minutes of audio. A one-hour interview already overflows that; a two-hour recording never stood a chance.
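The arithmetic is easy to check for your own files. A minimal sketch, assuming a constant bitrate and decimal megabytes (1 MB = 1,000,000 bytes); the function name is illustrative, not part of any API:

```python
def max_minutes(size_limit_mb: float, bitrate_kbps: float) -> float:
    """Minutes of constant-bitrate audio that fit under size_limit_mb."""
    size_bits = size_limit_mb * 1_000_000 * 8      # assumes decimal megabytes
    seconds = size_bits / (bitrate_kbps * 1000)    # size = bitrate x duration
    return seconds / 60

print(round(max_minutes(25, 128), 1))  # 26.0
print(round(max_minutes(25, 64), 1))   # 52.1
```

Halving the bitrate doubles the minutes that fit, which is why re-compressing is the other way under the size cap.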
There's a second, separate ceiling. On the gpt-4o-transcribe model, the API rejects any single request whose audio runs longer than about 1,500 seconds (25 minutes), and returns a 400 error naming the exact duration it refused. That limit is enforced by the live API rather than printed on the model's spec page, but you hit it the instant you send a long file. Whisper-1, by contrast, has no duration cap – only the 25 MB size limit.
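Since two caps apply at once, the number of pieces is set by whichever one bites harder. A rough sketch, using only the limits quoted above and assuming equal-length pieces at a constant bitrate; the table and function names are illustrative:

```python
import math

SIZE_LIMIT_BYTES = 25 * 1_000_000        # 25 MB upload cap (decimal MB assumed)
DURATION_LIMIT_S = {                     # per-request duration caps as described above
    "gpt-4o-transcribe": 1500,
    "whisper-1": None,                   # no duration cap, size limit only
}

def min_chunks(duration_s: float, size_bytes: int, model: str) -> int:
    """Smallest number of equal pieces that clears both caps."""
    by_size = math.ceil(size_bytes / SIZE_LIMIT_BYTES)
    cap = DURATION_LIMIT_S[model]
    by_duration = math.ceil(duration_s / cap) if cap else 1
    return max(by_size, by_duration, 1)

two_hours = 2 * 60 * 60
size = two_hours * 64_000 // 8           # bytes for two hours at 64 kbps

print(min_chunks(two_hours, size, "gpt-4o-transcribe"))  # 5
print(min_chunks(two_hours, size, "whisper-1"))          # 3
```

These are minimums. Cutting on silences makes pieces uneven, so it is safer to plan for one or two more pieces than the minimum.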
This is why feeding a long recording to a general chatbot stalls too. If you tried to hand a two-hour file to ChatGPT and it balked, you most likely met the same kind of size and duration limits from the other side. The fix doesn't change: the file has to be cut into pieces before the model will read it.
Chunking is how speech-to-text already works
Splitting a long file feels like a workaround. It isn't. Speech-to-text already reads audio in short windows: the Whisper model processes audio in fixed 30-second chunks, decoding each window while conditioning on the previously transcribed text. Every one-shot Whisper transcription you've ever seen was already chunked under the hood. Cutting a long file into 20-minute pieces is the same idea, one level up.
What matters is where you cut. Slice on a silence between sentences, never mid-word: a cut through a spoken word garbles the seam and can drop or double a syllable. If you split by hand, keep a few seconds of overlap between adjacent pieces, so no word falls into the gap between two chunks. Then transcribe every piece with the same model, so wording and formatting stay consistent across the whole recording.
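Choosing the cut points can be sketched as below. This assumes you already have a list of silent intervals in seconds from some silence detector (not shown here); `pick_cuts` and `chunk_ranges` are hypothetical helper names, and the strategy of cutting at the last silence that still fits is one reasonable choice among several:

```python
def pick_cuts(silences, total_s, max_len_s):
    """Return cut times at silence midpoints so no chunk exceeds max_len_s."""
    mids = sorted((a + b) / 2 for a, b in silences)
    cuts, start = [], 0.0
    while total_s - start > max_len_s:
        limit = start + max_len_s
        candidates = [m for m in mids if start < m <= limit]
        # Prefer the latest silence that still fits; otherwise fall back to a hard cut.
        cut = candidates[-1] if candidates else limit
        cuts.append(cut)
        start = cut
    return cuts

def chunk_ranges(cuts, total_s, overlap_s=0.0):
    """Turn cut times into (start, end) ranges, each starting overlap_s early."""
    edges = [0.0, *cuts, total_s]
    return [(max(0.0, a - overlap_s), b) for a, b in zip(edges, edges[1:])]

silences = [(590, 594), (1188, 1192), (1795, 1799), (2380, 2390)]
cuts = pick_cuts(silences, total_s=3000.0, max_len_s=1200)
print(cuts)  # [1190.0, 2385.0]
print(chunk_ranges(cuts, 3000.0, overlap_s=2.0))
```

With overlap, the same few words can appear at the end of one chunk and the start of the next, so the stitched transcript needs a quick check at each seam for doubled text.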
Re-offset the timestamps, then stitch
Each chunk restarts its clock at 00:00:00. That single detail breaks a naive stitch. When you transcribe piece three, its timestamps read 00:00:00 to 00:20:00, but that audio really lives at 00:40:00 to 01:00:00 in the source. Before you join anything, add each chunk's real start offset back to every timestamp inside it. Skip this and your two-hour transcript reads like six twenty-minute recordings stacked on top of each other, every citation past the first chunk pointing at the wrong moment.
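The re-offset itself is one addition per timestamp. A minimal sketch, assuming each transcribed segment is a dict with `start` and `end` in seconds plus `text`; that shape and the helper names are assumptions for illustration:

```python
def reoffset(segments, chunk_start_s):
    """Shift chunk-relative timestamps onto the source recording's timeline."""
    return [
        {**seg, "start": seg["start"] + chunk_start_s, "end": seg["end"] + chunk_start_s}
        for seg in segments
    ]

def hms(seconds):
    """Format seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

# Piece three of a recording cut into 20-minute chunks starts at 40 minutes.
piece_three = [{"start": 0.0, "end": 1200.0, "text": "..."}]
shifted = reoffset(piece_three, chunk_start_s=40 * 60)

print(hms(shifted[0]["start"]), hms(shifted[0]["end"]))  # 00:40:00 01:00:00
```

If the chunks were cut with overlap, the offset to add is the chunk's actual start in the source, not a multiple of the nominal chunk length.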
The order is fixed: split, transcribe, re-offset, then concatenate into one file. Done right, the seams vanish and the transcript reads as a single continuous timeline you can cite from. Doing all four steps by hand across six chunks is tedious and easy to get wrong, which is the whole argument for automation.
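The four steps in order can be sketched as a single loop. This assumes the audio is already split and that `transcribe` is whatever function you use to turn one chunk into segments with chunk-relative times; it is passed in as a parameter here, and `fake_transcribe` is a stand-in so the example runs without any service:

```python
def transcribe_long(chunks, transcribe):
    """chunks: (start_offset_s, audio) pairs already split under the caps."""
    timeline = []
    for offset, audio in sorted(chunks, key=lambda c: c[0]):  # source order
        for seg in transcribe(audio):                         # chunk-relative times
            timeline.append(
                {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
            )
    return timeline                                           # one continuous timeline

def fake_transcribe(audio):
    # Stand-in for a real speech-to-text call; returns one short segment.
    return [{"start": 0.0, "end": 5.0, "text": f"first words of {audio}"}]

result = transcribe_long([(1200.0, "part2"), (0.0, "part1")], fake_transcribe)
print([(seg["start"], seg["text"]) for seg in result])
# [(0.0, 'first words of part1'), (1200.0, 'first words of part2')]
```

Sorting by offset before transcribing keeps the output in source order even if the chunks arrive shuffled.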
A tool built for this takes the whole recording, chunks it under the caps, transcribes each piece, re-offsets the timestamps, and hands back one stitched file. That's the job an audio-to-transcript tool does: you upload a two-hour recording and get a single, correctly-timed document, without touching a seam yourself.
Long audio files fail in the quiet parts
On a long recording, the errors that bite are whole invented sentences in the silences, not the odd wrong word. A peer-reviewed audit of Whisper found that about 1% of transcriptions contained entirely hallucinated phrases, and 38% of those fabrications carried explicit harms like invented associations or false authority. Among the study's speakers with aphasia, those errors clustered where recordings had longer non-speech and silent stretches – exactly what a two-hour recording has more of.
So audit the quiet minutes first. A long pause, or a stretch of background noise while someone steps out of the room: that's where a model is most likely to fill the gap with fluent nonsense. Scan the transcript for lines that don't match what you remember from a lull, and check them against the audio. The deeper accuracy checks matter more on a long file than a short one, simply because there's more silence to trip on.
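One way to build that audit list is to flag every transcript segment that overlaps a long silence. A rough sketch, assuming you have silent intervals in seconds from a silence detector and segments with `start`, `end`, and `text`; the function name and the 3-second threshold are illustrative choices:

```python
def flag_for_audit(segments, silences, min_silence_s=3.0):
    """Return segments whose time range overlaps a silence of min_silence_s or more."""
    long_gaps = [(a, b) for a, b in silences if b - a >= min_silence_s]
    return [
        seg for seg in segments
        if any(seg["start"] < b and seg["end"] > a for a, b in long_gaps)
    ]

segments = [
    {"start": 10.0, "end": 14.0, "text": "So the budget was approved in March."},
    {"start": 62.0, "end": 66.0, "text": "Thank you for watching."},
]
silences = [(60.0, 70.0), (100.0, 101.0)]

for seg in flag_for_audit(segments, silences):
    print(seg["start"], seg["text"])  # 62.0 Thank you for watching.
```

Text that claims to have been spoken during a stretch the detector marked silent is worth checking against the audio first.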
You don't re-transcribe to fix this. You spot-check. Typing a transcript from scratch runs up to six hours of work per hour of audio, so retyping a two-hour recording can take up to twelve hours, a day and a half of work. Reading an AI draft against the audio and correcting the seams, silences, names, and numbers takes a fraction of that. And it's where your attention actually pays off.
Reading and exporting a two-hour transcript
A long transcript is only useful if you can move around it. Once the timestamps are correct and continuous, use them as an index. Mark the moments that matter: the answer where the argument turns, the number you'll quote, the exchange you'll pull. Then you jump to 01:14:20 instead of scrolling through 20,000 words. Timestamps turn a wall of text into something you can navigate by ear.
For the finished file, export to the format you'll actually work in: a document you can read, annotate, and cite from for writing and coding, or SRT or VTT if the recording is going out as captioned video. Keep the timestamped master, so any quote stays re-checkable against the source audio no matter how you slice the working copies later.
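Going from the timestamped master to captions is mostly a formatting step. A minimal sketch of an SRT writer, assuming segments with `start` and `end` in seconds on the source timeline plus `text`; the helper names are illustrative:

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        timing = f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}"
        blocks.append(f"{i}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

master = [{"start": 4460.5, "end": 4463.0, "text": "That's where the argument turns."}]
print(to_srt(master))
```

VTT is close to the same layout: the file starts with a `WEBVTT` header line and the timestamps use a period instead of a comma before the milliseconds.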