What does 'cheapest' really mean for transcription?
Cheapest splits three ways: dollars, time, and quality – and the option that wins on price often loses on the other two. Transcribing by hand runs up to six hours of work per hour of audio (Haberl et al., 2023). So 'free' typing is the most expensive method the moment you value your own time. The real question is which method is cheapest for your volume and your accuracy bar.
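To make that concrete, here's a rough sketch of the sum in Python. Only the six-hour manual figure comes from Haberl et al.; the $25 hourly value, the $1 tool price, and the one hour of review time are illustrative assumptions, so swap in your own.

```python
def total_cost(dollars: float, hours_of_work: float, hourly_value: float) -> float:
    """Out-of-pocket price plus the value of the time a method consumes."""
    return dollars + hours_of_work * hourly_value


# One hour of audio, with your time valued at a hypothetical $25/hour.
# Manual typing: $0 out of pocket, up to 6 hours of work (Haberl et al., 2023).
manual = total_cost(dollars=0.0, hours_of_work=6.0, hourly_value=25.0)

# A hypothetical AI tool: $1 out of pocket plus an assumed 1 hour of review.
ai = total_cost(dollars=1.0, hours_of_work=1.0, hourly_value=25.0)

print(f"manual: ${manual:.2f}, AI plus review: ${ai:.2f}")
```

With these made-up inputs, the 'free' method comes out several times more expensive than the paid one. The gap shrinks as your hourly value drops toward zero, which is the whole point.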
Price also hides a quality cost. A cheap or free transcript that gets names wrong, drops numbers, or garbles a quote isn't cheap. You pay for it in correction time, or worse, in a misquote that gets published. Accuracy is part of the cost, so read any 'free' claim as free in dollars, billed elsewhere.
Here's the ranking, cheapest first: open-source Whisper you run yourself, then capped free tiers, then per-minute AI, then pay-once AI, then paid human transcription. Each step up buys you less setup, more reliability, or both. The right choice depends on how much audio you have and how much a mistake costs you.
Is free, open-source transcription the cheapest way to transcribe audio?
In raw dollars, yes. OpenAI's Whisper ships its code and model weights under the MIT license, so you can download it and transcribe locally for nothing. The catch is everything around the model. You supply the computer, the setup, and the support. Free software is not the same as free transcription.
Bigger Whisper models are compute-hungry. The largest model needs about 10 GB of VRAM, and its published speeds are benchmarked on an A100 GPU – data-center hardware most people don't own. On a laptop CPU, a long recording can take longer to process than the audio runs. Cheap in dollars, slow in wall-clock time.
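One way to put a number on 'slow' is the real-time factor: processing time divided by audio length. Above 1.0, the job takes longer than the recording itself. The sketch below is a generic timing helper, not part of Whisper; `transcribe` stands in for whatever function you actually call, and the timings in the example are invented.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; above 1.0 is slower than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe: Callable[[str], object], audio_path: str,
                audio_seconds: float) -> float:
    """Time one call to a transcription function (a placeholder here) and return its RTF."""
    start = time.perf_counter()
    transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# Hypothetical numbers: a 60-minute recording that took 90 minutes to process.
rtf = real_time_factor(processing_seconds=90 * 60, audio_seconds=60 * 60)
print(f"real-time factor: {rtf:.2f}")
```

Time one short file on your own machine before committing a backlog to it. Multiply the factor by your total audio hours and you have a rough wall-clock estimate.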
Whisper also doesn't tell you who said what. Speaker labeling isn't in its feature list; you add it with a separate tool like WhisperX, which layers pyannote diarization on top. That's another install, another dependency, and more to break. If you need who-said-what for an interview, budget for setup time on top of compute.
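To show what that extra layer has to do, here's a simplified sketch of the merge step: give each transcript segment the speaker whose turn overlaps it most. This is not WhisperX's or pyannote's actual code, and the `Span` shape, the speaker names, and the sample lines are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: float  # seconds
    end: float    # seconds
    label: str    # transcript text, or a speaker name for a diarization turn


def overlap(a: Span, b: Span) -> float:
    """Seconds during which two spans overlap (0.0 if they don't)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def assign_speakers(segments: list[Span], turns: list[Span]) -> list[tuple[str, str]]:
    """Pair each transcript segment with the speaker turn that overlaps it most."""
    labeled = []
    for seg in segments:
        best = max(turns, key=lambda turn: overlap(seg, turn), default=None)
        if best is not None and overlap(seg, best) > 0:
            labeled.append((best.label, seg.label))
        else:
            labeled.append(("UNKNOWN", seg.label))  # no turn covers this segment
    return labeled


# Hypothetical outputs from a transcription pass and a diarization pass.
segments = [
    Span(0.0, 4.0, "How did the pilot go?"),
    Span(4.2, 9.0, "Better than expected."),
    Span(20.0, 22.0, "Thanks."),
]
turns = [Span(0.0, 4.1, "SPEAKER_00"), Span(4.1, 9.5, "SPEAKER_01")]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Even this toy version has an edge case, the segment no speaker turn covers. Real recordings add crosstalk and interruptions, which is where the setup time tends to go.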
And when it breaks, you're the support desk. There's no one to email about a failed run, a bad export, or a GPU driver conflict. For a developer who enjoys the tooling, that's fine. For a researcher on a deadline, the hours lost to setup can cost more than a paid tool ever would.
What's the catch with free transcription tiers?
Free tiers are real, but capped hard. A widely used free plan limits you to 300 transcription minutes a month, 30 minutes per file, and three lifetime file imports, with export restricted to mp3 and txt. That's fine for a short clip. It falls apart the moment you have a two-hour interview, or a stack of them.
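A quick way to see whether a batch fits is to check it against the three caps. The sketch below hard-codes the limits quoted above; the function name and the simple counting rules are assumptions, and a real service may meter things differently (for example, by truncating a long file rather than rejecting it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreeTier:
    monthly_minutes: int = 300       # transcription minutes per month
    max_minutes_per_file: int = 30   # per-file ceiling
    lifetime_imports: int = 3        # imported files, ever


def blockers(file_minutes: list[float], tier: FreeTier = FreeTier()) -> list[str]:
    """Return the caps a batch of imported files would hit (empty list means it fits)."""
    problems = []
    too_long = [m for m in file_minutes if m > tier.max_minutes_per_file]
    if too_long:
        problems.append(
            f"{len(too_long)} file(s) exceed {tier.max_minutes_per_file} min per file")
    if len(file_minutes) > tier.lifetime_imports:
        problems.append(
            f"{len(file_minutes)} files exceed {tier.lifetime_imports} lifetime imports")
    if sum(file_minutes) > tier.monthly_minutes:
        problems.append(
            f"{sum(file_minutes):g} min exceed {tier.monthly_minutes} min per month")
    return problems


print(blockers([12]))        # a short clip
print(blockers([120]))       # one two-hour interview
print(blockers([60] * 6))    # a stack of hour-long interviews
```

The short clip passes, the single two-hour interview trips the per-file cap, and the stack of interviews trips all three.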
The caps that hurt most are the per-file limit and the export lock. A 30-minute ceiling per file means a one-hour interview won't upload whole. And if you can only export txt, you lose the timestamps and the DOCX or SRT you actually need for coding or captions.
Diarization is usually the first thing a free tier drops or throttles. Without reliable speaker labels, a multi-person recording turns into an undifferentiated wall of text, and you're back to re-listening to sort out who said what. For a solo voice memo that's fine. For two or more speakers, that re-listening is where your time goes.
Is paying a human ever the cheaper option?
Rarely on price, sometimes on stakes. Certified human transcription is metered by the page: U.S. federal courts cap an ordinary transcript at $4.40 per original page (effective October 2024). At that rate, a long recording runs into the hundreds of dollars, orders of magnitude above AI. You're paying for a human's judgment and a certified record.
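Here's the arithmetic as a short sketch. The $4.40 cap is the figure above; the pages-per-hour number is an assumption for illustration only (page counts vary with speaking pace and formatting), as is rounding a partial page up to a full one.

```python
import math

RATE_CENTS_PER_PAGE = 440  # $4.40 per original page, the cap cited above


def per_page_cost(audio_hours: float, pages_per_audio_hour: float) -> float:
    """Dollar cost of a per-page transcript; partial pages are rounded up (assumption)."""
    pages = math.ceil(audio_hours * pages_per_audio_hour)
    return pages * RATE_CENTS_PER_PAGE / 100


# ASSUMPTION: 50 transcript pages per hour of audio. This is a placeholder,
# not a sourced figure; replace it with the page count from a real quote.
human = per_page_cost(audio_hours=2.0, pages_per_audio_hour=50)
print(f"two-hour recording at the per-page cap: ${human:.2f}")
```

Under that assumed page count, a two-hour recording lands in the hundreds of dollars. Halve or double the pages-per-hour guess and it still does.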
AI's low price does come with an accuracy caveat worth knowing. A PNAS study of five major ASR systems (Koenecke et al., 2020) found an average word error rate of 0.35 for Black speakers versus 0.19 for white speakers, nearly double. Accent, audio quality, and jargon all move the number. Cheap output still needs a human read-through where it counts.
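If you want to check a tool against your own audio, word error rate is simple to compute: the word-level edit distance (substitutions, deletions, insertions) between a trusted reference transcript and the tool's output, divided by the number of reference words. The sketch below is a minimal version of that standard definition, not the scoring pipeline the study used; it only lowercases and splits on whitespace, and the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the budget was fifteen thousand dollars"
hypothesis = "the budget was fifty thousand dollars"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

One wrong word out of six gives a WER of about 0.17, and it's the word that changes the meaning. A low error rate can still hide the mistake that matters, which is why the read-through is worth doing.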
The real trade-off is how much correction each option leaves you. We break down the full human-versus-AI picture (cost, accuracy, and provenance) separately. The short version: AI plus a focused human cleanup is cheapest at most volumes; a full human transcript is for when a certified, defensible record is the point.
So what's the cheapest reliable way to transcribe audio?
For real volume, pay-once AI is the cheapest option you can actually rely on. You skip the ~10 GB of VRAM Whisper's largest model needs and the A100-class GPU its published speeds assume, and you skip the monthly subscription that bills you between projects. It costs roughly a dollar per hour of audio, with speaker labels and real exports included and no setup required.
The pricing model matters as much as the rate. Pepys charges once and the credits never expire: one minute of audio is one credit, and unused credits sit there until your next project. For irregular, project-based work, that beats a subscription you forget to cancel and a free tier you outgrow in a week.
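A rough sketch of that comparison for project-based work follows. The dollar-an-hour rate is the approximate figure quoted above; the $15 monthly fee and the three-projects-a-year workload are hypothetical, chosen only to show the shape of the calculation.

```python
def pay_once_cost(audio_hours: float, dollars_per_audio_hour: float = 1.0) -> float:
    """Cost when you pay only for the audio you transcribe (~$1/hour, as quoted above)."""
    return audio_hours * dollars_per_audio_hour


def subscription_cost(months_billed: int, monthly_fee: float) -> float:
    """Cost of a flat subscription that bills whether or not you use it."""
    return months_billed * monthly_fee


def break_even_hours_per_month(monthly_fee: float,
                               dollars_per_audio_hour: float = 1.0) -> float:
    """Audio hours per month at which the subscription stops costing more than pay-once."""
    return monthly_fee / dollars_per_audio_hour


# Hypothetical year: three projects of ten audio hours each, and a
# hypothetical $15/month plan left running for all twelve months.
pay_once = pay_once_cost(audio_hours=3 * 10)
subscription = subscription_cost(months_billed=12, monthly_fee=15.0)
break_even = break_even_hours_per_month(monthly_fee=15.0)

print(f"pay-once: ${pay_once:.2f}, subscription: ${subscription:.2f}")
print(f"break-even: {break_even:.0f} audio hours per month")
```

Under these assumptions the subscription only catches up if you transcribe about fifteen hours every month. Irregular work rarely does.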
Is it the cheapest in absolute dollars? No – running Whisper yourself is, if your time is free and you enjoy the setup. Pay-once is the cheapest way to get a reliable, diarized, exportable transcript without becoming your own IT department. Match the method to what you're actually transcribing, and the cheap choice usually picks itself.