The cheapest way to transcribe audio

A straight cost breakdown for researchers and journalists weighing free DIY transcription against paying per minute – or paying once.

The short answer

The cheapest way to transcribe audio in raw dollars is open-source Whisper, which is free under an MIT license but needs your own GPU, setup, and time, and can't label speakers on its own. Free tiers cap you at short clips. For real volume, the cheapest reliable option is pay-once AI at roughly $1 an hour, with no subscription and no expiry.

What does 'cheapest' really mean for transcription?

Cheapest splits three ways: dollars, time, and quality – and the option that wins on price often loses on the other two. Transcribing by hand runs up to six hours of work per hour of audio (Haberl et al., 2023). So 'free' typing is the most expensive method the moment you value your own time. The real question is which method is cheapest for your volume and your accuracy bar.

Price also hides a quality cost. A cheap or free transcript that mislabels names, drops numbers, or garbles a quote isn't cheap. You pay for it later in correction time, or worse, in a wrong published quote. Accuracy is part of the cost, so read any 'free' claim as free in dollars, billed elsewhere.

Here's the ranking, cheapest first: open-source Whisper you run yourself, then capped free tiers, then per-minute AI, then pay-once AI, then paid human transcription. Each step up buys you less setup, more reliability, or both. The right choice depends on how much audio you have and how much a mistake costs you.

Is free, open-source transcription the cheapest way to transcribe audio?

In raw dollars, yes. OpenAI's Whisper ships its code and model weights under the MIT license, so you can download it and transcribe locally for nothing. The catch is everything around the model. You supply the computer, the setup, and the support. Free software is not the same as free transcription.

Bigger Whisper models are compute-hungry. The largest model needs about 10 GB of VRAM, and its published speeds are benchmarked on an A100 GPU – data-center hardware most people don't own. On a laptop CPU, a long recording can take longer to process than the audio runs. Cheap in dollars, slow in wall-clock time.
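
One way to check this on your own machine is to time a short clip and compare the processing time to the audio length. The sketch below is a minimal helper, not part of Whisper; the function name and the `clip.wav` file name are illustrative assumptions, and it works with any transcription callable that accepts a file path.

```python
import time
from typing import Callable


def real_time_factor(transcribe: Callable[[str], object],
                     clip_path: str,
                     clip_seconds: float) -> float:
    """Return processing time divided by audio length for one clip.

    A value above 1.0 means the run took longer than the audio itself.
    """
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    start = time.perf_counter()
    transcribe(clip_path)  # result is discarded; only the timing matters here
    elapsed = time.perf_counter() - start
    return elapsed / clip_seconds
```

With the openai-whisper package installed, loading a model with `whisper.load_model("base")` and passing its `transcribe` method, as in `real_time_factor(model.transcribe, "clip.wav", 300.0)`, would time a five-minute clip. The model is loaded outside the timed call, so the factor reflects transcription only.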

Whisper also doesn't tell you who said what. Speaker labeling isn't in its feature list; you add it with a separate tool like WhisperX, which layers pyannote diarization on top. That's another install, another dependency, and more to break. If you need who-said-what for an interview, budget for setup time on top of compute.

And when it breaks, you're the support desk. There's no one to email about a failed run, a bad export, or a GPU driver conflict. For a developer who enjoys the tooling, that's fine. For a researcher on a deadline, the hours lost to setup can cost more than a paid tool ever would.

What's the catch with free transcription tiers?

Free tiers are real, but capped hard. A widely used free plan limits you to 300 transcription minutes a month, 30 minutes per file, and three lifetime file imports, with export restricted to mp3 and txt. That's fine for a short clip. It falls apart the moment you have a two-hour interview, or a stack of them.

The caps that hurt most are the per-file limit and the export lock. A 30-minute ceiling per file means a one-hour interview won't upload whole. And if you can only export txt, you lose the timestamps and the DOCX or SRT you actually need for coding or captions.
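
The caps quoted above are easy to check against a batch of recordings before you upload anything. This is a small sketch, assuming each recording counts as one file import; the class and function names are illustrative and not part of any service's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FreeTierCaps:
    # Defaults mirror the free plan described in this guide.
    minutes_per_month: int = 300
    minutes_per_file: int = 30
    lifetime_imports: int = 3


def free_tier_problems(file_minutes: List[float],
                       caps: FreeTierCaps = FreeTierCaps()) -> List[str]:
    """List every cap that a batch of recordings would exceed."""
    problems = []
    too_long = [m for m in file_minutes if m > caps.minutes_per_file]
    if too_long:
        problems.append(
            f"{len(too_long)} file(s) exceed {caps.minutes_per_file} min per file")
    if sum(file_minutes) > caps.minutes_per_month:
        problems.append(
            f"total exceeds {caps.minutes_per_month} min per month")
    if len(file_minutes) > caps.lifetime_imports:
        problems.append(
            f"more than {caps.lifetime_imports} file imports")
    return problems
```

An empty list means the batch fits. A single 60-minute interview, for example, fails on the per-file cap alone.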

Diarization is usually the first thing a free tier drops or throttles. Without reliable speaker labels, a multi-person recording turns into an undifferentiated wall of text, and you're back to re-listening to sort out who said what. For a solo voice memo that's fine. For two or more speakers, that re-listening is where your time goes.

Is paying a human ever the cheaper option?

Rarely on price, sometimes on stakes. Certified human transcription is metered by the page: U.S. federal courts cap an ordinary transcript at $4.40 per original page (effective October 2024). At that rate, a long recording runs into the hundreds of dollars, orders of magnitude above AI. You're paying for a human's judgment and a certified record.
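
To see how a per-page rate compounds, here is a rough sketch of the arithmetic. The $4.40 page rate and the roughly $1-per-hour AI rate come from this guide; `pages_per_audio_hour` is an assumption you have to supply yourself, since page counts vary with speaking pace and formatting, and the function names are illustrative.

```python
def human_transcript_cost(audio_hours: float,
                          pages_per_audio_hour: float,
                          rate_per_page: float = 4.40) -> float:
    """Estimated dollar cost of a transcript billed per page."""
    pages = audio_hours * pages_per_audio_hour
    return round(pages * rate_per_page, 2)


def ai_transcript_cost(audio_hours: float,
                       rate_per_hour: float = 1.00) -> float:
    """Estimated dollar cost of AI transcription billed per audio hour."""
    return round(audio_hours * rate_per_hour, 2)
```

With a placeholder of 40 pages per audio hour (an illustration, not a figure from the cited rate table), a two-hour recording comes to 80 pages, or $352 at the capped rate, against about $2 for AI.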

AI's low price does come with an accuracy caveat worth knowing. A PNAS study of five major ASR systems (Koenecke et al., 2020) found an average word error rate of 0.35 for Black speakers versus 0.19 for white speakers, nearly double. Accent, audio quality, and jargon all move the number. Cheap output still needs a human read-through where it counts.
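
For reference, word error rate is the number of word-level substitutions, deletions, and insertions separating the machine transcript from a reference transcript, divided by the number of words in the reference. Below is a minimal sketch for spot-checking your own transcripts. The lowercase-and-split normalization is a simplifying assumption and not necessarily what the study used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A rate of 0.19 means roughly one word in five differs from the reference; 0.35 is closer to one in three.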

The real trade is how much correction each option leaves you. We cover the full human-versus-AI picture (cost, accuracy, and provenance) in a separate guide. The short version: AI plus a focused human cleanup is cheapest for most volume; a full human transcript is for when a certified, defensible record is the point.

So what's the cheapest reliable way to transcribe audio?

For real volume, pay-once AI is the cheapest option you can actually rely on. You skip the roughly 10 GB of VRAM that Whisper's largest model needs, along with the data-center GPUs its published speeds are benchmarked on, and you skip the monthly subscription that bills you between projects. The cost is roughly a dollar per hour of audio, with speaker labels and real exports included and no setup required.

The pricing model matters as much as the rate. Pepys charges once and the credits never expire: one minute of audio is one credit, and unused credits sit there until your next project. For irregular, project-based work, that beats a subscription you forget to cancel and a free tier you outgrow in a week.
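
Whether pay-once beats a subscription comes down to annual volume. Here is a minimal sketch of the break-even, assuming a flat monthly fee that covers all your audio and the roughly $1-per-hour rate from this guide. The monthly fee is a hypothetical input, not a quoted price, and the function names are illustrative.

```python
def break_even_hours_per_year(monthly_fee: float,
                              rate_per_hour: float = 1.00,
                              months_billed: int = 12) -> float:
    """Annual audio hours at which pay-once and subscription cost the same."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return (monthly_fee * months_billed) / rate_per_hour


def cheaper_option(audio_hours_per_year: float,
                   monthly_fee: float,
                   rate_per_hour: float = 1.00,
                   months_billed: int = 12) -> str:
    """Name the cheaper pricing model for a given annual volume."""
    pay_once = audio_hours_per_year * rate_per_hour
    subscription = monthly_fee * months_billed
    return "pay-once" if pay_once <= subscription else "subscription"
```

For a hypothetical $10 monthly plan, the break-even is 120 hours of audio a year. Below that volume, pay-once costs less.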

Is it the cheapest in absolute dollars? No – running Whisper yourself is, if your time is free and you enjoy the setup. Pay-once is the cheapest way to get a reliable, diarized, exportable transcript without becoming your own IT department. Match the method to what you're actually transcribing, and the cheap choice usually picks itself.

Tips from people who do this a lot

  • Before you commit to running Whisper locally, time a five-minute clip on your actual machine. If a CPU run takes longer than the audio itself, a paid tool is cheaper than your afternoon.

  • Check the export list before you check the minute cap. A free tier that only outputs txt strips your timestamps, the thing you need most for citation and coding.

  • If your recording has two or more speakers, price in diarization from the start. Free and DIY options usually make you bolt it on, and that's where the hidden time goes.

  • Count your annual volume, not your monthly. Irregular, project-based work almost always costs less on pay-once credits than on a subscription that idles between deadlines.

  • For anything that might be quoted or contested, budget a human read-through of the AI draft. Error rates climb with accent, crosstalk, and jargon, exactly where accuracy matters.


Cheapest way to transcribe audio – questions, answered

What is the cheapest way to transcribe audio?

In raw dollars, open-source Whisper is cheapest. It's free under an MIT license and runs locally. But it needs your own GPU, setup, and time, and can't label speakers by itself. For reliable, diarized transcripts at real volume, pay-once AI at about a dollar an hour is the cheapest dependable option.

Is free transcription software any good?

For a clean, single-speaker clip, yes. Open-source Whisper is accurate for a free tool. The limits show up with volume and speakers: its largest model wants about 10 GB of VRAM, it has no built-in diarization, and there's no support when a run fails. Free software still costs setup and hardware time.

Why not just use a free online tier?

Free tiers work until you hit the caps. A common one allows 300 minutes a month, 30 minutes per file, and three lifetime imports, exporting only mp3 and txt. That's fine for a short memo. A two-hour interview won't fit, and you lose the timestamps and the DOCX or SRT exports that real work needs.

Is AI transcription accurate enough to trust the cheap option?

Usually, with a caveat. A 2020 PNAS study of five major ASR systems found word error rates of 0.35 for Black speakers versus 0.19 for white speakers, nearly double. Accent, audio quality, and jargon all matter. Treat a cheap AI transcript as a strong first draft, then read the quotes that count.

How much does human transcription cost by comparison?

Far more. U.S. federal courts cap a certified ordinary transcript at $4.40 per page, so a long recording runs into the hundreds of dollars, orders of magnitude above AI. Humans are worth it when you need a certified, defensible record, not routine speed at the lowest price.

References

  1. Whisper – code and model weights released under the MIT License (VRAM/model table; A100-benchmarked speeds). OpenAI (official GitHub repo).
  2. WhisperX – multispeaker diarization added to Whisper via pyannote-audio. WhisperX (m-bain, official repo).
  3. Haberl et al. (2023), Take the aTrain – up to six hours of manual work per hour of audio, citing Bell et al. (2018). Behavior Research Methods / arXiv:2310.11967.
  4. Koenecke et al. (2020), Racial disparities in automated speech recognition (primary study). Proceedings of the National Academy of Sciences (PNAS).
  5. Word error rate 0.35 for Black speakers vs 0.19 for white speakers (accessible confirming release). EurekAlert! / AAAS (March 23, 2020).
  6. Free-plan transcription limits (300 min/month, 30 min/file, 3 lifetime imports, mp3/txt export), verified 2026-07-04. Otter.ai (official pricing page).
  7. Maximum Transcript Rates – ordinary (30-day) original transcript $4.40 per page, effective October 1, 2024. U.S. District Court for the District of Columbia (dcd.uscourts.gov).
