Pepys

What is speech-to-text?

A plain-English explainer of automatic speech recognition – what it is, how it works, and why the raw output isn't a finished transcript.

The short answer

Speech-to-text (STT), also called automatic speech recognition (ASR), is the automatic conversion of spoken audio into written words. Modern systems are trained neural models that learn the audio-to-text mapping directly. STT produces the raw text; a transcript is the finished, formatted artifact you edit, label by speaker, and cite. Accuracy depends heavily on your recording quality.

Speech-to-text, defined

Speech-to-text (STT) is the automatic conversion of spoken language into written text. The standard NLP textbook by Jurafsky and Martin frames the task of automatic speech recognition (ASR) as mapping a speech waveform to the right string of words. STT and ASR name the same thing: audio in, text out.

The two terms mostly split by context. Engineers and researchers say ASR; product pages and everyday users say speech-to-text. Either way, the job is narrow and specific – recognize the words that were spoken, in order, and write them down.

That narrowness matters. STT recognizes words. On its own, it does not decide who was speaking, add punctuation you'll be happy with, or strip filler. Those are separate jobs layered on top, which we'll get to.

How speech-to-text actually works

Modern speech-to-text runs on trained neural networks, not the older assembly of separate hand-built parts. For decades, ASR stitched together a distinct acoustic model, a pronunciation dictionary, and a language model. Today's systems learn the mapping from audio to text more directly, from very large amounts of example data.

Jurafsky and Martin group current methods into two families. The first is the encoder-decoder paradigm used by systems like OpenAI's Whisper, which was trained on roughly 680,000 hours of audio paired with transcripts collected from the internet. The second is self-supervised speech models such as wav2vec 2.0 and HuBERT, which first learn useful representations from raw, unlabeled audio and are then fine-tuned on a smaller labeled set.

One thing STT does not do by itself: tell you who spoke when. That's a separate task called speaker diarization, often run as its own step alongside recognition. STT gives you the words; diarization attaches a speaker label to each stretch of audio.
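To make the split between recognition and diarization concrete, here is a minimal Python sketch of the merge step. It assumes the recognizer returns words with start and end times and the diarizer returns speaker turns with start and end times; the `Word` and `Turn` classes, the `label_words` helper, and the sample data are all hypothetical, not any particular product's output format.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float

@dataclass
class Turn:
    speaker: str
    start: float
    end: float

def overlap(word: Word, turn: Turn) -> float:
    """Seconds during which the word and the speaker turn coincide."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))

def label_words(words, turns):
    """Pair each recognized word with the speaker turn that overlaps it most."""
    labeled = []
    for word in words:
        best = max(turns, key=lambda t: overlap(word, t), default=None)
        if best is None or overlap(word, best) == 0.0:
            labeled.append(("UNKNOWN", word.text))  # no turn covers this word
        else:
            labeled.append((best.speaker, word.text))
    return labeled

# Hypothetical output of the two separate steps.
words = [Word("how", 0.0, 0.3), Word("are", 0.3, 0.5), Word("you", 0.5, 0.9),
         Word("fine", 1.2, 1.6), Word("thanks", 1.6, 2.1)]
turns = [Turn("SPEAKER_1", 0.0, 1.0), Turn("SPEAKER_2", 1.1, 2.2)]

labeled = label_words(words, turns)
for speaker, text in labeled:
    print(speaker, text)
```

Picking the turn with the largest overlap is only one simple policy. It does nothing clever about crosstalk, where two turns cover the same word, and that is one reason speaker labels still need a human check.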

Speech-to-text vs transcription

It helps to separate the technology from the thing you actually keep. Speech-to-text is the technology – the automated recognition step. A transcript is the artifact – the finished, readable file, usually with speakers labeled, timestamps attached, and the obvious errors cleaned up.

Put plainly: STT produces raw text; transcription is the whole job of turning that text into a document you can quote, cite, or caption from. The raw STT output is a first draft you edit into the deliverable. If you want to run recognition on your own file right now, that's what a speech-to-text converter does; turning the result into a polished, attributable transcript is the editing pass that follows.
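As a rough illustration of the gap between raw output and a readable document, the Python sketch below merges consecutive segments from the same speaker and adds a timestamp to each turn. The segment dictionaries, their field names, and the `render_transcript` helper are assumptions made for this example; real tools use their own formats, and correcting the words themselves is still manual work.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def render_transcript(segments):
    """Merge back-to-back segments from one speaker, then format one line per turn."""
    turns = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            turns[-1]["text"] += " " + text  # same speaker keeps talking
        else:
            turns.append({"speaker": seg["speaker"], "start": seg["start"], "text": text})
    return [f"[{format_timestamp(t['start'])}] {t['speaker']}: {t['text']}" for t in turns]

# Hypothetical raw segments: already recognized and speaker-labeled, not yet edited.
segments = [
    {"speaker": "Interviewer", "start": 0.0, "text": "thanks for joining"},
    {"speaker": "Interviewer", "start": 2.4, "text": "shall we start"},
    {"speaker": "Guest", "start": 65.2, "text": "happy to be here"},
]

lines = render_transcript(segments)
print("\n".join(lines))
```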

The distinction sounds pedantic, but it matters. Confusing the two is why people expect a raw machine transcript to read like a finished document, then feel let down when it doesn't.

Where speech-to-text slips

Accuracy depends far more on your audio than on which model you pick. On clean, read-aloud speech the numbers look excellent; on real conversation they drop. Two people talking over each other, with accents and background noise, is genuinely the hard case for any recognizer.

The famous near-99% figures come from easy benchmarks. LibriSpeech, a widely used test set, is read English speech from public-domain audiobooks – people reading aloud in good conditions. Spontaneous conversation is harder: on the Switchboard benchmark, a 2016 Microsoft Research system and professional human transcribers both landed near 5.9% word error rate, or about six word errors for every hundred words spoken. A benchmark score is not a promise about your file. For the full picture, see how accurate AI transcription really is.
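Word error rate is simple enough to compute yourself on a short sample of your own audio. It is the number of word substitutions, deletions, and insertions needed to turn the machine output into a trusted reference transcript, divided by the number of words in the reference. The Python sketch below is a minimal version that assumes lowercase, whitespace-separated text with no punctuation; the `word_error_rate` name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a reference word is missing from the output
                curr[j - 1] + 1,     # insertion: the output has an extra word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: one misheard word ("bored") and one dropped word ("at").
reference = "we met the board on tuesday at nine"
hypothesis = "we met the bored on tuesday nine"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")  # 2 errors over 8 reference words -> 25.0%
```

Published scores normalize casing, punctuation, and number formats before comparing, so a figure from a quick script like this will not line up exactly with a benchmark number.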

There's a subtler failure worth knowing. STT can produce fluent text that was never spoken. In one peer-reviewed audit, roughly 1% of Whisper transcriptions contained entirely invented phrases absent from the audio. Fluent output isn't the same as faithful output, which is exactly why you read the draft against the recording.

Treat the machine output as a fast first draft. Check the spots STT struggles with – names, jargon, numbers, crosstalk – against the audio, add speaker labels and timestamps, and you've turned raw recognition into a transcript you can stand behind. If your file is an interview, the full interview-transcription workflow walks through exactly that.
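One way to focus that check is to have a script point at the risky spots before you listen back. The Python sketch below flags numbers, terms from a jargon list you supply, and capitalized words that appear mid-sentence as possible names. The rules are crude heuristics chosen for illustration and the `flag_for_review` helper is hypothetical; it cannot detect crosstalk, and it narrows down where to listen rather than replacing the listening.

```python
def flag_for_review(text: str, jargon=()):
    """Return (token index, word, reason) for spots worth checking against the audio."""
    jargon_terms = {term.lower() for term in jargon}
    tokens = text.split()
    flags = []
    for i, token in enumerate(tokens):
        word = token.strip(".,;:!?\"'()")
        if not word:
            continue
        if any(ch.isdigit() for ch in word):
            flags.append((i, word, "number"))
        elif word.lower() in jargon_terms:
            flags.append((i, word, "jargon"))
        elif i > 0 and word[0].isupper() and not tokens[i - 1].endswith((".", "!", "?")):
            # Capitalized, but not at the start of a sentence: possibly a name.
            flags.append((i, word, "possible name"))
    return flags

# Hypothetical draft text and jargon list.
draft = "We paid 4200 dollars to Amara Okafor for the spirometry kit. Then we left."
flags = flag_for_review(draft, jargon=["spirometry"])

for index, word, reason in flags:
    print(f"token {index}: {word} ({reason})")
```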

Tips from people who do this a lot

  • If a vendor quotes a single accuracy number, ask what audio it was measured on. A read-speech benchmark score won't hold up on a noisy three-person meeting.

  • STT and ASR are the same thing – don't let the two labels make you think a product is doing something extra. Speech-to-text is the recognition step, full stop.

  • Raw speech-to-text output has no reliable speaker labels. If you need who-said-what, look for diarization as a named, separate feature.

  • Re-read machine transcripts for invented phrases, not just misheard words. Fluent-but-fake lines are the ones that slip past a quick skim.

  • A polished transcript is editing work on top of recognition. Budget time for the cleanup pass, not just the upload.

What is speech-to-text – questions, answered

What is speech-to-text?

Speech-to-text (STT) is the automatic conversion of spoken language into written text. It's the same task researchers call automatic speech recognition, or ASR: audio goes in, a string of words comes out. Modern STT runs on trained neural models rather than the older hand-built acoustic-and-language-model pipelines.

Is speech-to-text the same as transcription?

Not quite. Speech-to-text is the technology that recognizes spoken words; transcription is the finished artifact – a readable file with speakers labeled, timestamps added, and errors cleaned up. Raw STT output is a first draft. Turning it into a citable, publishable transcript takes an editing pass on top.

How accurate is speech-to-text?

It depends heavily on the audio. On clean, read-aloud benchmarks, accuracy looks near-perfect. On spontaneous conversation it drops: on the Switchboard test set, a 2016 Microsoft Research system and professional human transcribers both scored near 5.9% word error rate. A published benchmark score is not a promise about your specific recording.

What's the difference between STT and ASR?

None, really. ASR (automatic speech recognition) is the technical term used in research and engineering; speech-to-text is the plain-language name used on product pages and in everyday use. Both describe the same job: mapping a speech waveform to the words that were spoken.

Can speech-to-text make things up?

Yes. Beyond simple mishearing, neural STT can output fluent phrases that were never spoken. In one peer-reviewed audit, about 1% of Whisper transcriptions contained entirely invented content absent from the audio. That's why you should read any machine transcript against the recording before quoting it.

References

  1. Jurafsky & Martin, Speech and Language Processing (3rd ed. draft), Ch. 15 'Automatic Speech Recognition'. Stanford University (jurafsky/slp3).
  2. Radford et al. (2022), 'Robust Speech Recognition via Large-Scale Weak Supervision' (Whisper). OpenAI / arXiv.
  3. Baevski, Zhou, Mohamed, Auli (2020), 'wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations'. Facebook AI Research / arXiv.
  4. Park et al. (2022), 'A Review of Speaker Diarization: Recent Advances with Deep Learning'. Computer Speech & Language (Elsevier) / arXiv.
  5. Xiong et al. (2016), 'Achieving Human Parity in Conversational Speech Recognition' (Switchboard). Microsoft Research / arXiv.
  6. Panayotov, Chen, Povey, Khudanpur (2015), 'LibriSpeech: an ASR corpus based on public domain audio books' (OpenSLR resource 12). OpenSLR / IEEE ICASSP 2015.
  7. Koenecke et al. (2024), 'Careless Whisper: Speech-to-Text Hallucination Harms'. ACM FAccT 2024 (peer-reviewed).
