Pepys

Transcription glossary

The words that come up around transcription and captions, each defined in a sentence or two. 36 terms.

#

16 kHz Downsampling
16 kHz downsampling means converting audio to a sample rate of 16,000 samples per second, a common standard for speech recognition. Because a sample rate can represent frequencies only up to half its value, 16 kHz audio preserves sound up to 8 kHz, which covers the frequencies that carry most of the information in speech. Higher-quality recordings are routinely reduced to 16 kHz before transcription with no meaningful loss of speech detail.
See also: Best audio format for transcription

A

Acoustic Model
The acoustic model is the part of a speech recognition system that maps the raw audio signal to the basic units of speech sound, such as phonemes. It learns what different sounds look like in the audio so the system can guess which sounds were spoken. Its output is then combined with a language model to produce the final text.
See also: What is speech-to-text?
ASR Hallucination
An ASR hallucination is text that a speech recognition system outputs even though those words were never spoken in the audio. It often appears during silence, background noise, or unclear speech, where the model invents plausible-sounding but fabricated phrases. Because the fake text reads fluently, hallucinations can be harder to spot than ordinary misheard words.
See also: How accurate is AI transcription?
Audio description
Audio description is a spoken narration track that describes the key visual elements of a video, such as actions, settings, facial expressions, and on-screen text, for people who are blind or have low vision. The narration is inserted during natural pauses in the dialogue so it does not talk over what is being said. It is a separate accessibility feature from captions, which cover the audio rather than the visuals.
See also: Transcription vs captions vs subtitles

B

Batch (offline) transcription
Batch transcription, also called offline transcription, processes a complete pre-recorded audio or video file all at once rather than live. Because the system can use the whole recording as context and take the time it needs, batch results are usually more accurate than real-time ones. It is the standard approach for interviews, podcasts, and any file you upload after recording.
See also: Types of transcription

C

Caption
A caption is on-screen text that shows the spoken dialogue and often the relevant sounds in a video, timed to appear as they happen. Captions are written in the same language as the audio and are mainly meant for viewers who are watching without sound or who are deaf or hard of hearing.
See also: Transcription vs captions vs subtitles
CAQDAS
CAQDAS stands for Computer-Assisted Qualitative Data Analysis Software, a category of tools that help researchers organize, tag, and find patterns in unstructured material like interview transcripts, field notes, and open-ended survey answers. Well-known examples include NVivo, ATLAS.ti, and MAXQDA. Transcripts are frequently imported into these programs so researchers can code and analyze what people said.
CEA-608 and CEA-708
CEA-608 and CEA-708 are the two U.S. broadcast closed-caption standards. CEA-608, also called line 21 captions, is the older analog standard, typically shown as white text on a black background with limited formatting, while CEA-708 is the digital-television standard that adds a wider choice of fonts, colors, sizes, and positioning. A CEA-708 stream can carry CEA-608 data inside it for backward compatibility.
See also: Caption accuracy standards
Clean Verbatim
Clean verbatim is a transcription style that keeps what the speaker meant while removing distractions like filler words (um, uh), false starts, repetitions, and stutters. The result reads more smoothly than full verbatim and is the most common choice for interviews, podcasts, and business content.
See also: Verbatim vs clean verbatim
Closed Captions
Closed captions are captions the viewer can turn on or off, because the text is stored separately from the video rather than burned into the picture. The "closed" part means the captions stay hidden until someone chooses to display them.
See also: Open vs closed captions
Confidence score
A confidence score is a number, usually between 0 and 1, that a speech-to-text system attaches to a word or segment to show how sure it is that the transcription is correct. Low scores flag places where the audio was unclear or the model was uncertain, so a human can review them first. A confidence score reflects the model's own certainty, not a guarantee that the word is actually right.
See also: How accurate is AI transcription?
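As a rough illustration of reviewing low scores first, the sketch below keeps the words that fall under a threshold and sorts them from least to most confident. The `Word` type, the sample scores, and the 0.6 cutoff are assumptions made for this example, not the output format of any particular speech-to-text system.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # assumed to lie between 0 and 1


def words_to_review(words, threshold=0.6):
    """Return words scoring below the threshold, least confident first."""
    flagged = [w for w in words if w.confidence < threshold]
    return sorted(flagged, key=lambda w: w.confidence)


# Hypothetical scores for a short segment.
segment = [
    Word("please", 0.98),
    Word("call", 0.95),
    Word("Stella", 0.41),
    Word("tomorrow", 0.57),
]

for word in words_to_review(segment):
    print(f"{word.text}: {word.confidence:.2f}")
```

A sensible threshold depends on the system and the audio, so it is usually tuned against a sample that a person has already checked.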

D

Diarization Error Rate (DER)
Diarization error rate (DER) measures how well a system identifies who spoke and when, and a lower number is better. It is the diarization counterpart to word error rate: the time wrongly labeled as speech (false alarm), the speech the system missed, and the speech assigned to the wrong speaker (confusion) are added together and divided by the total reference speech time.
See also: What is speaker diarization?
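The calculation can be sketched in a few lines. The function name and the durations below are hypothetical; real evaluation tools also have to work out the three error durations by comparing system output against a reference, which this example takes as given.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_reference):
    """DER = (false alarm + missed speech + speaker confusion) / reference speech time.

    All arguments are durations in seconds.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


# Hypothetical recording with 600 s of reference speech.
der = diarization_error_rate(
    false_alarm=12.0, missed=18.0, confusion=30.0, total_reference=600.0
)
print(f"DER: {der:.1%}")  # DER: 10.0%
```

Because false-alarm time is counted on top of the reference speech, DER can exceed 100% on very poor output.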

F

Forced alignment
Forced alignment is the process of matching an existing transcript to its audio so each word (or sound) gets a precise start and end time. Software uses a speech model to line the known text up against the recording, producing timestamps without anyone marking them by hand. It is how tools turn a plain transcript into a timestamped one for captions and phonetic analysis.
See also: What is a timestamped transcript?

H

Hybrid Transcription
Hybrid transcription combines machine transcription with human review: software produces a first draft, then a person listens to the audio and corrects the errors. This approach aims for accuracy close to fully human transcription while being faster and cheaper than transcribing from scratch. It is a common middle ground for work where quality matters but budget is limited.
See also: AI vs human transcription, Types of transcription

I

Intelligent Verbatim
Intelligent verbatim is another name for clean verbatim, a style that lightly edits speech for readability by dropping filler words, stammers, and non-essential repetition without changing the meaning. It sits between word-for-word full verbatim and a fully paraphrased summary.
See also: Verbatim vs clean verbatim
IPA (International Phonetic Alphabet)
The International Phonetic Alphabet is a standardized set of symbols where each symbol represents one distinct speech sound, so a word can be written the same way regardless of its spelling or the language it comes from. It is the common notation used in phonetic transcription, dictionaries, and language learning to show exact pronunciation. For example, the "th" sound in "think" is written with the symbol θ.
See also: Transcribe accented English

L

Language Model (in ASR)
In automatic speech recognition, the language model is the component that predicts how likely a sequence of words is, so the system can pick the most plausible wording when sounds are ambiguous. It captures grammar and common phrasing, which is why an ASR system will favor "recognize speech" over the similar-sounding "wreck a nice beach." It works alongside the acoustic model, which handles the raw sound.
See also: What is speech-to-text?
Lossless Audio
Lossless audio is a way of storing or compressing sound that keeps every bit of the original recording, so nothing is thrown away. Formats such as WAV, FLAC, and ALAC are lossless, unlike MP3, which discards some data to shrink the file. Lossless files are larger but preserve the cleanest possible input for transcription.
See also: Best audio format for transcription

M

Machine Transcription
Machine transcription is the automatic conversion of speech into text by software, with no human typing the words. It is done by automatic speech recognition (ASR) systems and is fast and inexpensive, though its accuracy varies with audio quality, accents, and background noise. It is the opposite of human transcription, where a person listens and types.
See also: AI vs human transcription, Types of transcription

N

Non-speech elements
Non-speech elements are the sounds in a recording that are not spoken words but still carry meaning, such as laughter, applause, music, background noise, or a phone ringing. In verbatim transcripts and captions they are noted in brackets, like [laughter] or [music], so a reader understands what is happening in the audio. Clean transcripts usually leave most of them out.
See also: Verbatim vs clean verbatim

O

Open Captions
Open captions are captions permanently burned into the video image, so they are always visible and cannot be switched off. Because they are part of the picture itself, every viewer sees them the same way regardless of the player or device.
See also: Open vs closed captions

P

Phonetic transcription
Phonetic transcription writes down the actual sounds of speech rather than its standard spelling, capturing how words are pronounced. It is used in linguistics, language teaching, and accent study, and usually relies on a fixed set of sound symbols like the International Phonetic Alphabet. This differs from ordinary transcription, which records the words themselves in normal spelling.
See also: Types of transcription

R

Real-time (streaming) transcription
Real-time transcription, also called streaming transcription, converts speech to text as it is being spoken, with only a short delay. It is used for live captions, meetings, and voice assistants, where the text has to appear right away. Because it cannot hear what comes next, it is generally a little less accurate than transcribing a finished recording.
See also: What is speech-to-text?

S

Sample Rate
Sample rate is the number of times per second that an audio signal is measured when it is converted to a digital file, expressed in hertz (Hz) or kilohertz (kHz). For example, CD audio uses 44,100 samples per second (44.1 kHz), while speech recognition often uses 16 kHz. A higher sample rate captures higher frequencies but produces a larger file.
See also: Best audio format for transcription
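The trade-off between frequency range and file size can be made concrete with a little arithmetic. This is a minimal sketch: the function names are hypothetical, and the size estimate assumes uncompressed 16-bit mono PCM with no file header.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent (half the rate)."""
    return sample_rate_hz / 2


def pcm_size_bytes(sample_rate_hz, seconds, bit_depth=16, channels=1):
    """Size of uncompressed PCM audio, ignoring any file header."""
    return int(sample_rate_hz * seconds * (bit_depth // 8) * channels)


for rate in (16_000, 44_100):
    size_mb = pcm_size_bytes(rate, seconds=60) / 1_000_000
    print(f"{rate} Hz: up to {nyquist_hz(rate):.0f} Hz, {size_mb:.2f} MB per minute")
```

Under these assumptions, one minute at 16 kHz takes 1.92 MB, while the same minute at 44.1 kHz takes about 5.29 MB.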
SDH – Subtitles for the Deaf and Hard of Hearing
SDH is a subtitle track that combines the dialogue with the extra information deaf and hard-of-hearing viewers need, such as speaker names and descriptions of sound effects and music. It was created so that subtitle formats used on DVDs, Blu-ray, and streaming could carry the same detail that traditional closed captions provide.
See also: Transcription vs captions vs subtitles
Speaker Diarization
Speaker diarization is the process of figuring out who spoke and when in an audio recording. It splits a conversation into segments labeled by speaker (Speaker 1, Speaker 2, and so on), so a transcript can show who said what even before anyone attaches real names.
See also: What is speaker diarization?
Speech-to-Text (ASR)
Speech-to-text, also called automatic speech recognition (ASR), is the technology that turns spoken language into written text automatically. It is what powers voice assistants, live captions, and AI transcription tools, and it works without a human typing out the words.
See also: What is speech-to-text?, How accurate is AI transcription?
SRT (SubRip)
SRT, short for SubRip, is one of the most widely supported caption and subtitle file formats. Each entry lists a number, a start and end time, and the text to show, and its timestamps use a comma before the milliseconds (for example, 00:00:01,200). SRT files have no header and carry only plain text with no styling.
See also: SRT vs VTT, How to make an SRT file
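To show what an entry looks like on disk, here is a minimal sketch that builds SRT text from start and end times given in seconds. The helper names and the sample captions are made up for illustration.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, with a comma before the milliseconds."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_entry(number, start, end, text):
    """Build one SRT entry: number, time range, then the caption text."""
    return f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


# Hypothetical cues; joining on a newline leaves a blank line between entries.
cues = [
    (1.2, 3.5, "Welcome back to the show."),
    (3.9, 6.0, "Today we talk about captions."),
]
print("\n".join(srt_entry(i, s, e, t) for i, (s, e, t) in enumerate(cues, start=1)))
```

The blank line between entries matters, since it is what tells a player where one entry ends and the next begins.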
Subtitle
A subtitle is on-screen text that shows what is being said in a video, usually assuming the viewer can hear the audio. Subtitles are often a translation of the dialogue into another language, which is the main way they differ from captions.
See also: Transcription vs captions vs subtitles, Transcription vs translation

T

Timestamp
A timestamp is a time marker in a transcript that shows when a given word or passage was spoken, usually written as hours, minutes, and seconds. Timestamps let you jump back to the exact moment in the audio or video and are the basis for building captions and subtitles.
See also: What is a timestamped transcript?
Transcript
A transcript is the written text of what was said in an audio or video recording. It captures the spoken words as a document you can read, search, or edit, and it may or may not include timestamps and speaker labels.
See also: What is transcription?, What is a timestamped transcript?
Transcription
Transcription is the process of converting spoken audio or video into written text. The output is a document that records what was said, and depending on the style it may capture every word exactly or a cleaned-up version of the speech.
See also: What is transcription?, Types of transcription

U

Utterance
An utterance is a single continuous stretch of speech from one speaker, usually bounded by pauses or a change of speaker. Transcription and diarization systems often break audio into utterances as their basic unit, attaching a speaker label and timestamps to each one. An utterance can be anything from a single word to a full sentence or more.
See also: What is speaker diarization?

V

Verbatim Transcription
Verbatim transcription captures speech exactly as it was said, word for word. Full or true verbatim goes further and also records filler words, false starts, stutters, and non-speech sounds like laughter, making it useful for legal, research, and qualitative work where every detail matters.
See also: Verbatim vs clean verbatim, Types of transcription

W

WebVTT
WebVTT (Web Video Text Tracks) is a caption and subtitle format built for HTML5 video on the web. Every file begins with the line WEBVTT, and its timestamps use a period before the milliseconds (for example, 00:00:01.200), unlike SRT's comma. It also supports basic positioning and styling that SRT lacks.
See also: SRT vs VTT
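The two differences named above, the WEBVTT first line and the period in timestamps, are enough for a simple conversion from SRT. The sketch below is a hypothetical helper that assumes well-formed SRT input, leaves the entry numbers in place as cue identifiers, and does not handle styling or positioning.

```python
import re

# Matches an SRT-style timestamp such as 00:00:01,200.
SRT_TIME = re.compile(r"(\d{2,}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text):
    """Add the WEBVTT header and switch the millisecond separator to a period."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:  # only touch timing lines, not caption text
            line = SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


srt = "1\n00:00:01,200 --> 00:00:03,500\nWelcome back to the show.\n"
print(srt_to_webvtt(srt))
```

Restricting the substitution to timing lines keeps a caption that happens to contain a time with a comma from being rewritten.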
Word Error Rate (WER)
Word error rate (WER) is a standard measure of transcription accuracy, where a lower number is better. It is calculated as the number of substitutions, deletions, and insertions divided by the total number of words in the reference (correct) transcript, so a WER of 5% means about 1 word in 20 is wrong.
See also: Word error rate explained, How accurate is AI transcription?
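One common way to count the substitutions, deletions, and insertions is a word-level edit distance. The sketch below assumes simple normalization (lowercasing and splitting on whitespace), and the function name and sample sentences are hypothetical; evaluation tools usually also normalize punctuation and numbers before scoring.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown box jumps over lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 22.2%
```

Here one substitution ("fox" heard as "box") and one deletion (the second "the") give 2 errors against 9 reference words.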
