What 'accurate captions' actually means
Under the FCC's accuracy standard, captions must match the spoken words in the order spoken, without substituting words for proper names and places. They also have to convey nonverbal information: speaker identity, the presence of music, sound effects, and audience reaction (47 CFR 79.1(j)(2)(i)). So "accurate" means more than a clean word-for-word match.
A caption file is more than a transcript with timecodes. The accuracy standard treats a missing speaker label or an unmarked laugh as a defect, because a deaf or hard-of-hearing viewer depends on captions for everything the audio carries. Getting the words right but dropping a [MUSIC] cue still misses the point.
This is why raw speech-to-text output isn't automatically caption-grade. A model can nail the transcript and still miss speaker identity, sound cues, and the on-screen timing captions need. For the mechanics of scoring the words alone, see how word error rate is calculated.
The FCC's four caption accuracy standards
US caption rules rest on four quality standards, set out in a fixed order: accuracy, synchronicity, completeness, and placement (47 CFR 79.1(j)(2)). They were codified from the FCC's 2014 Closed Captioning Quality Report and Order and govern captioned TV programming.
Each standard covers a distinct failure. Accuracy is the words plus non-speech information. Synchronicity means captions line up with the audio and hold on screen long enough to read. Completeness means captions run from the start of a program to the end. Placement means captions don't block faces, mouths, or other on-screen text.
These are the caption accuracy standards most US broadcasters and programmers work against. They're standards, not one pass/fail number: the FCC weighs them together and, as the next sections show, applies them differently to live versus prerecorded content.
How caption accuracy gets measured
There's no universal legal percentage for caption accuracy. The DCMP Captioning Key, a widely used quality reference, sets no numeric threshold at all – it states that "errorless captions are the goal for each production." The one figure DCMP hosts, 98% or better, sits on its real-time captioning page and is attributed to captioning companies, not set by DCMP.
That distinction matters. A "99% accuracy" figure gets repeated as if it were an official caption standard. It isn't DCMP's. DCMP's own guidance asks for errorless captions and, for live work, cites the 98%-or-better rate vendors typically set. Offline captions are usually scored with word error rate; live captions use a different model, covered in the live-captioning section below.
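As a rough sketch of what "scored with word error rate" means: WER is the word-level edit distance between a reference transcript and the caption text (substitutions, deletions, and insertions), divided by the number of reference words. The function name and the lowercase, whitespace-split tokenization here are illustrative assumptions; real scoring tools normalize punctuation, numbers, and spelling variants more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from captions
                curr[j - 1] + 1,     # insertion: extra word in captions
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A caption file can score a WER of zero and still drop every speaker label and sound cue, which is why the number covers the words alone.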
Real-world speech-to-text rarely reaches these bars on messy audio. Overlap, accents, background noise, and unfamiliar names all drag accuracy down, which is why an AI first pass plus human cleanup beats either alone. For a grounded look at the numbers, see how accurate AI transcription really is.
Reading speed is part of caption quality too. Captions that flash past faster than a viewer can read fail even when every word is correct. The DCMP Captioning Key's presentation-rate guidance caps captions at 130 words per minute for lower-level material, 140 for middle-level, and 160 for upper-level.
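The presentation-rate caps lend themselves to a quick per-cue check. The sketch below assumes each caption cue is available as text plus start and end times in seconds, and that the rate is measured per cue as words divided by on-screen time; both are illustrative assumptions, as are the function names. It also counts bracketed cues such as [MUSIC] as words, which a real tool might choose not to do.

```python
# Presentation-rate caps in words per minute, as quoted from the DCMP Captioning Key.
WPM_CAPS = {"lower": 130, "middle": 140, "upper": 160}


def words_per_minute(text: str, start_s: float, end_s: float) -> float:
    """Reading rate of one caption cue: word count scaled to a full minute."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("cue must end after it starts")
    return len(text.split()) * 60.0 / duration


def too_fast(text: str, start_s: float, end_s: float, level: str = "middle") -> bool:
    """True when the cue exceeds the cap for the given material level."""
    return words_per_minute(text, start_s, end_s) > WPM_CAPS[level]


if __name__ == "__main__":
    # Six words shown for two seconds is 180 wpm, over every cap.
    print(words_per_minute("one two three four five six", 0.0, 2.0))
    print(too_fast("one two three four five six", 0.0, 2.0, level="upper"))
```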
Why are live captions judged by a different standard?
Because you can't proofread speech as it happens. The FCC's rules define live, near-live, and prerecorded programming separately, and apply the quality standards differently to live and near-live content than to prerecorded content (47 CFR 79.1). A stenographer or re-speaker working in real time can't hit an offline editor's word-perfect bar.
So live captions get their own metric. Instead of plain word error rate, live subtitling quality is scored with a severity-weighted model called NER (number of words, edition errors, recognition errors), developed by Pablo Romero-Fresco. It weights errors by how much they distort meaning rather than counting every slip equally (Romero-Fresco & Pöchhacker; Wolk & Korzinek).
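To make the severity weighting concrete, here is a minimal sketch of the NER calculation as it is commonly described: accuracy = (N - E - R) / N x 100, where N is the number of words in the subtitles, E is the weighted sum of edition errors, and R is the weighted sum of recognition errors. The weights used below (0.25 minor, 0.5 standard, 1.0 serious) are the values usually quoted for the model, not figures from this article, so treat them as assumptions and confirm them against the published model before relying on them. The function and variable names are hypothetical.

```python
from typing import Iterable

# Assumed severity weights, as usually quoted for the NER model.
SEVERITY_WEIGHTS = {"minor": 0.25, "standard": 0.5, "serious": 1.0}


def ner_score(
    n_words: int,
    edition_errors: Iterable[str],
    recognition_errors: Iterable[str],
) -> float:
    """NER accuracy in percent: (N - E - R) / N * 100.

    Each error is given by its severity label; its weight is looked up in
    SEVERITY_WEIGHTS, so a serious error costs four times a minor one.
    """
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    e = sum(SEVERITY_WEIGHTS[s] for s in edition_errors)
    r = sum(SEVERITY_WEIGHTS[s] for s in recognition_errors)
    return (n_words - e - r) / n_words * 100.0


if __name__ == "__main__":
    # 200 subtitle words, two edition errors and two recognition errors.
    print(ner_score(200, ["minor", "standard"], ["serious", "minor"]))
```

In the example, four raw errors cost only 2.0 weighted points, where a flat word count would have charged all four equally.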
Regulators pair accuracy with timing. The UK's Ofcom measures live subtitling on three things: average speed, average latency – the delay between speech and subtitle – and the number and severity of errors. The University of Roehampton validates the measurements (Ofcom). Speed and latency are separate metrics, not part of the NER score.
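Speed and latency are simple averages once the timing data exists. The sketch below assumes you have sampled pairs of (time a phrase was spoken, time its subtitle appeared) in seconds, plus a total word count and duration for the sample; how Ofcom's reviewers actually sample programmes is not described here, so the data shapes and function names are illustrative assumptions.

```python
from statistics import mean
from typing import Sequence, Tuple


def average_latency(samples: Sequence[Tuple[float, float]]) -> float:
    """Mean delay in seconds between speech and the matching subtitle.

    samples: (spoken_at_s, shown_at_s) pairs.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return mean(shown - spoken for spoken, shown in samples)


def average_speed_wpm(total_words: int, duration_s: float) -> float:
    """Average subtitle speed in words per minute over the sampled stretch."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return total_words * 60.0 / duration_s


if __name__ == "__main__":
    print(average_latency([(10.0, 14.0), (20.0, 26.0), (30.0, 35.0)]))
    print(average_speed_wpm(450, 180.0))
```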
WCAG, Section 508, and the legal baseline
Outside broadcast TV, caption requirements usually flow through web accessibility law. WCAG 2.1 makes captions for prerecorded video a Level A requirement (SC 1.2.2) – the minimum conformance level – and captions for live video a Level AA requirement (SC 1.2.4).
US federal agencies inherit those rules through Section 508. The revised standards incorporate WCAG 2.0 Level AA by reference and apply it to both web and non-web electronic content. For a government site or a federal contractor, "accessible video" means captions meeting the WCAG AA bar.
For internet-delivered video, the FCC's IP-captioning rule (47 CFR 79.4) grew out of the 2010 Twenty-First Century Communications and Video Accessibility Act, which extended captioning to programming shown online after it aired on TV. That's background, not legal advice: which rules bind you depends on your content and audience.
Whichever rule binds you, the criteria converge. Accurate captions carry every word in order, name the speakers, mark the non-speech sounds, stay in sync, run start to finish, and read slowly enough to follow. No single percentage makes a caption "compliant." For your own video, start from a clean, time-synced file and check it against the audio – you can export a time-synced SRT caption file, then correct names, sound cues, and timing by hand.
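Part of that hand check can be pre-screened. The sketch below parses SRT text (index line, a "HH:MM:SS,mmm --> HH:MM:SS,mmm" timing line, then caption text, with blank lines between cues) and flags cues whose timing looks wrong. It assumes sound cues are written in square brackets, as in [MUSIC]; the function names are hypothetical, and overlapping cues are only flagged for review, since some files overlap on purpose. It cannot tell you whether a speaker label or sound cue is missing; that still takes a listen.

```python
import re

TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    m = TIME_RE.match(stamp.strip())
    if not m:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, m.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text: str):
    """Return a list of (start_s, end_s, text) tuples, one per cue."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip anything that is not a well-formed cue
        start, end = lines[1].split("-->", 1)
        cues.append((to_seconds(start), to_seconds(end), " ".join(lines[2:])))
    return cues


def timing_problems(cues):
    """List (cue_index, reason) pairs for cues worth a manual look."""
    problems = []
    for i, (start, end, _text) in enumerate(cues):
        if end <= start:
            problems.append((i, "ends before it starts"))
        if i + 1 < len(cues) and end > cues[i + 1][0]:
            problems.append((i, "overlaps next cue"))
    return problems


def has_sound_cue(text: str) -> bool:
    """True if the cue contains a bracketed non-speech marker like [MUSIC]."""
    return bool(re.search(r"\[[^\]]+\]", text))
```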
This explainer stops at the standards. For the production procedure – attaching or burning in the file, syncing it, and exporting for a specific platform – see how to add subtitles to a video.