What is a timestamped transcript?

What the time codes in a transcript actually mean, how segment- and word-level timing differ, and when they're worth having – for anyone who cites, captions, or verifies audio.

The short answer

A timestamped transcript is an ordinary transcript with one addition: every segment or word is tagged with the exact time it occurs in the audio, written as hours:minutes:seconds – for example, 00:02:17,440. Those time codes let you jump back to any line, cite the precise moment of a quote, and turn plain text into synchronized captions.

A timestamped transcript, defined

In a timestamped transcript, every chunk of text carries the time it was spoken. Instead of a flat wall of words, you get lines anchored to the clock: 00:02:17 here, 00:05:41 there. Any passage points straight back to its moment in the recording.

The time codes are written as hours, minutes, and seconds, often down to the millisecond. A time code like 00:02:17,440 means two minutes, seventeen seconds, and 440 milliseconds into the file. How finely a transcript is stamped varies: some mark every speaker turn, some every sentence, some every single word.
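To make the arithmetic concrete, here is a minimal Python sketch that turns a time code into milliseconds. The function name timecode_to_ms and the choice to accept either a comma or a period before the fraction are assumptions made for illustration, not part of any standard library.

```python
import re

# HH:MM:SS with an optional fraction after a comma or a period.
_TIMECODE = re.compile(r"^(\d{2,}):([0-5]\d):([0-5]\d)(?:[.,](\d{1,3}))?$")


def timecode_to_ms(timecode: str) -> int:
    """Convert a time code such as '00:02:17,440' to milliseconds."""
    match = _TIMECODE.match(timecode.strip())
    if match is None:
        raise ValueError(f"not a recognised time code: {timecode!r}")
    hours, minutes, seconds, fraction = match.groups()
    # Right-pad the fraction so '4' means 400 ms and '44' means 440 ms.
    millis = int((fraction or "0").ljust(3, "0"))
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + millis


print(timecode_to_ms("00:02:17,440"))  # 137440
```

Two minutes and seventeen seconds is 137 seconds, so the example lands 137,440 milliseconds into the file.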

This page is about what the artifact is and when it's worth having. If you already know you want one and just need to make it, generate a transcript with timestamps instead. The rest of this guide is the why, not the how.

Segment-level and word-level timestamps

Timestamps come in two standard grains, and they answer different questions. OpenAI's speech-to-text API exposes exactly two timestamp granularities – segment and word – and that split is the one most tools follow. Segment timing marks each block of speech; word timing marks each individual word.

Word-level timing is the opt-in layer. In the OpenAI reference, word timings only appear when you ask for them: set the response format to verbose_json and request word granularity. Segment-level timing is the readily available grain; word-level is the one you switch on.
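A rough Python sketch of how the two grains relate once you have both. The response below is hand-written for illustration, not real API output, and its field names (segments, words, start, end) as well as the request option keys are assumptions about the shape your client returns; check them against the API reference before relying on them.

```python
# Options described in the text: verbose_json plus the grains you want.
# The key names are an assumption about the client you use.
request_options = {
    "response_format": "verbose_json",
    "timestamp_granularities": ["segment", "word"],
}

# Hand-written sample, NOT real API output. Times are in seconds.
sample_response = {
    "segments": [
        {"start": 0.0, "end": 1.3, "text": "Check the budget."},
        {"start": 1.3, "end": 2.5, "text": "It is late."},
    ],
    "words": [
        {"word": "Check", "start": 0.0, "end": 0.42},
        {"word": "the", "start": 0.42, "end": 0.58},
        {"word": "budget", "start": 0.58, "end": 1.1},
        {"word": "It", "start": 1.3, "end": 1.45},
        {"word": "is", "start": 1.45, "end": 1.6},
        {"word": "late", "start": 1.6, "end": 2.1},
    ],
}


def words_in_segment(response: dict, segment_index: int) -> list[dict]:
    """Return the word entries whose start time falls inside one segment."""
    segment = response["segments"][segment_index]
    return [
        word
        for word in response.get("words", [])
        if segment["start"] <= word["start"] < segment["end"]
    ]


for index, segment in enumerate(sample_response["segments"]):
    words = [w["word"] for w in words_in_segment(sample_response, index)]
    print(f'{segment["start"]:.2f}s  {segment["text"]}  {words}')
```

If the words list is missing because word granularity was never requested, the helper simply returns an empty list and the segment timing is all you have.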

For most work, segment-level is enough. You want the answer that starts around 14:30, not the millisecond edge of the word 'budget.' Word-level timing earns its cost in captions, in karaoke-style highlighting, and when a specific word has to line up with video. Match the grain to the job.

Why native timestamps drift, and how alignment fixes them

The timestamps a model prints aren't automatically accurate. The peer-reviewed WhisperX paper (Interspeech 2023) reports that Whisper's utterance timestamps are prone to inaccuracy and that word-level timestamps aren't available out of the box. The clock the decoder guesses at and the clock the audio follows can drift apart.

The fix is forced alignment. The Montreal Forced Aligner defines it as taking a known transcript and generating a time-aligned version, using a pronunciation dictionary to look up the phones for each word. Plainly: you already have the words, so you align them to the waveform to recover accurate start and end times.

WhisperX pairs voice-activity detection with forced phoneme alignment to produce time-accurate word-level timestamps, reporting strong word-segmentation results. If word-level precision matters to you, ask how the timestamps were produced. Decoder-native timing and alignment-corrected timing aren't the same thing, even when both get labeled 'word-level.'

Timestamps are what turn a transcript into captions

Captions are a timestamped transcript in a specific file format – nothing more exotic than that. And for accessibility conformance they aren't optional: WCAG 2.1 Success Criterion 1.2.2 makes captions for prerecorded audio content in synchronized media, such as a video with a soundtrack, a Level A requirement, the baseline conformance bar. Without accurate time codes, text can't stay in sync with speech, so it can't work as captions at all.

The two common caption formats time their cues almost identically, with one giveaway difference. SubRip (SRT) writes a cue as 00:02:17,440, with a comma before the milliseconds – the form shown in Matroska's subtitle documentation. WebVTT, the web caption format specified by the W3C, uses a period instead: 00:02:17.440. Same timing, comma versus full stop.
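The separator difference is easy to see in code. This is a small Python sketch, with hypothetical helper names, that renders the same moment in both styles and builds the timing line that opens a cue. Both formats put " --> " between the start and end times.

```python
SRT_SEPARATOR = ","      # SubRip: 00:02:17,440
WEBVTT_SEPARATOR = "."   # WebVTT: 00:02:17.440


def format_timecode(total_ms: int, separator: str) -> str:
    """Render milliseconds as HH:MM:SS<separator>mmm."""
    if total_ms < 0:
        raise ValueError("time codes cannot be negative")
    seconds, millis = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def cue_timing_line(start_ms: int, end_ms: int, separator: str) -> str:
    """Build the 'start --> end' line used by both caption formats."""
    start = format_timecode(start_ms, separator)
    end = format_timecode(end_ms, separator)
    return f"{start} --> {end}"


print(cue_timing_line(137440, 140375, SRT_SEPARATOR))     # 00:02:17,440 --> 00:02:20,375
print(cue_timing_line(137440, 140375, WEBVTT_SEPARATOR))  # 00:02:17.440 --> 00:02:20.375
```

Keeping time as integer milliseconds internally and choosing the separator only at export avoids mixing the two styles in one file.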

Building a clean caption file is its own craft. Cue length and reading speed matter as much as your line breaks. If that's your goal, how to make an SRT file covers the block structure and the timing rules. For this page, the point is narrower: captions exist because the transcript underneath them is timestamped.

When does a timestamped transcript actually matter?

Timestamps matter most when you have to prove a quote is real. APA Style tells writers to give a time stamp in place of a page number when directly quoting an audiovisual work. The Chicago Manual of Style advises the same: a time stamp works like a page number, pointing a reader to the exact moment.
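If your transcript stores start times in seconds, a short helper can produce a compact time stamp for the citation. This Python sketch assumes a minutes:seconds style that drops leading zeros and the fractional part; that presentation is an assumption, so check the style guide you are writing to before adopting it.

```python
def citation_timestamp(start_seconds: float) -> str:
    """Format the start of a quotation as m:ss, or h:mm:ss past the hour."""
    if start_seconds < 0:
        raise ValueError("start time cannot be negative")
    whole_seconds = int(start_seconds)  # drop the fractional part
    minutes, seconds = divmod(whole_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"


print(citation_timestamp(137.44))  # 2:17
print(citation_timestamp(3725))    # 1:02:05
```

The result slots in where a page number would go in the in-text citation.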

A time code is an audit trail. Transcription is never neutral: Oliver, Serovich and Mason call it a powerful act of representation that can shape which conclusions get drawn. Being able to jump back to the source line and hear it in context is how you keep the text honest. That's why the interview transcription workflow treats timestamps as a fact-checking tool, not a formatting nicety.

Timestamps also make long recordings navigable, and they pair naturally with speaker labels. 'Who said what, when' is two questions stacked: speaker diarization answers the who, and the timestamps answer the when. Together they let you index a three-hour recording and jump to the line you need instead of scrubbing for it.
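As a sketch of that kind of index, the Python below stores each line with its start time and speaker label, then uses a binary search to find the line being spoken at a given moment. The Line class, the helper names, and the sample transcript are all hypothetical; it assumes the lines are already sorted by start time.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    start: float   # seconds from the beginning of the recording
    speaker: str   # label from speaker diarization
    text: str


def line_at(lines: list[Line], seconds: float) -> Line:
    """Return the line in progress at the given time (lines sorted by start)."""
    starts = [line.start for line in lines]
    index = bisect_right(starts, seconds) - 1
    if index < 0:
        raise ValueError("time is before the first line")
    return lines[index]


def lines_by_speaker(lines: list[Line], speaker: str) -> list[Line]:
    """Return every line attributed to one speaker, in order."""
    return [line for line in lines if line.speaker == speaker]


# Invented sample transcript for illustration only.
transcript = [
    Line(0.0, "Host", "Welcome back."),
    Line(862.5, "Guest", "The budget came in late."),
    Line(905.0, "Host", "How late?"),
]

hit = line_at(transcript, 870.0)  # 14:30 into the recording
print(hit.speaker, hit.text)      # Guest The budget came in late.
```

The timestamp answers the when and the speaker field answers the who, so one lookup gives you both.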

Tips from people who do this a lot

  • Ask whether a tool's word-level timestamps are decoder-native or alignment-corrected. For syncing single words to video, the aligned kind is the one you want.

  • Segment-level timing is plenty for citation and navigation. Don't pay for word-level unless you're building captions or highlighting words.

  • If you export SRT and the player shows nothing, check the milliseconds separator: SRT wants a comma (00:02:17,440), WebVTT wants a period (00:02:17.440).

  • Keep timestamps on the master transcript even after you export a clean copy. They're your fastest route back to the audio when a quote gets challenged.

  • When citing audio in APA or Chicago, note the start time of the quotation, not the whole segment's range. Both styles want the exact moment.


Timestamped transcript – questions, answered

What does a timestamp in a transcript look like?

It's a clock reading in hours, minutes, and seconds, often down to the millisecond. For example, 00:02:17,440 means two minutes and 17.44 seconds into the audio. SubRip (SRT) puts a comma before the milliseconds; WebVTT uses a period. Both mark the same moment in the recording.

What's the difference between segment and word-level timestamps?

Segment-level timing stamps each block of speech, like a sentence or a speaker turn. Word-level timing stamps every individual word. Segment is enough for citation and navigation; word-level is the opt-in layer you turn on for captions or lining a single word up with video. OpenAI's API supports exactly these two grains.

Are automatic timestamps accurate?

Not always. The peer-reviewed WhisperX paper found Whisper's native utterance timestamps prone to inaccuracy, with word-level timing missing out of the box. Forced alignment, which maps a known transcript back onto the audio, corrects this and produces time-accurate word-level timestamps. If precision matters, ask how the timings were made.

Do I need timestamps to make captions?

Yes. Caption files like SRT and WebVTT are timestamped transcripts in a set format, and the time codes are what keep text in sync with speech. Captions for prerecorded audio in synchronized media, such as video with sound, are also a WCAG 2.1 Level A requirement, so accurate timing is the accessibility baseline, not a nicety.

How do I cite a quote from audio or video?

Use a timestamp in place of a page number. APA Style tells writers to give the time stamp at the start of the quotation when quoting an audiovisual work, and the Chicago Manual of Style advises the same. The time code points your reader to the exact moment, so they can verify the line themselves.

References

  1. Speech to text guide – timestamp granularities (segment, word, or both). OpenAI (official API docs).
  2. Create transcription API reference – supported granularity values are word and segment. OpenAI.
  3. Bain, Huh, Han & Zisserman (2023), WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. Proc. Interspeech 2023 (ISCA Archive).
  4. Forced alignment, defined – User Guide. Montreal Forced Aligner documentation.
  5. Direct quotation of material without page numbers – time stamp for audiovisual works. APA Style (official).
  6. Time stamps when citing a film or recording. The Chicago Manual of Style Online.
  7. Technical: Subtitles – SubRip/SRT cue timing (HH:MM:SS,mmm). Matroska.org.
  8. WebVTT: The Web Video Text Tracks Format – fractional-seconds separator. W3C.
  9. Understanding SC 1.2.2: Captions (Prerecorded) – Level A. W3C WAI.
  10. Oliver, Serovich & Mason (2005), Constraints and Opportunities with Interview Transcription. Social Forces (Oxford University Press).
