What's actually inside an SRT file?
An SRT file is plain text, and every caption is a block of four parts: a sequence number, a timecode line, one or two lines of text, then a blank line. The timecode is the part people get wrong. It reads `00:02:17,440 --> 00:02:20,375` – note the comma before the three-digit millisecond field, not a period. Hours, minutes, and seconds are always two digits; milliseconds are always three.
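The timecode rules above are easy to check mechanically. Here is a minimal sketch in Python, assuming a strict reading of the format (two-digit hours, minutes, and seconds, a comma, three-digit milliseconds, and the ` --> ` arrow with single spaces). The function name `parse_timecode_line` is hypothetical, and real players are often more lenient than this pattern.

```python
import re

# Strict pattern: HH:MM:SS,mmm --> HH:MM:SS,mmm (comma, not period).
TIMECODE_LINE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def parse_timecode_line(line: str) -> tuple[int, int]:
    """Return (start_ms, end_ms), or raise ValueError if the line is malformed."""
    match = TIMECODE_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid SRT timecode line: {line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(group) for group in match.groups())
    start_ms = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end_ms = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start_ms, end_ms


print(parse_timecode_line("00:02:17,440 --> 00:02:20,375"))  # (137440, 140375)
```

A line written with periods, such as `00:02:17.440 --> 00:02:20.375`, raises `ValueError` here, which is the mistake this check is meant to catch.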
There is no formal standard behind any of this. SubRip's format is about as basic as subtitle formats get: a de facto convention, not a spec published by a standards body. That simplicity is a large part of why it plays almost everywhere. The trade-off is that with no official rulebook, edge cases such as styling and positioning simply aren't defined, so keep the text plain.
One detail that saves you from garbled output: save the file as UTF-8. SRT has no field that declares its encoding, so the player has to assume one, and UTF-8 is the safest choice. If your captions contain accented or non-Latin characters, the wrong encoding turns them into mojibake. Plain ASCII survives almost any encoding mix-up; the moment you have an é or a ü, encoding matters.
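If you generate the file from a script, the encoding is one explicit argument. The sketch below assumes Python and uses two hypothetical helper names: `save_srt` writes UTF-8, and `misread_as_latin1` only demonstrates what mojibake looks like when UTF-8 bytes are decoded with the wrong encoding.

```python
from pathlib import Path


def save_srt(path: str, content: str) -> None:
    """Write caption text to disk, stating the encoding explicitly."""
    Path(path).write_text(content, encoding="utf-8")


def misread_as_latin1(text: str) -> str:
    """Show what UTF-8 text looks like when a reader assumes Latin-1."""
    return text.encode("utf-8").decode("latin-1")


print(misread_as_latin1("é"))  # Ã©
```

Leaving out `encoding=` makes the result depend on the platform default, which is how the same script can produce a clean file on one machine and a garbled one on another.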
Type it by hand or export it from a transcript?
You can build an SRT in any plain-text editor. Open Notepad or TextEdit (in plain-text mode), write your blocks, and save as filename.srt with UTF-8 encoding. Number the cues 1, 2, 3, put the timecode on its own line with the ` --> ` arrow, add your text, then leave a blank line. That's the whole format. For a two-minute clip, hand-typing is fine.
For anything longer, typing timecodes by hand is the slow part – you're pausing, scrubbing, and copying numbers for every line. The faster path is to start from a timestamped transcript and let the timings come from the audio. Upload your recording and export straight to SRT, then read the draft against the audio and fix the spots that matter. If you also need an editable document, export the same transcript to DOCX.
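To make the transcript route concrete, here is a rough sketch of how timestamped segments turn into SRT blocks. It assumes each segment is a `(start_seconds, end_seconds, text)` tuple. That shape is an assumption for illustration, not the output format of any particular transcription tool, and both function names are hypothetical.

```python
def to_timecode(seconds: float) -> str:
    """Render a time in seconds as an SRT timecode, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Number the cues from 1 and separate the blocks with a blank line."""
    blocks = []
    for number, (start, end, text) in enumerate(segments, start=1):
        timecode = f"{to_timecode(start)} --> {to_timecode(end)}"
        blocks.append(f"{number}\n{timecode}\n{text.strip()}\n")
    return "\n".join(blocks)


print(segments_to_srt([(0.0, 1.5, "Hi"), (1.5, 3.0, "There")]))
```

The output is only a draft with valid structure. Checking it against the audio is still a manual step.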
Whichever route you take, the cleanup work is the same as on any interview or recording transcript: correct names and jargon, then adjust where each cue starts and ends so it tracks the speech. The machine gets you a structurally valid file in minutes; your attention goes to timing and readability, not to typing arrows and commas.
What are the timing and line rules for readable captions?
Keep each caption to two lines. The Captioning Key from the DCMP states it plainly: no more than two lines per caption. A third line pushes text over the video and gives the reader too much to catch before the cue changes. If a sentence won't fit in two lines, split it across two consecutive cues instead of cramming.
Line length has a practical ceiling too. Netflix's timed-text style guide caps English lines at 42 characters. Stay at or under that and captions are far less likely to get clipped on narrow players or crowd the frame. Go longer and you're betting on the player wrapping gracefully, which it often won't.
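The two-line and 42-character limits can be applied together when breaking up a long sentence. The following is a minimal sketch using Python's standard `textwrap` module. The function name `split_into_cue_texts` is hypothetical, and the sketch handles only the text: dividing the original cue's time span between the resulting cues is left to you.

```python
import textwrap

MAX_CHARS_PER_LINE = 42  # Netflix's English limit, as cited above
MAX_LINES_PER_CUE = 2    # DCMP's two-line rule, as cited above


def split_into_cue_texts(sentence: str) -> list[str]:
    """Wrap a sentence to the line limit, then group lines two per cue."""
    lines = textwrap.wrap(sentence, width=MAX_CHARS_PER_LINE)
    return [
        "\n".join(lines[i:i + MAX_LINES_PER_CUE])
        for i in range(0, len(lines), MAX_LINES_PER_CUE)
    ]


long_sentence = (
    "If a sentence will not fit in two lines, split it across "
    "two consecutive cues instead of cramming it into one."
)
for cue_text in split_into_cue_texts(long_sentence):
    print(cue_text)
    print()
```

`textwrap.wrap` breaks at whitespace, so the line breaks it picks are mechanical. Moving a break so it falls at a natural phrase boundary is still worth doing by hand.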
Reading speed is what really determines timing. Netflix sets a ceiling of 20 characters per second for adult programs and 17 for children's. The DCMP frames the same kind of limit in words per minute: 130 wpm for lower-level, 140 for middle-level, and 160 for upper-level material. If a cue demands faster reading than that, hold it on screen longer or trim the text.
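The characters-per-second check is simple arithmetic: character count divided by on-screen time. The sketch below is one way to compute it. It assumes spaces count as characters and line breaks do not, which is a simplification for illustration and may not match how a given style guide counts. Both function names are hypothetical.

```python
import math


def chars_per_second(text: str, start_ms: int, end_ms: int) -> float:
    """Reading speed a cue demands, in characters per second."""
    duration_s = (end_ms - start_ms) / 1000
    if duration_s <= 0:
        raise ValueError("cue must end after it starts")
    visible = text.replace("\n", "")  # assumption: line breaks are not counted
    return len(visible) / duration_s


def min_duration_ms(text: str, max_cps: float = 20) -> int:
    """Shortest on-screen time that keeps the cue at or under max_cps."""
    visible = text.replace("\n", "")
    return math.ceil(len(visible) * 1000 / max_cps)


cue = "Reading speed is what\nreally determines timing."
print(chars_per_second(cue, 0, 2000))
print(min_duration_ms(cue), min_duration_ms(cue, max_cps=17))
```

If `chars_per_second` comes out above the ceiling, `min_duration_ms` says how long the cue would need to stay up, which helps you decide between extending it and trimming the text.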
Should you make an SRT or a WebVTT file?
Make an SRT for maximum compatibility; make a WebVTT (.vtt) for the web and for styling. The formats look almost identical, but WebVTT is published by the W3C on the Recommendation track – an actual specification, unlike SRT. The most visible difference in the timecode: WebVTT uses a full stop (period) before the thousandths field, where SRT uses a comma.
WebVTT also does things SRT can't. It supports styling and positioning through CSS, targeting cues with the ::cue pseudo-elements so you can set fonts, colours, and where text sits on screen. SRT has no defined way to do any of that. If you need captions to look a certain way in an HTML5 player, export to VTT rather than fighting SRT's plain-text limits.
For most uploads – YouTube, Vimeo, editing suites, social platforms – SRT is still the safe default, because nearly everything reads it. Reach for WebVTT when your target is a web player you control and you care about presentation. Converting between them is straightforward, since the block structure is the same; the main differences are the millisecond separator and the `WEBVTT` header line that a .vtt file must start with.
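As an illustration of how small the conversion is, here is a minimal SRT-to-WebVTT sketch. It assumes a plain SRT input with no styling tags, adds the `WEBVTT` header line that a .vtt file starts with, and swaps the comma for a period on timecode lines only. The function name `srt_to_vtt` is hypothetical.

```python
import re

# Matches one SRT timestamp so the comma can be swapped for a period.
_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert plain SRT text to WebVTT text."""
    out = ["WEBVTT", ""]  # header line, then a blank line
    for line in srt_text.lstrip("\ufeff").splitlines():
        if "-->" in line:  # only touch timecode lines, never caption text
            line = _TIMESTAMP.sub(r"\1.\2", line)
        out.append(line)
    return "\n".join(out) + "\n"


print(srt_to_vtt("1\n00:02:17,440 --> 00:02:20,375\nHello, world\n"))
```

The sequence numbers are left in place, since WebVTT accepts a line before the timecode as an optional cue identifier. Commas inside the caption text are untouched because only lines containing the arrow are rewritten.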
Why make an SRT file at all?
For a lot of published video, captions are an accessibility requirement, not a nicety. The W3C's WCAG 2.1 sets captions for prerecorded synchronized media as a Level A success criterion (SC 1.2.2), the baseline conformance level. If your video has audio and you're publishing it, an accurate caption file is part of meeting that bar.
Beyond compliance, an SRT is a small, portable artifact you own. It's searchable text tied to exact timecodes, so it doubles as a way to index a video and pull quotes from the spoken content. Because it's plain text, you can move it between platforms and keep it long after the tool that made it is gone.
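Pulling quotes out of a caption file takes only a few lines of code. The sketch below assumes a well-formed SRT (number, timecode line, text, blank line) and silently skips any block that doesn't fit that shape. `parse_srt` and `find_quotes` are hypothetical names, not part of any library.

```python
import re


def parse_srt(srt_text: str) -> list[tuple[str, str, str]]:
    """Return (start, end, text) for each well-formed cue block."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # not a number / timecode / text block
        start, _, end = lines[1].partition("-->")
        cues.append((start.strip(), end.strip(), " ".join(lines[2:])))
    return cues


def find_quotes(srt_text: str, phrase: str) -> list[tuple[str, str]]:
    """Case-insensitive search; returns (start timecode, cue text) pairs."""
    needle = phrase.lower()
    return [
        (start, text)
        for start, _end, text in parse_srt(srt_text)
        if needle in text.lower()
    ]


sample = (
    "1\n00:00:01,000 --> 00:00:03,000\nThe budget is approved.\n\n"
    "2\n00:00:03,500 --> 00:00:05,000\nNext item, please.\n"
)
print(find_quotes(sample, "budget"))
```

A phrase that spans two cues won't match, because each cue is searched on its own. Joining adjacent cues before searching would be the next refinement.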