Generate WebVTT and SRT Captions Automatically with Node.js
November 15, 2021
Providing captions for audio and video isn't just a nice-to-have - it's critical for accessibility. While this isn't specifically an accessibility post, I wanted to start by sharing Microsoft's Inclusive Toolkit. Something I hadn't considered before reading this was the impact of situational limitations. To learn more, jump to Section 3 of the toolkit - "Solve for one, extend to many". Having a young (read "loud") child, I've become even more aware of when captions are available - and when they aren't, I simply can't watch something with her around.
There are two common and similar caption formats we are going to generate today - WebVTT and SRT. A WebVTT file looks like this:
WEBVTT
1
00:00:00.219 --> 00:00:03.512
- yeah, as much as it's worth celebrating
2
00:00:04.569 --> 00:00:06.226
- the first space walk
3
00:00:06.564 --> 00:00:07.942
- with an all female team
4
00:00:08.615 --> 00:00:09.795
- I think many of us
5
00:00:10.135 --> 00:00:13.355
- are looking forward to it just being normal.
And an SRT file looks like this:
1
00:00:00,219 --> 00:00:03,512
yeah, as much as it's worth celebrating
2
00:00:04,569 --> 00:00:06,226
the first space walk
3
00:00:06,564 --> 00:00:07,942
with an all female team
4
00:00:08,615 --> 00:00:09,795
I think many of us
5
00:00:10,135 --> 00:00:13,355
are looking forward to it just being normal.
Both are very similar in their basic forms, except that the millisecond separator is a period (.) in WebVTT and a comma (,) in SRT. In this post, we will generate them manually from a Deepgram transcription result to see the technique, and then use the brand-new Node.js SDK methods (available from v1.1.0) to make it even easier.
Before We Start
You will need:
Node.js installed on your machine - download it here.
A Deepgram API Key - get one here.
A hosted audio file URL to transcribe - you can use https://static.deepgram.com/examples/deep-learning-podcast-clip.wav if you don't have one.
Create a new directory and navigate to it with your terminal. Run npm init -y to create a package.json file, and then install the Deepgram Node.js SDK with npm install @deepgram/sdk.
Set Up Dependencies
Create an index.js file, open it in your code editor, and require and initialize the dependencies:
const fs = require('fs')
const { Deepgram } = require('@deepgram/sdk')

// Initialize the SDK with your Deepgram API Key
const deepgram = new Deepgram('YOUR_API_KEY')
Get Transcript
To get timestamps for the phrases in your caption files, you need to ask Deepgram to include utterances (a chain of words - or, more simply, a phrase).
deepgram.transcription
  .preRecorded(
    {
      url: 'https://static.deepgram.com/examples/deep-learning-podcast-clip.wav',
    },
    { punctuate: true, utterances: true }
  )
  .then((response) => {
    // Following code here
  })
  .catch((error) => {
    console.log({ error })
  })
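For reference, each item in response.results.utterances includes a start and end time (in seconds) alongside the transcript. Based on the example captions above, the first utterance looks roughly like this - an abridged sketch, as the real response includes extra fields such as confidence and words:

// Abridged sketch of one item in response.results.utterances
{
  start: 0.219,
  end: 3.512,
  transcript: "yeah, as much as it's worth celebrating"
}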
Create a Write Stream
Once you open a writable stream, you can insert text directly into your file. When you do this, pass in the a flag so that any time you write data to the stream, it is appended to the end. Inside of the .then() block, create a stream with the filename that matches the format you're generating (use one or the other - declaring both const stream variables in the same scope will throw an error):
// WebVTT Filename
const stream = fs.createWriteStream('output.vtt', { flags: 'a' })
// SRT Filename
const stream = fs.createWriteStream('output.srt', { flags: 'a' })
Write Captions
The WebVTT and SRT formats are very similar, and each requires a block of text per utterance.
WebVTT
stream.write('WEBVTT\n\n')

for (let i = 0; i < response.results.utterances.length; i++) {
  const utterance = response.results.utterances[i]
  const start = new Date(utterance.start * 1000).toISOString().substr(11, 12)
  const end = new Date(utterance.end * 1000).toISOString().substr(11, 12)
  stream.write(`${i + 1}\n${start} --> ${end}\n- ${utterance.transcript}\n\n`)
}
Deepgram provides seconds back as a number (15.4 means 15.4 seconds), but both formats require times as HH:MM:SS.milliseconds, and taking a 12-character slice from position 11 of a Date().toISOString() string will achieve this for us.
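As a quick sanity check of the conversion (plain Node.js, nothing Deepgram-specific):

// 15.4 seconds -> milliseconds -> ISO string -> HH:MM:SS.mmm slice
const iso = new Date(15.4 * 1000).toISOString() // '1970-01-01T00:00:15.400Z'
const timestamp = iso.substr(11, 12) // '00:00:15.400'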
Using the SDK
Replace the above code with this single line:
stream.write(response.toWebVTT())
SRT
for (let i = 0; i < response.results.utterances.length; i++) {
  const utterance = response.results.utterances[i]
  const start = new Date(utterance.start * 1000)
    .toISOString()
    .substr(11, 12)
    .replace('.', ',')
  const end = new Date(utterance.end * 1000)
    .toISOString()
    .substr(11, 12)
    .replace('.', ',')
  stream.write(`${i + 1}\n${start} --> ${end}\n${utterance.transcript}\n\n`)
}
Differences? There is no WEBVTT line at the top, the millisecond separator is a comma, and there is no hyphen before the utterance.
Using the SDK
Replace the above code with this single line:
stream.write(response.toSRT())
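Putting it all together, a complete index.js using the SDK helper might look like this - a sketch assembled from the snippets above (swap in output.srt and response.toSRT() for SRT output):

const fs = require('fs')
const { Deepgram } = require('@deepgram/sdk')

const deepgram = new Deepgram('YOUR_API_KEY')

deepgram.transcription
  .preRecorded(
    {
      url: 'https://static.deepgram.com/examples/deep-learning-podcast-clip.wav',
    },
    { punctuate: true, utterances: true }
  )
  .then((response) => {
    const stream = fs.createWriteStream('output.vtt', { flags: 'a' })
    stream.write(response.toWebVTT())
    stream.end() // close the stream once the captions are written
  })
  .catch((error) => {
    console.log({ error })
  })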
One Line to Captions
We actually implemented .toWebVTT() and .toSRT() straight into the Node.js SDK while writing this post. Now, it's easier than ever to create valid caption files automatically with Deepgram. If you have any questions, please feel free to reach out on Twitter - we're @DeepgramAI.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.