https://medium.com/searce/generate-srt- ... 2b2f1da3bd
Generate SRT File (Subtitles) using Google Cloud’s Speech-to-Text API - by Darshan Majithiya
If you followed the mentioned blog post, you will have noticed that, even in the age of automation, creating an SRT file involves a lot of manual work. Can we reduce that effort?
Use pre-trained APIs. Several pre-trained APIs can now do this job efficiently. As added benefits, they take little time to set up, are easy to learn, and are cost-efficient. We will use one such API to generate subtitles: Google Cloud’s Speech-to-Text API.
Let’s Get Started!
Pre-requisites
You need to have Git, Python 3.7 and ffmpeg installed on your system.
You need to have a Google Cloud project with billing enabled. Follow Creating and managing projects to set this up.
You also need a service account with permission to use the Speech-to-Text API. Download the service account credentials as credentials.json. Follow Creating and managing service accounts to set this up.
Setting Up the Environment
Enable the Speech-to-Text API in your Google Cloud project. From the navigation bar, go to APIs & Services > Library > Cloud Speech-to-Text API and click Enable.
Now run the commands below from your terminal.
Clone the repository:
git clone git@github.com:darshan-majithiya/Generate-SRT-File-using-Google-Cloud-s-Speech-to-Text-API.git
Install the requirements:
cd Generate-SRT-File-using-Google-Cloud-s-Speech-to-Text-API
pip install -r requirements.txt
Move your credentials.json into this directory, then export the credentials:
export GOOGLE_APPLICATION_CREDENTIALS="credentials.json"
Data Preparation
I’m a Suits fan so I’ll use this video for the demonstration. Feel free to use any other video.
I’ll download this video using the pytube3 module.
Now that the video is downloaded, let’s move ahead and get its attributes.
Getting the Number of Channels, Bit Rate, and Sample Rate of the Video
We need these values to convert the video to audio, which the Speech-to-Text API will then consume.
Channels: the number of independent audio signals in the recording (for example, 1 for mono and 2 for stereo).
Sampling Rate: how many times per second the sound is sampled, measured in hertz.
Bit Rate: the number of bits encoded (or transmitted) per second. A higher bit rate combined with a higher sampling rate generally means better audio quality.
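To make the relationship between these three attributes concrete, here is a small worked example. It assumes uncompressed PCM audio, where the bit rate follows directly from the sampling rate, the bit depth, and the number of channels; compressed formats such as AAC or MP3 do not follow this formula. The function name is only illustrative.

```python
def pcm_bit_rate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bit rate (bits per second) of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


# CD-quality stereo: 44,100 samples/s, 16 bits per sample, 2 channels
print(pcm_bit_rate(44100, 16, 2))  # 1411200
```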
I’ve used the pydub module to extract these attributes.
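The original snippet is not reproduced here, so the following is only a sketch of how this step might look. It assumes pydub's `mediainfo` helper, which calls ffprobe and returns a dictionary of string values; the function names `parse_media_attributes` and `video_info` are hypothetical.

```python
def parse_media_attributes(info):
    """Pick channels, bit rate and sample rate out of a mediainfo-style dict.

    The values arrive as strings, so they are converted to int here.
    """
    return int(info["channels"]), int(info["bit_rate"]), int(info["sample_rate"])


def video_info(video_path):
    """Return (channels, bit_rate, sample_rate) for a media file."""
    # Imported lazily so the parsing helper above works without pydub installed.
    # pydub is a third-party package and needs ffprobe (part of ffmpeg) on PATH.
    from pydub.utils import mediainfo

    return parse_media_attributes(mediainfo(video_path))
```

A call such as `channels, bit_rate, sample_rate = video_info("video.mp4")` would then supply the values used in the following steps, where `video.mp4` stands for whatever file you downloaded.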
Converting Video to Audio & Upload to GCS
Next, convert the video to audio for the Speech-to-Text API and store the result in a Google Cloud Storage (GCS) bucket. For audio longer than about 1 minute or larger than 10 MB, the API requires the audio to be read from a bucket.
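The conversion and upload code is not reproduced here, so the following is a sketch under stated assumptions rather than the author's implementation. It builds an ffmpeg command that drops the video stream and writes 16-bit PCM WAV (which corresponds to the LINEAR16 encoding), then uploads the file with the google-cloud-storage client. All function names, file names, and the bucket name are hypothetical. The bit rate is not passed to ffmpeg because, for uncompressed PCM, it is determined by the sample rate and channel count.

```python
import subprocess


def build_ffmpeg_command(video_path, audio_path, channels, sample_rate):
    """Build the ffmpeg argument list for extracting 16-bit PCM audio."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,        # input video
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ac", str(channels),    # number of audio channels
        "-ar", str(sample_rate), # sampling rate in Hz
        audio_path,
    ]


def video_to_audio(video_path, audio_path, channels, sample_rate):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(
        build_ffmpeg_command(video_path, audio_path, channels, sample_rate),
        check=True,
    )


def upload_to_gcs(bucket_name, source_path, blob_name):
    """Upload a local file to GCS and return its gs:// URI."""
    # Third-party package google-cloud-storage, imported lazily.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(source_path)
    return f"gs://{bucket_name}/{blob_name}"
```

The returned `gs://` URI is the kind of value the transcription step expects as its storage URI.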
Transcribing the Audio
Configuration
Before diving into transcribing the audio, let’s talk about the configuration required.
config = {
    "language_code": "en-US",
    "sample_rate_hertz": int(sample_rate),
    "encoding": enums.RecognitionConfig.AudioEncoding.LINEAR16,
    "audio_channel_count": int(channels),
    "enable_word_time_offsets": True,
    "model": "video",
    "enable_automatic_punctuation": True,
}
language_code: The language spoken in your video/audio. You can check all the supported languages here.
sample_rate_hertz: The sample rate of the video/audio, which we extracted using the pydub module.
encoding: The Speech-to-Text API supports only specific audio encodings. You can find all the supported encodings here.
audio_channel_count: The number of channels in the video/audio.
enable_word_time_offsets: If True, the response includes the start time and end time of each word.
model: The model the API uses for transcription. Our original source is a video, so we’ve used the “video” model. Various other models are available; you can check them here.
enable_automatic_punctuation: If True, the API also tries to add punctuation.
Other than these, there are many other features available, such as speaker diarization, word-level confidence, and separate transcription for each channel. The Speech-to-Text API also supports transcription of live streams.
Transcribe
The long_running_recognize function below transcribes the audio file stored in GCS and returns a response object that contains the transcripts, their confidence, and the words along with the start_time and end_time of each word.
from google.cloud import speech_v1
from google.cloud.speech_v1 import enums  # this module exists in google-cloud-speech 1.x


def long_running_recognize(storage_uri, channels, sample_rate):
    """Transcribe the audio at storage_uri (a gs:// URI) and return the API response."""
    client = speech_v1.SpeechClient()
    config = {
        "language_code": "en-US",
        "sample_rate_hertz": int(sample_rate),
        "encoding": enums.RecognitionConfig.AudioEncoding.LINEAR16,
        "audio_channel_count": int(channels),
        "enable_word_time_offsets": True,
        "model": "video",
        "enable_automatic_punctuation": True,
    }
    audio = {"uri": storage_uri}
    # Start the asynchronous transcription job
    operation = client.long_running_recognize(config, audio)
    print("Waiting for operation to complete...")
    # Block until the job has finished
    response = operation.result()
    return response
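To illustrate the shape of that response, here is a small sketch that walks it and yields each word with its start and end time in seconds. It assumes the `seconds`/`nanos` duration fields that the subtitle code in this post relies on. The helper names are hypothetical, and `fake_response` is a hand-built stand-in for illustration, not real API output.

```python
from types import SimpleNamespace


def duration_to_seconds(duration):
    """Convert a duration with .seconds and .nanos fields to float seconds."""
    return duration.seconds + duration.nanos / 1e9


def iter_word_timings(response):
    """Yield (word, start_seconds, end_seconds) for the top alternative of each result."""
    for result in response.results:
        if not result.alternatives:
            continue
        for w in result.alternatives[0].words:
            yield w.word, duration_to_seconds(w.start_time), duration_to_seconds(w.end_time)


def _word(text, start, end):
    # Helper for building the stand-in response; start/end are (seconds, nanos) pairs.
    return SimpleNamespace(
        word=text,
        start_time=SimpleNamespace(seconds=start[0], nanos=start[1]),
        end_time=SimpleNamespace(seconds=end[0], nanos=end[1]),
    )


# Hand-built stand-in that mimics the fields used above
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(
            transcript="Hello world",
            confidence=0.9,
            words=[
                _word("Hello", (0, 0), (0, 500_000_000)),
                _word("world", (0, 500_000_000), (1, 250_000_000)),
            ],
        )
    ])
])

for word, start, end in iter_word_timings(fake_response):
    print(f"{word}: {start:.2f}s -> {end:.2f}s")
```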
To generate the SRT file, I’ve used the srt module. The code below converts the response object from the Speech-to-Text API into an SRT-format string. It looks a bit complicated because each variable is needed to display the subtitles in sync with the audio in a media player.
import datetime

import srt


def subtitle_generation(speech_to_text_response, bin_size=3):
    """We define a bin of time period to display the words in sync with audio.
    Here, bin_size = 3 means each bin is of 3 secs.
    All the words in the interval of 3 secs in result will be grouped together."""
    transcriptions = []
    index = 0
    for result in speech_to_text_response.results:
        try:
            words = result.alternatives[0].words
            if words[0].start_time.seconds:
                # bin start -> first word of this result
                start_sec = words[0].start_time.seconds
                start_microsec = words[0].start_time.nanos * 0.001
            else:
                # bin start -> first word of the whole response
                start_sec = 0
                start_microsec = 0
            end_sec = start_sec + bin_size  # bin end sec
            # end time of the last word of this result
            last_word_end_sec = words[-1].end_time.seconds
            last_word_end_microsec = words[-1].end_time.nanos * 0.001
            # bin transcript
            transcript = words[0].word
            index += 1  # subtitle index
            for i in range(len(words) - 1):
                word = words[i + 1].word
                word_start_sec = words[i + 1].start_time.seconds
                # 0.001 converts nanoseconds -> microseconds
                word_start_microsec = words[i + 1].start_time.nanos * 0.001
                word_end_sec = words[i + 1].end_time.seconds
                if word_end_sec < end_sec:
                    # the word still fits in the current bin
                    transcript = transcript + " " + word
                else:
                    previous_word_end_sec = words[i].end_time.seconds
                    previous_word_end_microsec = words[i].end_time.nanos * 0.001
                    # append bin transcript
                    transcriptions.append(srt.Subtitle(
                        index,
                        datetime.timedelta(0, start_sec, start_microsec),
                        datetime.timedelta(0, previous_word_end_sec, previous_word_end_microsec),
                        transcript,
                    ))
                    # reset bin parameters
                    start_sec = word_start_sec
                    start_microsec = word_start_microsec
                    end_sec = start_sec + bin_size
                    transcript = word
                    index += 1
            # append the transcript of the last bin in this result
            transcriptions.append(srt.Subtitle(
                index,
                datetime.timedelta(0, start_sec, start_microsec),
                datetime.timedelta(0, last_word_end_sec, last_word_end_microsec),
                transcript,
            ))
            index += 1
        except IndexError:
            # skip results that have no alternatives or no words
            pass
    # turn transcription list into subtitles
    subtitles = srt.compose(transcriptions)
    return subtitles
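The binning idea is easier to see without the API objects. The sketch below is a simplified illustration, not the code above: it works on plain `(text, start, end)` tuples with float seconds, whereas the function above compares whole seconds. The name `group_words` and the sample data are made up for the example.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_sec, end_sec)
Cue = Tuple[float, float, str]   # (start_sec, end_sec, text)


def group_words(words: List[Word], bin_size: float = 3.0) -> List[Cue]:
    """Group consecutive words into cues that span at most bin_size seconds."""
    cues: List[Cue] = []
    if not words:
        return cues
    text, bin_start, prev_end = words[0]
    parts = [text]
    for text, start, end in words[1:]:
        if end < bin_start + bin_size:
            # the word ends inside the current bin: keep extending the cue
            parts.append(text)
        else:
            # close the current cue at the previous word's end, start a new bin
            cues.append((bin_start, prev_end, " ".join(parts)))
            bin_start, parts = start, [text]
        prev_end = end
    # flush the last open bin
    cues.append((bin_start, prev_end, " ".join(parts)))
    return cues


sample = [
    ("I", 0.0, 0.4), ("am", 0.4, 0.9), ("Harvey", 0.9, 2.8),
    ("Specter", 3.1, 3.9), ("now", 4.0, 4.5),
]
print(group_words(sample))
# [(0.0, 2.8, 'I am Harvey'), (3.1, 4.5, 'Specter now')]
```

A smaller bin size gives shorter, more frequent subtitles; a larger one puts more words on screen at once.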
You can then save the subtitles as follows:
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(subtitles)
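For readers who have not seen the SRT format before, each entry in the saved file consists of an index, a time range written as `HH:MM:SS,mmm --> HH:MM:SS,mmm`, the subtitle text, and a blank line. The sketch below reproduces that layout with the standard library only, as an illustration of the format; it is not how the srt module is implemented, and the function names are hypothetical.

```python
import datetime


def srt_timestamp(td: datetime.timedelta) -> str:
    """Format a timedelta as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = td // datetime.timedelta(milliseconds=1)  # whole milliseconds
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def srt_block(index: int, start: datetime.timedelta, end: datetime.timedelta, text: str) -> str:
    """Build one SRT entry: index, time range, text, blank line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n"


print(srt_block(
    1,
    datetime.timedelta(seconds=0),
    datetime.timedelta(seconds=2, microseconds=800_000),
    "I am Harvey",
))
```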
https://github.com/darshan-majithiya/Ge ... o-Text-API
Conclusion
Whether you want to perform speaker diarization, transcribe customer calls to analyze your employees’ performance, or generate subtitles for the viewers of your live stream, you can rely on Google Cloud’s Speech-to-Text API to do it efficiently and with minimal effort.