Use Google API to create subtitles

emuler · Post by **emuler** » Sat Nov 14, 2020 7:30 am 0 likes

This is way too much work for someone like me, but I guess if we're desperate enough someone here might be inclined to give it a shot.

https://medium.com/searce/generate-srt- ... 2b2f1da3bd
Generate SRT File (Subtitles) using Google Cloud’s Speech-to-Text API - by Darshan Majithiya

If you followed the mentioned blog, you would realize that in the age of automation, creating an SRT file involves a lot of manual labor. Can we somehow minimize these efforts?
Use pre-trained APIs: there are now multiple pre-trained APIs that can do this job efficiently. As added benefits, they take less time to set up, are easy to learn, and are cost-efficient. We will use one such API to generate subtitles: Google Cloud’s Speech-to-Text API.

Let’s Get Started!
Pre-requisites
You need to have Git, Python 3.7 and ffmpeg installed on your system.
You need to have a Google Cloud project with billing enabled. Follow Creating and managing projects to set this up.
You also need a service account with permission to use the Speech-to-Text API. Download the service account credentials as credentials.json. Follow Creating and managing service accounts to set this up.

Setting Up the Environment
Enable the Speech-to-Text API in your Google Cloud project. From the navigation bar, go to APIs & Services > Library > Cloud Speech-to-Text API and click Enable.
Now run the commands below from your terminal.

Clone the repository —
git clone git@github.com:darshan-majithiya/Generate-SRT-File-using-Google-Cloud-s-Speech-to-Text-API.git

Install the requirements —
cd Generate-SRT-File-using-Google-Cloud-s-Speech-to-Text-API
pip install -r requirements.txt

Move your credentials.json here and then export the credentials —
export GOOGLE_APPLICATION_CREDENTIALS="credentials.json"
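
Before going further, it can save some head-scratching to confirm that the exported path really points at a service account key. The snippet below is only a sketch and not part of the article's repository: `describe_credentials` is a hypothetical helper name, and it assumes the standard JSON key file layout with `type` and `client_email` fields.

```python
import json
import os


def describe_credentials(path=None):
    """Return (type, client_email) read from a service account key file.

    If no path is given, fall back to GOOGLE_APPLICATION_CREDENTIALS.
    """
    path = path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    with open(path, encoding="utf-8") as f:
        info = json.load(f)
    # A service account key normally has "type": "service_account".
    return info.get("type"), info.get("client_email")
```

Calling `describe_credentials()` with no argument after the export should give back `service_account` as the first value if the right file was picked up.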

Data Preparation
I’m a Suits fan so I’ll use this video for the demonstration. Feel free to use any other video.
I’ll download this video using the pytube3 module.
Now that the video is downloaded, let’s move ahead and get the various attributes of this video.

Getting the Number of Channels, Bit Rate, and Sample Rate of the Video
We need these attribute values to convert the video to audio, which the Speech-to-Text API will then consume.
Channels: the number of separate signal paths the sound is carried on (1 for mono, 2 for stereo).
Sampling Rate: how many times per second the sound is sampled.
Bit Rate: the number of bits encoded (or transmitted/received) per second. A higher bit rate together with a higher sampling rate generally means better audio quality.
I’ve used the pydub module to extract these attributes.
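
To make the three attributes concrete, here is a small sketch that reads them from a WAV file. Note that this is not the pydub approach the article uses: it is a stand-in built on Python's standard `wave` module, it only works for uncompressed PCM WAV files, and `wav_attributes` is a hypothetical name.

```python
import wave


def wav_attributes(path):
    """Return channels, sample rate (Hz) and bit rate (bits/s) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        bits_per_sample = wav.getsampwidth() * 8
    # For uncompressed PCM: bit rate = samples per second * bits per sample * channels.
    bit_rate = sample_rate * bits_per_sample * channels
    return {"channels": channels, "sample_rate": sample_rate, "bit_rate": bit_rate}
```

For a stereo, 16-bit, 44.1 kHz file this works out to 44100 * 16 * 2 = 1,411,200 bits per second, which shows how the bit rate follows from the other two values for uncompressed audio.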

Converting Video to Audio & Uploading to GCS
Convert the video to audio for the Speech-to-Text API and store it on GCS (Google Cloud Storage), because for audio > 1 min and size ≥ 10 MB the API requires the audio to be stored in a bucket.
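
The article's own conversion code is not reproduced in this post, so the following is only a sketch of how the step could look. The function names are hypothetical, the ffmpeg flags are standard ones (`-vn`, `-acodec`, `-ac`, `-ar`), and the upload assumes the google-cloud-storage package is installed and that the bucket already exists.

```python
import os
import subprocess


def build_ffmpeg_command(video_path, audio_path, channels, sample_rate):
    """Build an ffmpeg argv that extracts 16-bit PCM WAV audio from a video."""
    return [
        "ffmpeg",
        "-y",                          # overwrite the output file if it exists
        "-i", video_path,              # input video
        "-vn",                         # drop the video stream
        "-acodec", "pcm_s16le",        # 16-bit PCM, matching the LINEAR16 encoding
        "-ac", str(int(channels)),     # keep the source channel count
        "-ar", str(int(sample_rate)),  # keep the source sample rate
        audio_path,
    ]


def gcs_uri(bucket_name, blob_name):
    """Return the gs:// URI form used to point the API at a stored file."""
    return f"gs://{bucket_name}/{blob_name}"


def convert_and_upload(video_path, audio_path, channels, sample_rate, bucket_name):
    """Run ffmpeg, upload the resulting audio and return its gs:// URI."""
    subprocess.run(
        build_ffmpeg_command(video_path, audio_path, channels, sample_rate),
        check=True,
    )
    # Imported lazily so the helpers above work without google-cloud-storage.
    from google.cloud import storage

    blob_name = os.path.basename(audio_path)
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(audio_path)
    return gcs_uri(bucket_name, blob_name)
```

The returned URI is what gets passed on as `storage_uri` in the transcription step.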

Transcribing the Audio
Configuration
Before diving into transcribing the audio, let’s talk about the configuration required.

config = {
"language_code": "en-US",
"sample_rate_hertz": int(sample_rate),
"encoding": enums.RecognitionConfig.AudioEncoding.LINEAR16,
"audio_channel_count": int(channels),
"enable_word_time_offsets": True,
"model": "video",
"enable_automatic_punctuation":True
}

language_code: The language used in your video/audio. You can check all the supported languages here.
sample_rate_hertz: Sample rate of the video/audio, which we extracted using the pydub module.
encoding: The Speech-to-Text API supports only specific audio encodings. You can find all the supported encodings here.
audio_channel_count: The number of channels used by the video/audio.
enable_word_time_offsets: If True, gives the start time and end time of each word.
model: The model that the API will use for transcription. Our original source is a video, so we’ve used the “video” model. There are various other models available; you can check them here.
enable_automatic_punctuation: If True, the API also tries to add punctuation.

Other than these, there are many other features available, such as speaker diarization, word-level confidence, separate transcription for each channel, etc. The Speech-to-Text API also supports transcription for live streaming.

Transcribe
The long_running_recognize function below transcribes the audio file stored on GCS and returns a response object which contains the transcripts, the confidence, the words, and the start_time & end_time of each word.

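# Written against the 1.x releases of google-cloud-speech, which provide the enums module used below.
from google.cloud import speech_v1
from google.cloud.speech_v1 import enums


def long_running_recognize(storage_uri, channels, sample_rate):
    """Transcribe the audio at storage_uri (a gs:// URI) and return the API response."""
    client = speech_v1.SpeechClient()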

    config = {
        "language_code": "en-US",
        "sample_rate_hertz": int(sample_rate),
        "encoding": enums.RecognitionConfig.AudioEncoding.LINEAR16,
        "audio_channel_count": int(channels),
        "enable_word_time_offsets": True,
        "model": "video",
        "enable_automatic_punctuation":True
    }
    audio = {"uri": storage_uri}

    operation = client.long_running_recognize(config, audio)

    print(u"Waiting for operation to complete...")
    response = operation.result()
    return response

SRT File Generation
To generate the SRT file, I’ve used the srt module. The code below converts the response object from the Speech-to-Text API into an SRT-formatted string. The code looks a bit complicated because every variable is needed to display the subtitles in sync with the audio when they are used in a media player.

import datetime

import srt


def subtitle_generation(speech_to_text_response, bin_size=3):
    """Convert a Speech-to-Text response into an SRT-formatted string.
    We define a bin of time to display the words in sync with the audio.
    Here, bin_size = 3 means each bin is 3 secs long.
    All the words within a 3-sec interval of a result are grouped together."""
    transcriptions = []
    index = 0
 
    for result in speech_to_text_response.results:
        try:
            if result.alternatives[0].words[0].start_time.seconds:
                # bin start -> for first word of result
                start_sec = result.alternatives[0].words[0].start_time.seconds 
                start_microsec = result.alternatives[0].words[0].start_time.nanos * 0.001
            else:
                # bin start -> For First word of response
                start_sec = 0
                start_microsec = 0 
            end_sec = start_sec + bin_size # bin end sec
            
            # for last word of result
            last_word_end_sec = result.alternatives[0].words[-1].end_time.seconds
            last_word_end_microsec = result.alternatives[0].words[-1].end_time.nanos * 0.001
            
            # bin transcript
            transcript = result.alternatives[0].words[0].word
            
            index += 1 # subtitle index

            for i in range(len(result.alternatives[0].words) - 1):
                try:
                    word = result.alternatives[0].words[i + 1].word
                    word_start_sec = result.alternatives[0].words[i + 1].start_time.seconds
                    word_start_microsec = result.alternatives[0].words[i + 1].start_time.nanos * 0.001 # 0.001 to convert nano -> micro
                    word_end_sec = result.alternatives[0].words[i + 1].end_time.seconds
                    word_end_microsec = result.alternatives[0].words[i + 1].end_time.nanos * 0.001

                    if word_end_sec < end_sec:
                        transcript = transcript + " " + word
                    else:
                        previous_word_end_sec = result.alternatives[0].words[i].end_time.seconds
                        previous_word_end_microsec = result.alternatives[0].words[i].end_time.nanos * 0.001
                        
                        # append bin transcript
                        transcriptions.append(srt.Subtitle(index, datetime.timedelta(0, start_sec, start_microsec), datetime.timedelta(0, previous_word_end_sec, previous_word_end_microsec), transcript))
                        
                        # reset bin parameters
                        start_sec = word_start_sec
                        start_microsec = word_start_microsec
                        end_sec = start_sec + bin_size
                        transcript = result.alternatives[0].words[i + 1].word
                        
                        index += 1
                except IndexError:
                    pass
            # append transcript of last transcript in bin
            transcriptions.append(srt.Subtitle(index, datetime.timedelta(0, start_sec, start_microsec), datetime.timedelta(0, last_word_end_sec, last_word_end_microsec), transcript))
            index += 1
        except IndexError:
            pass
    
    # turn transcription list into subtitles
    subtitles = srt.compose(transcriptions)
    return subtitles

The IndexError handling covers results that contain no words, which happens for stretches of the audio where there’s a long silence.
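
The binning idea inside subtitle_generation is easier to see in isolation. The sketch below is a simplified illustration, not a replacement for the function above: it works on plain `(word, start, end)` tuples with times in float seconds, it ignores result boundaries, and `group_words` is a hypothetical name.

```python
def group_words(words, bin_size=3.0):
    """Group (word, start, end) tuples into cues spanning at most bin_size seconds.

    A word joins the current cue while its end time falls before
    cue start + bin_size; otherwise it opens a new cue.
    """
    cues = []
    for word, start, end in words:
        if cues and end < cues[-1]["start"] + bin_size:
            cues[-1]["text"] += " " + word
            cues[-1]["end"] = end
        else:
            cues.append({"start": start, "end": end, "text": word})
    return cues


words = [("a", 0.0, 1.0), ("b", 1.0, 2.5), ("c", 2.5, 3.5), ("d", 3.5, 4.0)]
for cue in group_words(words):
    print(cue)
```

With the sample words, "a" and "b" end before the 3-second mark and share the first cue, while "c" ends at 3.5 s and therefore starts a second cue that "d" then joins.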

You can then save the subtitles as follows:

# Write as UTF-8 so non-ASCII characters in the transcript are saved correctly.
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(subtitles)
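
For anyone curious what ends up inside the .srt file: each subtitle is a numbered block with a `start --> end` line and the text, separated from the next block by a blank line. The srt module takes care of this formatting; the sketch below only illustrates the layout using the standard library, and both function names are hypothetical.

```python
import datetime


def srt_timestamp(delta):
    """Format a timedelta as HH:MM:SS,mmm, the timestamp style used in SRT files."""
    total_ms = delta // datetime.timedelta(milliseconds=1)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def srt_block(index, start, end, text):
    """Render one subtitle cue as an SRT block, ending with a blank line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n"


print(srt_block(1, datetime.timedelta(seconds=0),
                datetime.timedelta(seconds=2, milliseconds=500), "Hello"), end="")
```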

The entire code for this article can be found on GitHub.
https://github.com/darshan-majithiya/Ge ... o-Text-API

Conclusion
Whether you want to perform speaker diarization, transcribe customer calls to analyze your employees’ performance, or generate subtitles for your live stream, you can rely on Google Cloud’s Speech-to-Text API to do it efficiently and with minimal effort.

Post by **ghost** » Sat Nov 14, 2020 11:09 am 0 likes

Sure it sounds very interesting.

If I get a lot of time (which I really don't have atm), I will give it a try.

Thanks for letting us know, emuler!

Post by **kev** » Sat Nov 14, 2020 5:13 pm 0 likes

ghost wrote: Sure it sounds very interesting...

I'll second that!

You put a LOT of work into your investigation and post! Many MANY thanks for bringing that info to FLM. (I'm still blown away at all the details you posted!!)

With many [I'd even venture a MAJORITY] of our films coming from non-English speaking countries, it's a great thing [for someone like me] to have this knowledge available to use. Like ghost said, it's going to definitely take time to work with and put to use. But it's here now, thanks to you!!

Thanks again, emuler!

kev.

Post by **kev** » Wed Nov 18, 2020 4:50 am 1 likes

BAM!!

Thread is 'sticky-ized'!

Thanks emuler for the work you put into gathering and laying out in a fairly simplified [to the layman] fashion, this information.

This will make this info easier for interested members to find!

kev.

David32441 · Post by **David32441** » Wed Feb 09, 2022 2:33 pm 0 likes

As an alternative, and for anyone wanting to try translation ... a far longer, slower way is using:
Aegisub - best subtitling software out there as it simultaneously displays the subtitles, video and audio waveform (for precise start/stop timing).
Microsoft Translator (phone app) - in my experience doing 'Inseperables', it was superior to other free translation phone apps; I presume it uses some AI. For lines it doesn't catch, you get the benefit of replaying them at different volume levels, or even trying other apps. Also, knowing the characters' names, the places and the story helps you know when the translation AI gets it right or has screwed up. Often a character name will get translated into a word, ruining the whole sentence. These kinds of issues are best detected by the subtitler - that's what I found on several occasions. E.g. the name Petia kept coming up, but I knew it meant the main character, so only when the translation came out with Petia or something close did I know it had worked!

Triela · Post by **Triela** » Sun Oct 15, 2023 9:51 pm 0 likes

two questions:
1. would this work on other languages with different alphabets, let's say Japanese and Russian? Because there are a lot of Japanese movies out there without subtitles....

emuler wrote: You need to have a Google Cloud project with billing enabled.

2. Do I understand you correctly that you would need to pay Google for every time you translate a movie?

Post by **Night457** » Sun Oct 15, 2023 11:35 pm 1 likes

I have not used it myself, only read about it and seen the results, but I think Whisper has rendered this thread obsolete.
