Use Google API to create subtitles

emuler
Posts: 4617
Joined: Sun Apr 02, 2006 1:00 am

Post by emuler »   0 likes

This is way too much work for someone like me, :shock: :eyecrazy but I guess if we're desperate enough someone here might be inclined to give it a shot.

https://medium.com/searce/generate-srt- ... 2b2f1da3bd
Generate SRT File (Subtitles) using Google Cloud’s Speech-to-Text API - by Darshan Majithiya

If you followed the blog mentioned above, you will have realized that, even in the age of automation, creating an SRT file involves a lot of manual labor. Can we somehow reduce that effort?
Use pre-trained APIs: several pre-trained APIs can now do this job efficiently. They also take less time to set up, are easy to learn, and are cost-efficient. We will use one of them, Google Cloud’s Speech-to-Text API, to generate subtitles.

Let’s Get Started!
Pre-requisites
You need to have Git, Python 3.7 and ffmpeg installed on your system.
You need to have a Google Cloud project with billing enabled. Follow Creating and managing projects to set this up.
You also need a service account with permission to use the Speech-to-Text API. Download the service account credentials as credentials.json. Follow Creating and managing service accounts to set this up.

Setting Up the Environment
Enable the Speech-to-Text API in your Google Cloud project. From the navigation bar, go to APIs & Services > Library > Cloud Speech-to-Text API and click Enable.
Now run the commands below from your terminal.

Clone the repository —
git clone git@github.com:darshan-majithiya/Generate-SRT-File-using-Google-Cloud-s-Speech-to-Text-API.git

Install the requirements —
cd Generate-SRT-File-using-Google-Cloud-s-Speech-to-Text-API
pip install -r requirements.txt

Move your credentials.json here and then export the credentials —
export GOOGLE_APPLICATION_CREDENTIALS="credentials.json"
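
Before calling the API it can help to confirm that the export actually took effect. The following is only a minimal sketch using the standard library; the helper name resolve_credentials is hypothetical and is not part of the article's repository or of any Google library.

```python
import os


def resolve_credentials(env=None):
    """Return the absolute path named by GOOGLE_APPLICATION_CREDENTIALS.

    Raises RuntimeError if the variable is unset or the file is missing.
    resolve_credentials is an illustrative helper, not a Google API.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"credentials file not found: {path}")
    return os.path.abspath(path)
```

Calling resolve_credentials() with no argument reads the real environment; passing a dict makes it easy to try out without touching your shell.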

Data Preparation
I’m a Suits fan so I’ll use this video for the demonstration. Feel free to use any other video.
I’ll download this video using the pytube3 module.
Now that the video is downloaded, let’s move ahead and get the various attributes of this video.

Getting the Number of Channels, Bit Rate, and Sample Rate of the Video
We need these attribute values to convert the video to audio, which the Speech-to-Text API will then consume.
Channels: the number of separate audio signals in the recording, for example 1 for mono and 2 for stereo.
Sampling Rate: how many times per second the sound is sampled.
Bit Rate: the number of bits encoded (or transmitted/received) per second. A higher bit rate combined with a higher sampling rate generally means better audio quality.
I’ve used the pydub module to extract these attributes.
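
The extraction code itself is not shown in the post, so here is a hedged sketch of how the three attributes relate and how they might be read. It assumes pydub's mediainfo helper returns a dict with "channels", "bit_rate" and "sample_rate" keys holding string values; the function names pcm_bit_rate, parse_audio_attributes and video_attributes are made up for illustration.

```python
def pcm_bit_rate(sample_rate_hz, bit_depth, channels):
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bit_depth * channels


def parse_audio_attributes(info):
    """Return (channels, bit_rate, sample_rate) from a mediainfo-style dict.

    Assumes the values arrive as strings, hence the int() conversions.
    """
    return int(info["channels"]), int(info["bit_rate"]), int(info["sample_rate"])


def video_attributes(path):
    """Probe a media file. Needs pydub installed and ffprobe on the PATH."""
    from pydub.utils import mediainfo  # lazy import: third-party dependency

    return parse_audio_attributes(mediainfo(path))
```

For uncompressed CD-style audio, pcm_bit_rate(44100, 16, 2) gives 1411200 bits per second, which shows why both the sampling rate and the channel count feed into the bit rate.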

Converting Video to Audio & Uploading to GCS
Convert the video to audio for the Speech-to-Text API and store it on GCS (Google Cloud Storage), because for audio > 1 min and size ≥ 10 MB the API requires the audio to be stored in a bucket.
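
The conversion and upload code is not included in the post either. The sketch below is one possible way to do it, not the article's own implementation: it shells out to ffmpeg to write 16-bit PCM (which matches the LINEAR16 encoding used later) and uploads with the google-cloud-storage client. The function names and the bucket/blob names you pass in are assumptions.

```python
import subprocess


def ffmpeg_extract_command(video_path, audio_path, channels, sample_rate):
    """Build the ffmpeg argument list that extracts 16-bit PCM audio."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                    # drop the video stream
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
        "-ac", str(channels),     # keep the source channel count
        "-ar", str(sample_rate),  # keep the source sample rate
        audio_path,
    ]


def convert_video_to_audio(video_path, audio_path, channels, sample_rate):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    cmd = ffmpeg_extract_command(video_path, audio_path, channels, sample_rate)
    subprocess.run(cmd, check=True)


def gcs_uri(bucket_name, blob_name):
    """Return the gs:// URI that the Speech-to-Text API expects."""
    return f"gs://{bucket_name}/{blob_name}"


def upload_to_gcs(bucket_name, local_path, blob_name):
    """Upload a local file to an existing bucket and return its gs:// URI."""
    from google.cloud import storage  # lazy import: third-party dependency

    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket_name, blob_name)
```

The URI returned by upload_to_gcs is what you would pass as storage_uri to the transcription function.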

Transcribing the Audio
Configuration
Before diving into transcribing the audio, let’s talk about the configuration required.

config = {
    "language_code": "en-US",
    "sample_rate_hertz": int(sample_rate),
    "encoding": enums.RecognitionConfig.AudioEncoding.LINEAR16,
    "audio_channel_count": int(channels),
    "enable_word_time_offsets": True,
    "model": "video",
    "enable_automatic_punctuation": True,
}

language_code: The language used in your video/audio. You can check all the supported languages here.
sample_rate_hertz: The sample rate of the video/audio, which we extracted using the pydub module.
encoding: The Speech-to-Text API supports only specific audio encodings. You can find all the supported encodings here.
audio_channel_count: The number of channels used by the video/audio.
enable_word_time_offsets: If True, gives the start time and end time of each word.
model: The model that the API will use for transcription. Our original source is a video, so we’ve used the “video” model, but various other models are available; you can check them here.
enable_automatic_punctuation: If True, the API also tries to add punctuation.

Other than these, there are many more features available, such as Speaker Diarization, Word-level confidence, Separate Transcription for each channel, etc. The Speech-to-Text API also supports transcription for live streaming.

Transcribe
The long_running_recognize function below transcribes the audio extracted from the video and returns a response object that contains the transcripts, the confidence, and the words, with a start_time and end_time for each word.

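# Written against the pre-2.0 google-cloud-speech client, which still ships `enums`.
from google.cloud import speech_v1
from google.cloud.speech_v1 import enums


def long_running_recognize(storage_uri, channels, sample_rate):
    """Transcribe the audio at storage_uri (a gs:// URI) and return the API response."""
    client = speech_v1.SpeechClient()

    config = {
        "language_code": "en-US",
        "sample_rate_hertz": int(sample_rate),
        "encoding": enums.RecognitionConfig.AudioEncoding.LINEAR16,
        "audio_channel_count": int(channels),
        "enable_word_time_offsets": True,
        "model": "video",
        "enable_automatic_punctuation": True,
    }
    audio = {"uri": storage_uri}

    # Start the asynchronous job; keyword arguments keep the call unambiguous.
    operation = client.long_running_recognize(config=config, audio=audio)

    print("Waiting for operation to complete...")
    # Block until the transcription has finished.
    response = operation.result()
    return response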
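
To see what the response holds before turning it into subtitles, the words can be flattened into plain tuples. This is a minimal sketch that assumes the response shape used elsewhere in this post (results, then alternatives, then words whose start_time and end_time carry seconds and nanos fields); the helper names to_seconds and word_timings are hypothetical.

```python
def to_seconds(duration):
    """Convert a Duration-like object (seconds + nanos) to float seconds."""
    return duration.seconds + duration.nanos / 1e9


def word_timings(response):
    """Yield (word, start_seconds, end_seconds) for each recognized word.

    Only the first (highest-ranked) alternative of each result is used,
    and results without alternatives are skipped.
    """
    for result in response.results:
        if not result.alternatives:
            continue
        for info in result.alternatives[0].words:
            yield info.word, to_seconds(info.start_time), to_seconds(info.end_time)
```

A loop such as "for word, start, end in word_timings(response): print(word, start, end)" is a quick sanity check that the timings look plausible.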

SRT File Generation
To generate the SRT file, I’ve used the srt module. The code below converts the response object from the Speech-to-Text API into an SRT-format string. It looks a bit complicated because every variable is needed to keep the subtitles in sync with the audio in a media player.

import datetime

import srt


def subtitle_generation(speech_to_text_response, bin_size=3):
    """Group words into time bins so the subtitles stay in sync with the audio.
    Here, bin_size = 3 means each bin covers 3 seconds.
    All the words of a result that fall within the same 3-second interval are grouped together."""
    transcriptions = []
    index = 0
 
    for result in speech_to_text_response.results:
        try:
            if result.alternatives[0].words[0].start_time.seconds:
                # bin start -> for first word of result
                start_sec = result.alternatives[0].words[0].start_time.seconds 
                start_microsec = result.alternatives[0].words[0].start_time.nanos * 0.001
            else:
                # bin start -> For First word of response
                start_sec = 0
                start_microsec = 0 
            end_sec = start_sec + bin_size # bin end sec
            
            # for last word of result
            last_word_end_sec = result.alternatives[0].words[-1].end_time.seconds
            last_word_end_microsec = result.alternatives[0].words[-1].end_time.nanos * 0.001
            
            # bin transcript
            transcript = result.alternatives[0].words[0].word
            
            index += 1 # subtitle index

            for i in range(len(result.alternatives[0].words) - 1):
                try:
                    word = result.alternatives[0].words[i + 1].word
                    word_start_sec = result.alternatives[0].words[i + 1].start_time.seconds
                    word_start_microsec = result.alternatives[0].words[i + 1].start_time.nanos * 0.001 # 0.001 converts nanoseconds -> microseconds
                    word_end_sec = result.alternatives[0].words[i + 1].end_time.seconds
                    word_end_microsec = result.alternatives[0].words[i + 1].end_time.nanos * 0.001

                    if word_end_sec < end_sec:
                        transcript = transcript + " " + word
                    else:
                        previous_word_end_sec = result.alternatives[0].words[i].end_time.seconds
                        previous_word_end_microsec = result.alternatives[0].words[i].end_time.nanos * 0.001
                        
                        # append bin transcript
                        transcriptions.append(srt.Subtitle(index, datetime.timedelta(0, start_sec, start_microsec), datetime.timedelta(0, previous_word_end_sec, previous_word_end_microsec), transcript))
                        
                        # reset bin parameters
                        start_sec = word_start_sec
                        start_microsec = word_start_microsec
                        end_sec = start_sec + bin_size
                        transcript = result.alternatives[0].words[i + 1].word
                        
                        index += 1
                except IndexError:
                    pass
            # append transcript of last transcript in bin
            transcriptions.append(srt.Subtitle(index, datetime.timedelta(0, start_sec, start_microsec), datetime.timedelta(0, last_word_end_sec, last_word_end_microsec), transcript))
            index += 1
        except IndexError:
            pass
    
    # turn transcription list into subtitles
    subtitles = srt.compose(transcriptions)
    return subtitles

The IndexError exception handling takes care of results that contain no words, which happens for stretches of the audio where there’s a long silence.
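
The binning idea can be easier to follow on plain tuples. The sketch below is a simplified restatement, not the article's code: it works on (word, start, end) tuples with float seconds instead of the API objects, and the name group_words is hypothetical.

```python
def group_words(words, bin_size=3.0):
    """Group (word, start, end) tuples into cues of roughly bin_size seconds.

    A word joins the open bin while its end time is before bin_start + bin_size;
    otherwise the bin is closed at the previous word's end and a new one starts.
    Returns a list of (start, end, text) tuples.
    """
    cues = []
    current = []      # words collected in the open bin
    bin_start = None  # start time of the open bin
    for word, start, end in words:
        if current and end >= bin_start + bin_size:
            text = " ".join(w for w, _, _ in current)
            cues.append((bin_start, current[-1][2], text))
            current = []
        if not current:
            bin_start = start
        current.append((word, start, end))
    if current:
        # flush the last, partially filled bin
        text = " ".join(w for w, _, _ in current)
        cues.append((bin_start, current[-1][2], text))
    return cues
```

Each cue ends at the end time of its last word rather than at the bin boundary, so a subtitle does not linger on screen through a pause.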

You can then save the subtitles as—

# Write as UTF-8 so non-ASCII transcripts come out the same on every platform.
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(subtitles)
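
For anyone curious what actually ends up in the file, the SRT layout is simple enough to produce by hand: a running index, a "start --> end" line with HH:MM:SS,mmm timestamps, the text, and a blank line between cues. The sketch below only illustrates that layout with the standard library and is not a replacement for the srt module; srt_timestamp and compose_srt are hypothetical names.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the HH:MM:SS,mmm stamp used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def compose_srt(cues):
    """Render (start, end, text) cues as SRT text, numbering from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # cues are separated by one empty line
    return "\n".join(blocks)
```

For example, srt_timestamp(3661.5) produces "01:01:01,500".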

The entire code for this article can be found on GitHub.
https://github.com/darshan-majithiya/Ge ... o-Text-API

Conclusion
Whether you want to perform speaker diarization, transcribe customer calls to analyze your employees’ performance, or generate subtitles for the viewers of your live stream, you can rely on Google Cloud’s Speech-to-Text API to do it efficiently and with minimal effort.
ghost
Site Admin
Posts: 8460
Joined: Sun Mar 07, 2004 1:00 am

Re: Use Google API to create subtitles

Post by ghost »   0 likes

Sure it sounds very interesting.

If I should get a lot of time (which I really don't have atm :( ), I will give it a try ;)

Thanks for letting us know, emuler! :thumbsup
kev
Site Admin
Posts: 3632
Joined: Tue Jan 17, 2006 1:00 am

Re: Use Google API to create subtitles

Post by kev »   0 likes

ghost wrote:Sure it sounds very interesting...
I'll second that! :shock:

You put a LOT of work into your investigation and post! Many MANY thanks for bringing that info to FLM. (I'm still blown away at all the details you posted!!) :onfire

With many [I'd even venture a MAJORITY] of our films coming from non-English speaking countries, it's a great thing [for someone like me] to have this knowledge available to use. Like ghost said, it's going to definitely take time to work with and put to use. But it's here now, thanks to you!!

Thanks again, emuler! :clap

kev.
kev
Site Admin
Posts: 3632
Joined: Tue Jan 17, 2006 1:00 am

Re: Use Google API to create subtitles

Post by kev »   1 likes

BAM!!

Thread is 'sticky-ized'!

Thanks emuler for the work you put into gathering this information and laying it out in a fairly simplified [to the layman] fashion.

This will make this info easier for interested members to find! :thumbsup

kev.
David32441
Posts: 799
Joined: Thu Jul 22, 2021 2:48 am

Re: Use Google API to create subtitles

Post by David32441 »   0 likes

As an alternative, and for anyone wanting to try translation ... a far longer, slower way is using:
Aegisub - the best subtitling software out there, as it simultaneously displays the subtitles, video and audio waveform (for precise start/stop timing).
Microsoft Translator (phone app) - in my experience doing 'Inseperables' it was superior to other free translation phone apps, and I presume it uses some AI. For lines it doesn't catch, you get the benefit of replaying them at different volume levels, or even trying other apps. Knowing the characters' names, the places and the story also helps you tell when the translation AI gets it right or has screwed up. Often a character's name gets translated into an ordinary word, ruining the whole sentence. These kinds of issues are best detected by the subtitler - that's what I found on several occasions. E.g. the name Petia kept coming up, and I knew it meant the main character, so only when the translation came out with Petia or something close did I know it had worked!
Triela
Posts: 418
Joined: Sun Jul 05, 2020 3:42 pm

Re: Use Google API to create subtitles

Post by Triela »   0 likes

Two questions:
1. Would this work on other languages with different alphabets, let's say Japanese and Russian? Because there are a lot of Japanese movies out there without subtitles....

emuler wrote: You need to have a Google Cloud project with billing enabled.
2. Do I understand you correctly that you would need to pay Google every time you translate a movie?
Night457
Global Moderator
Posts: 5222
Joined: Sat Dec 28, 2019 3:44 pm

Re: Use Google API to create subtitles

Post by Night457 »   1 likes

I have not used it myself, only read about it and seen the results, but I think Whisper has rendered this thread obsolete.