Subtitles can be really easy or really hard to get right, depending on how you're producing them. If you have a person transcribing the audio, the chances of getting the subtitles correct are pretty high! On the other hand, if you are using ML/AI or any other automated technology to do the transcribing, your results may not always be correct.
Automated Subtitling is a minefield
When I read about Netflix using AI for creating subtitles, it reminded me of a class I took on Pattern Recognition while I was in grad school. The professor was explaining how a research group at a certain mobile giant (struggling badly now) built ML models using several thousand hours of speech.
When they used the model for live testing on their actual customers, it performed horribly. The team could not understand what had gone wrong, until they realized the obvious.
Get this – for some odd reason, the R&D group had used their own voice samples to train the model and had overlooked a simple fact – the entire team was made up of researchers from China who had come over to the US to study and work. Their sample set was completely biased because of their accents and pronunciations, and as a result, the model failed in real-world testing.
I see a similar problem when it comes to automated subtitling. You need to have your software trained extremely well to distinguish accents and dialects.
I purposely did not say “language”, because I feel it is better to have a different model per language. But, accents and dialects will play a massive role when it comes to training and testing such ML models.
What’s involved in Automated Subtitling?
The main thing to consider is that there are two jobs that need to be accomplished for automated subtitling:
- transcription (e.g., English speech turned into English text printed on the screen)
- translation (e.g., from English to Spanish or Hindi)
Both present huge technical challenges. Transcription has to cope with the many ways different people speak the same language. You could have an excellent algorithm that transcribes speech really well, and then a person with a speech impairment such as a lisp comes along and confuses it.
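To make the two-job split concrete, here is a minimal sketch of a transcribe-then-translate pipeline. Everything in it is a made-up stand-in: real systems use large speech-recognition and machine-translation models, not the tiny lookup tables below, and the clip IDs, phrases, and Spanish output are purely illustrative.

```python
# Hypothetical two-stage subtitling pipeline: transcription, then translation.
# The lookup tables are toy stand-ins for real ASR and MT models.

def transcribe(audio_clip_id: str) -> str:
    """Stand-in for a speech-to-text (ASR) model; input is a fake clip ID."""
    fake_asr_output = {"clip_001": "hey you, beat it!"}
    return fake_asr_output.get(audio_clip_id, "")

def translate(text: str, target_lang: str) -> str:
    """Stand-in for a machine-translation model."""
    phrase_table = {("hey you, beat it!", "es"): "¡Oye tú, lárgate!"}
    return phrase_table.get((text, target_lang), text)

# Run both stages on a clip to produce a Spanish subtitle.
english_text = transcribe("clip_001")
subtitle = translate(english_text, "es")
print(subtitle)
```

Each stage can fail independently: a mis-transcribed word ruins the translation input, and a literal translation ruins an idiom, which is exactly the problem discussed next.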
Machine translation is also a huge research topic in its own right, and we'll see why it is so hard.
What can go wrong with Machine Translation?
Suppose a character says to another – “Hey you – beat it!”
What does this seemingly innocent four-word sentence mean?
Is he telling the other person to go beat/hit something? NO! He's telling the other person to go away or go do something else – the Cambridge Dictionary defines "beat it" as an informal way of telling someone to leave immediately.
But how are you going to teach your AI/ML model this nuance? Transcribing the text is one thing; translating it while accounting for cultural differences and usage of speech is a completely different ball game, and I am glad that big research teams like the one at Netflix are taking a stab at it.
Netflix’s idea for Automated Subtitle Translation
Netflix recently published a paper that describes what they do, and it's very interesting once you get past the initial jargon. Nvidia has also blogged about it here, because Netflix used Nvidia GPUs to crunch through the data.
Netflix is using a novel approach here. Instead of directly translating a sentence, they first “simplify” the sentence using a corpus of words and phrases and translate that instead. You can read more about Text Simplification here.
In our example, Netflix’s algorithm would simplify “beat it” to something like “go away” and then translate it into other languages. Clever idea, right?
Netflix's simplification model, which they call the automatic pre-processing model (APP), is applied to English-language sources, and the output of this step is fed into the machine-translation step.
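The simplify-then-translate idea can be sketched as follows. This is a toy illustration, not Netflix's actual APP model: their system is a learned model, while here a small hand-written phrase table rewrites idioms into plainer English before a (equally hypothetical) translation lookup runs.

```python
# Toy simplify-then-translate pipeline: idioms are rewritten into plainer
# English first, so the translation step never sees "beat it" literally.

SIMPLIFY_TABLE = {          # hypothetical idiom -> plain-English rewrites
    "beat it": "go away",
    "break a leg": "good luck",
}

TRANSLATE_TABLE = {         # hypothetical English -> Spanish entries
    "go away": "vete",
    "good luck": "buena suerte",
}

def simplify(text: str) -> str:
    """Replace known idioms with plainer phrasing (the 'APP'-like step)."""
    for idiom, plain in SIMPLIFY_TABLE.items():
        text = text.replace(idiom, plain)
    return text

def translate_to_spanish(text: str) -> str:
    """Phrase lookup standing in for a real machine-translation model."""
    return TRANSLATE_TABLE.get(text, text)

print(translate_to_spanish(simplify("beat it")))
```

The design point is that simplification shrinks the space of inputs the translator has to handle: a literal translation of "beat it" would be nonsense in Spanish, but "go away" translates cleanly.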
You can read the paper or the Nvidia article for further details. But, essentially, they use a simplification step followed by the translation step. I haven’t found details on their transcribing process. If you know anything about it, do let me know.
Is Netflix the only one in the Automated Subtitling game?
There are many companies in the automated subtitling space.
This is a very interesting space, especially in a country like India, which has 22 official languages (each with its own script, dictionary, dialects, and accents) and a young, media-hungry, growing population. It's a fantastic space to keep an eye on!
Krishna Rao Vijayanagar
I’m Dr. Krishna Rao Vijayanagar, founder of OTTVerse. I have a Ph.D. in Video Compression from the Illinois Institute of Technology, and I have worked on Video Compression (AVC, HEVC, MultiView Plus Depth), ABR streaming, and Video Analytics (QoE, Content & Audience, and Ad) for several years.
I hope to use my experience and love for video streaming to bring you information and insights into the OTT universe.