
Thanks to the rapid advancement of AI technology, video translation, once a challenging task, has become significantly more achievable, although the results may not yet be perfect.

Video translation is more complex than text translation, but its core is still text-based translation (direct speech-to-speech translation technology does exist, but it is not yet mature and has limited practical use).

The workflow of video translation can be broadly divided into the following stages (a rough code skeleton follows the list):

  1. Speech Recognition: Extracting human voices from the video and converting them into text;

  2. Text Translation: Translating the extracted text into the target language;

  3. Speech Synthesis: Generating the target language's speech based on the translated text;

  4. Synchronization Adjustment: Ensuring that the dubbed audio and subtitle files are synchronized with the video content;

  5. Embedding Processing: Embedding the translated subtitles and dubbing into the video to generate a new video file.
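
Taken together, the stages form a simple pipeline. The following Python skeleton is only an illustration of that flow; every function in it is a hypothetical placeholder for the concrete tools discussed below.

def translate_video(video_path, target_lang):
    segments = recognize_speech(video_path)                 # 1. speech -> text with timestamps
    translated = translate_segments(segments, target_lang)  # 2. text translation
    dubbing = synthesize_speech(translated)                 # 3. text -> target-language audio
    aligned = synchronize(dubbing, translated, video_path)  # 4. align dubbing/subtitles with video
    return embed(video_path, aligned, translated)           # 5. mux subtitles and dubbing into a new file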

Detailed Discussion of Each Stage:

Speech Recognition

The goal of this step is to accurately convert the speech content in the video into text, along with timestamps. There are various implementation methods, including using OpenAI's Whisper model, Alibaba's FunASR series models, or directly calling online speech recognition APIs, such as Baidu Speech Recognition.

When choosing a model, you can select anything from the smallest (tiny) to the largest (large-v3) according to your needs. Larger models recognize more accurately, but they also run more slowly and need more memory.
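
As a concrete example, here is a minimal sketch using the open-source whisper package (pip install openai-whisper); the file name and model size are assumptions:

import whisper

model = whisper.load_model("large-v3")   # pick a size that fits your hardware
result = model.transcribe("video.mp4")   # whisper extracts the audio track itself via ffmpeg
for seg in result["segments"]:           # each segment carries start/end timestamps
    print(f"{seg['start']:.2f} --> {seg['end']:.2f} {seg['text']}")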

Text Translation

Once the text is obtained, translation can be performed. Note that subtitle translation differs from ordinary text translation: each translated line must stay matched to its timestamps.

When using traditional translation engines (such as Baidu Translate or Tencent Translate), send only the subtitle text lines for translation. Avoid sending line numbers and timestamp lines, which would eat into per-request character limits or corrupt the subtitle format.

Ideally, the translated subtitles should have the same number of lines as the original subtitles, without any blank lines.
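
A minimal sketch of this send-text-only approach, using the Python srt package (pip install srt); translate_lines is a hypothetical stand-in for whatever engine you call:

import srt

def translate_srt(srt_text, translate_lines):
    subs = list(srt.parse(srt_text))
    # send only the text lines; numbering and timestamps never leave this function
    translated = translate_lines([s.content for s in subs])
    for sub, new_text in zip(subs, translated):
        sub.content = new_text
    return srt.compose(subs)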

However, translation engines, AI translation in particular, tend to merge lines intelligently based on context: when the next line contains only a few characters or one or two words and reads as a continuation of the previous sentence, the engine will very likely fold it into that previous line.

The merged result reads more fluently, but the translated subtitles no longer match the original ones line for line, leaving blank lines behind.
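
One simple mitigation, sketched under the same assumptions as above, is to detect the blank lines and re-send their source lines one at a time, restoring the 1:1 mapping; translate_one is again a hypothetical placeholder:

def fill_blank_lines(src_lines, dst_lines, translate_one):
    # lines the engine merged into a neighbor come back empty;
    # re-translating them individually keeps line counts equal
    return [dst if dst.strip() else translate_one(src)
            for src, dst in zip(src_lines, dst_lines)]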

Speech Synthesis

After translation, dubbing can be generated based on the translated subtitles.

Currently, EdgeTTS is a virtually unlimited and free dubbing channel. By sending subtitles line by line to EdgeTTS, you can obtain dubbed audio files, which are then merged into a complete audio file.
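
A minimal sketch with the edge-tts Python package (pip install edge-tts); the voice and file names are assumptions, and the per-line clips still have to be concatenated, with silence inserted to match the timestamps, into the full dubbing track:

import asyncio
import edge_tts

async def dub_line(text, outfile, voice="en-US-AriaNeural"):
    # one request per subtitle line; the rate can be tuned later for syncing
    await edge_tts.Communicate(text, voice).save(outfile)

asyncio.run(dub_line("Hello, world.", "line_0001.mp3"))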

Synchronization Adjustment

Ensuring that the subtitles, audio, and video are synchronized is the biggest challenge in video translation.

Different languages inevitably take different amounts of time to say the same thing, which causes synchronization problems. The usual remedies are speeding up the dubbed audio, stretching the video segment, and exploiting the silent gaps between subtitles to get the best possible alignment.

If no adjustment is made and the dubbing is simply laid onto the original subtitle timestamps, mismatches are guaranteed: the subtitle disappears while the voice is still speaking, or the person on screen has finished and fallen silent while the audio keeps playing.

To solve this problem, there are two relatively simple methods:

One is to speed up the dubbed audio, forcing it to finish within the subtitle's time interval. The drawback is that the speech rate then fluctuates from line to line, which makes for a poor listening experience.

The other is to slow down the video segment covered by that subtitle, i.e. stretch it until its length matches the new dubbing. The drawback is that the picture can appear to stutter.

Both methods can also be combined: speed the audio up a little while stretching the video a little, so that neither the audio is sped up too much nor the video stretched too far.

Depending on the video, you can also exploit the silent gap between two subtitles. First check whether the dubbed line fits at normal speed when it is allowed to spill into the gap that follows its subtitle; if it fits, no speed-up is needed at all and the result sounds better. The drawback is that the person on screen has already stopped speaking while the dubbing is still playing.
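
A sketch of the bookkeeping behind these strategies, assuming each line's dubbing duration is already known; the resulting factor can then be applied with ffmpeg's atempo audio filter (older ffmpeg builds limit atempo to 0.5-2.0 per instance, so chain two for larger factors):

def speed_factor(audio_dur, sub_start, sub_end, next_sub_start):
    # time available to this line: its own interval plus the silent
    # gap before the next subtitle begins
    available = (sub_end - sub_start) + (next_sub_start - sub_end)
    if audio_dur <= available:
        return 1.0  # fits at normal speed, no adjustment needed
    return audio_dur / available

ffmpeg -y -i line_0001.mp3 -filter:a atempo=1.25 line_0001_fast.mp3

The complementary video stretch can likewise be sketched with the setpts filter, here slowing a clip to 125% of its length:

ffmpeg -y -i clip.mp4 -an -filter:v "setpts=1.25*PTS" clip_slow.mp4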

Synthesis Output

After completing the above steps, the translated subtitles and dubbing are embedded into the original video, which can be easily achieved using tools such as ffmpeg. The final generated video file completes the translation process.

ffmpeg -y -i original_video.mp4 -i dubbed_audio.m4a -map 0:v:0 -map 1:a:0 -c:v libx264 -c:a aac -vf "subtitles=subtitles.srt" out.mp4

(The -map options ensure the video comes from the first input and the audio from the dubbing file, rather than whichever audio stream ffmpeg would otherwise pick by default.)
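
If switchable (soft) subtitles are wanted instead of burned-in ones, the subtitles filter can be dropped and the .srt muxed in as its own stream; this variant is a sketch for MP4 output, where mov_text is the subtitle codec the container expects:

ffmpeg -y -i original_video.mp4 -i dubbed_audio.m4a -i subtitles.srt -map 0:v:0 -map 1:a:0 -map 2:s:0 -c:v copy -c:a copy -c:s mov_text out_soft.mp4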

Difficult Problem to Solve: Multi-Speaker Recognition

Speaker role recognition, i.e. synthesizing a different voice for each character in the video, requires speaker diarization, which in turn usually requires specifying the number of speakers in advance. That is barely workable for ordinary one- or two-person dialogues, but for most videos the number of speakers cannot be known ahead of time, and the synthesized result is poor anyway. This aspect is therefore set aside for now.

Conclusion

The above is only the basic principle of the process. In practice, achieving good translation results involves many more details, such as normalizing the input format (mov/mp4/avi/mkv), splitting the video into an audio track and a silent video track, separating the human voice from background noise, post-processing batch translation results to speed up subtitle translation, re-splitting subtitles when blank lines appear, generating and embedding dual-language subtitles, and so on.
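
As one example of that pre-processing, splitting a video into an audio track and a silent video track can be sketched with ffmpeg (file names are assumptions):

ffmpeg -y -i input.mp4 -vn -c:a aac audio.m4a      # audio track only
ffmpeg -y -i input.mp4 -an -c:v copy silent.mp4    # video only, without re-encoding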

Through this series of steps, a video's content can be carried over into the target language. Technical challenges remain along the way, but as the technology continues to advance and be optimized, both the quality and the efficiency of video translation can be expected to improve further.