Video translation software works in four steps: it recognizes text from the spoken audio in the video, translates that text into the target language, dubs the translated text, and finally embeds the dubbed audio and the subtitles into the video.

The first step is to recognize text from the spoken audio in the video, and the accuracy of recognition directly affects the subsequent translation and dubbing.
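The four steps can be sketched as a simple pipeline. This is a minimal illustration only: the `Subtitle` type, the `translate_video` function, and the four stage callables are hypothetical names introduced here and are not part of the software's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def translate_video(
    video_path: str,
    recognize: Callable[[str], List[Subtitle]],
    translate: Callable[[str], str],
    dub: Callable[[List[Subtitle]], str],
    embed: Callable[[str, str, List[Subtitle]], str],
) -> str:
    """Run the four stages in order and return whatever `embed` returns."""
    # 1. Speech recognition: spoken audio -> timed source-language text.
    source_subs = recognize(video_path)
    # 2. Translation: keep the timings, replace the text.
    target_subs = [Subtitle(s.start, s.end, translate(s.text)) for s in source_subs]
    # 3. Dubbing: synthesize speech for the translated text.
    dubbed_audio = dub(target_subs)
    # 4. Embedding: combine the dubbed audio and subtitles with the video.
    return embed(video_path, dubbed_audio, target_subs)
```

Because every later stage consumes the output of `recognize`, an error in the recognized text carries through to the translation and the dubbing.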

Faster Mode

Recommended. This mode uses a model converted from OpenAI's open-source Whisper. As the name suggests, recognition is faster with no loss of accuracy.

After selecting the faster mode, you can choose the model to use on the right. The built-in default, tiny, is the smallest model and also the least accurate.

Model size increases in the order tiny -- base -- small -- medium -- large, and recognition accuracy increases with it.

For Chinese videos, choose at least the medium model. Models can be downloaded from https://pyvideotrans.com/model

Models with the .en suffix and models starting with distil can only be used for English videos.
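The model selection rules above can be expressed as a small check. This is a sketch only, not the program's own logic: the function names and the language codes `"en"` and `"zh"` are assumptions made for illustration.

```python
MODEL_SIZES = ["tiny", "base", "small", "medium", "large"]  # smallest to largest


def is_english_only(model_name: str) -> bool:
    """True for models with the .en suffix or a name starting with distil."""
    return model_name.endswith(".en") or model_name.startswith("distil")


def size_rank(model_name: str) -> int:
    """Index of the model's base size in MODEL_SIZES, or -1 if unknown."""
    name = model_name
    if name.startswith("distil-"):
        name = name[len("distil-"):]
    # "medium.en" -> "medium", "large-v3" -> "large"
    base = name.split(".")[0].split("-")[0]
    return MODEL_SIZES.index(base) if base in MODEL_SIZES else -1


def check_model(model_name: str, language: str) -> list:
    """Return a list of warnings for this model and language pairing."""
    warnings = []
    if is_english_only(model_name) and language != "en":
        warnings.append(f"{model_name} can only be used for English videos")
    rank = size_rank(model_name)
    if language == "zh" and 0 <= rank < MODEL_SIZES.index("medium"):
        warnings.append("for Chinese videos, choose at least the medium model")
    return warnings
```

For example, `check_model("tiny", "zh")` returns one warning about model size, while `check_model("medium", "zh")` returns an empty list.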

To the right of the model selector is a drop-down box that shows Whole Recognition and also offers Equal Segmentation. Unless you have a special requirement, keep Whole Recognition. If you need to cut the audio into equal-length parts, for example so that each subtitle is 10 s long, choose Equal Segmentation and set the segment duration, in seconds, in the VAD parameters section under Menu -- Tools/Advanced Settings -- Advanced Settings.
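As a rough illustration of what equal-length segmentation means, the sketch below splits a given audio duration into consecutive fixed-length time ranges. The function `equal_segments` is a hypothetical helper written for this example; the software itself may place its cuts differently, for instance by adjusting them to pauses in speech.

```python
import math


def equal_segments(total_seconds: float, segment_seconds: float) -> list:
    """Split [0, total_seconds] into consecutive chunks of segment_seconds.

    The last chunk is shorter when the total is not an exact multiple.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    total = float(total_seconds)
    seg = float(segment_seconds)
    count = math.ceil(total / seg)
    # Compute each boundary from its index to avoid accumulating float error.
    return [(i * seg, min((i + 1) * seg, total)) for i in range(count)]


print(equal_segments(25, 10))
# [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```

With a 10 s segment duration, a 25 s clip yields two full segments and one 5 s remainder.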

To speed up the task on Windows or Linux with an NVIDIA graphics card, install and configure the CUDA and cuDNN environment, then enable CUDA acceleration. This significantly improves execution speed.

CUDA and cuDNN installation tutorial: https://pyvideotrans.com/gpu.html
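Before enabling CUDA acceleration, a quick check like the following can tell you whether the machine is even a candidate. This is a heuristic sketch using only the Python standard library, with hypothetical function names. It assumes that finding the `nvidia-smi` tool (installed with the NVIDIA driver) on the PATH indicates an NVIDIA card; it does not verify that CUDA or cuDNN are installed correctly.

```python
import platform
import shutil


def cuda_acceleration_plausible(system: str, has_nvidia_driver: bool) -> bool:
    """True on Windows or Linux when an NVIDIA driver appears to be present."""
    return system in ("Windows", "Linux") and has_nvidia_driver


def detect() -> bool:
    # nvidia-smi on PATH is only a hint that an NVIDIA card and driver exist.
    # It says nothing about the CUDA toolkit or cuDNN versions.
    has_driver = shutil.which("nvidia-smi") is not None
    return cuda_acceleration_plausible(platform.system(), has_driver)


if __name__ == "__main__":
    print("CUDA acceleration may be available:", detect())
```

A result of `False` means CUDA acceleration is unlikely to work on this machine; a result of `True` still requires following the installation tutorial.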

Automatic Language Detection

After version v2.59, an "Auto Detect" option was added to the original language drop-down box. If you do not know the spoken language, or it is not among the 24 supported languages, select "Auto Detect" and the program will try to identify the spoken language automatically.

Avoid this option where possible, especially when there is no clear speech in the first 30 seconds of the video: automatic detection uses only the first 30 seconds of audio to decide the language for the entire video. Also note that languages that sound similar but are written differently cannot be told apart reliably, and any one of them may be chosen. For example, a Chinese video may be identified as either Simplified or Traditional Chinese.
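To judge whether automatic detection is likely to work for a given video, you can estimate how much speech falls inside the first 30 seconds. The sketch below is illustrative only: it assumes you already have a list of non-overlapping speech intervals (start, end) in seconds from some other tool, and the 5-second threshold is an arbitrary value chosen for the example, not a figure from the software.

```python
DETECTION_WINDOW = 30.0  # seconds of audio used for language detection


def speech_in_window(speech_intervals, window=DETECTION_WINDOW):
    """Total seconds of speech that fall inside the first `window` seconds."""
    total = 0.0
    for start, end in speech_intervals:
        clipped_start = max(0.0, start)
        clipped_end = min(end, window)
        if clipped_end > clipped_start:
            total += clipped_end - clipped_start
    return total


def auto_detect_is_risky(speech_intervals, min_speech=5.0):
    """True when the window holds less than `min_speech` seconds of speech.

    min_speech is an assumed threshold for illustration only.
    """
    return speech_in_window(speech_intervals) < min_speech
```

For a video whose first speech runs from 28 s to 40 s, only 2 s of speech fall inside the window, so `auto_detect_is_risky([(28.0, 40.0)])` returns `True` and selecting the language manually is the safer choice.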