The core principle of video translation software is to recognize text from the speech in a video, translate that text into the target language, dub the translated text, and finally embed the dubbed audio and the subtitle text into the video.
The first step is speech recognition, and its accuracy directly affects the subsequent translation and dubbing.
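The four stages can be sketched as a simple chain in which each stage consumes the output of the previous one. This is only an illustration of the flow described above, not the program's actual code: the function name translate_video and the four stage callables are hypothetical placeholders.

```python
from typing import Callable


def translate_video(
    video_path: str,
    recognize: Callable[[str], str],       # hypothetical: speech -> source text
    translate: Callable[[str], str],       # hypothetical: source text -> target text
    dub: Callable[[str], str],             # hypothetical: target text -> dubbed audio path
    embed: Callable[[str, str, str], str], # hypothetical: video + audio + text -> output path
) -> str:
    """Run the four stages in order and return the output video path."""
    source_text = recognize(video_path)
    # Any recognition error is carried into translation and dubbing,
    # which is why recognition accuracy matters so much.
    target_text = translate(source_text)
    dubbed_audio = dub(target_text)
    return embed(video_path, dubbed_audio, target_text)
```

Because every later stage only sees what recognition produced, a mistake made in the first stage cannot be repaired downstream.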
Faster Mode
This is the recommended mode. It uses a model converted from OpenAI's open-source Whisper and, as the name suggests, recognizes speech faster without reducing accuracy.
After selecting the faster mode, you can choose which model to use on the right. The built-in default, tiny, is the smallest model and also the least accurate.
Model size increases in the order tiny, base, small, medium, large, and recognition accuracy increases along with it.
For Chinese videos, choose at least the medium model. Models can be downloaded from https://pyvideotrans.com/model
Models with the .en suffix and models whose names start with distil can only be used for English videos.
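The model rules above can be written down as a small check. This is a minimal sketch rather than anything the program itself exposes: the helper names are hypothetical, and the language codes "en" and "zh" are assumed for illustration.

```python
# Model sizes from smallest to largest, as listed above.
MODEL_SIZES = ["tiny", "base", "small", "medium", "large"]


def is_english_only(model_name: str) -> bool:
    """Models with the .en suffix or a distil prefix are English-only."""
    return model_name.endswith(".en") or model_name.startswith("distil")


def size_rank(model_name: str) -> int:
    """Return the position of the model's size in MODEL_SIZES."""
    base = model_name[:-3] if model_name.endswith(".en") else model_name
    for part in base.split("-"):
        if part in MODEL_SIZES:
            return MODEL_SIZES.index(part)
    raise ValueError(f"no known size in model name: {model_name!r}")


def model_warnings(model_name: str, language: str) -> list:
    """Collect warnings for a model/language pair ("en"/"zh" are assumed codes)."""
    warnings = []
    if is_english_only(model_name) and language != "en":
        warnings.append("model can only be used for English videos")
    if language == "zh" and size_rank(model_name) < MODEL_SIZES.index("medium"):
        warnings.append("at least the medium model is recommended for Chinese")
    return warnings
```

For example, model_warnings("base.en", "zh") reports both problems, while model_warnings("medium", "zh") reports none.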
To the right of the model selector there is another drop-down box. It shows Whole Recognition by default and also offers Equal Segmentation. Unless you have a special requirement, leave it on Whole Recognition. If you need the audio cut into equal-length parts, for example so that each subtitle is 10 seconds long, choose Equal Segmentation and set the segment duration, in seconds, under Menu -- Tools/Advanced Settings -- Advanced Settings -- VAD parameters.
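To make the idea of equal segmentation concrete, the sketch below computes the time ranges for a given total duration and segment length. It only illustrates the arithmetic; the function name is hypothetical and the program's own splitting logic may differ, for instance in how it treats the final, shorter segment.

```python
import math


def equal_segments(total_seconds: float, segment_seconds: float) -> list:
    """Split [0, total_seconds] into consecutive ranges of segment_seconds each.

    The last range is shorter when the total is not an exact multiple.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    count = math.ceil(total_seconds / segment_seconds)
    segments = []
    for i in range(count):
        start = i * segment_seconds
        end = min(start + segment_seconds, total_seconds)
        segments.append((start, end))
    return segments
```

With a 25-second audio track and a 10-second segment duration, this yields three ranges: 0 to 10, 10 to 20, and 20 to 25.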
To speed up a task on Windows or Linux with an NVIDIA graphics card, install and configure the CUDA and cuDNN environment and then enable CUDA acceleration, which significantly improves execution speed.
CUDA and cuDNN installation tutorial: https://pyvideotrans.com/gpu.html
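As a rough pre-check before enabling CUDA acceleration, the sketch below tests only the two conditions named above: the operating system and the presence of an NVIDIA driver tool. It is an assumption-laden heuristic, not the program's own detection. Finding nvidia-smi on the PATH suggests an NVIDIA driver is installed, but it does not prove that CUDA and cuDNN are set up correctly. The function name pick_device is hypothetical.

```python
import platform
import shutil


def pick_device(system=None, has_nvidia_tool=None) -> str:
    """Return "cuda" when CUDA acceleration is worth trying, else "cpu".

    Both arguments default to values probed from the current machine.
    """
    if system is None:
        system = platform.system()  # "Windows", "Linux", "Darwin", ...
    if has_nvidia_tool is None:
        # Heuristic only: nvidia-smi ships with the NVIDIA driver.
        has_nvidia_tool = shutil.which("nvidia-smi") is not None
    if system in ("Windows", "Linux") and has_nvidia_tool:
        return "cuda"
    return "cpu"
```

A result of "cuda" only means the prerequisites look plausible; the installation tutorial linked above is still needed to complete the setup.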
Automatic Language Detection
After version v2.59, an "Auto Detect" option is available in the original language drop-down box. If you don't know what language is spoken, or the language is not among the 24 supported languages, select "Auto Detect" and the program will try to identify the spoken language automatically.
Avoid this option where possible, especially when there is no clear speech in the first 30 seconds of the video: automatic detection judges the language from the first 30 seconds of audio and then applies that language to the entire video. Also note that languages that sound alike but are written differently cannot be told apart reliably and may be identified as any one of them. For example, a Chinese video may be identified as either Simplified or Traditional Chinese, seemingly at random.
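Since detection relies on the first 30 seconds, one way to judge whether "Auto Detect" is risky for a given video is to measure how much speech falls inside that window. The sketch below assumes you already have a list of non-overlapping speech intervals in seconds, for example from a voice activity detector. The function names and the 5-second threshold are hypothetical choices for illustration, not values used by the program.

```python
def speech_seconds_in_window(speech_intervals, window=30.0) -> float:
    """Sum the speech time that overlaps [0, window] seconds.

    speech_intervals is a list of non-overlapping (start, end) pairs.
    """
    total = 0.0
    for start, end in speech_intervals:
        overlap = min(end, window) - max(start, 0.0)
        if overlap > 0:
            total += overlap
    return total


def auto_detect_is_risky(speech_intervals, window=30.0, min_speech=5.0) -> bool:
    """Flag videos with too little speech in the detection window.

    min_speech is an arbitrary illustrative threshold.
    """
    return speech_seconds_in_window(speech_intervals, window) < min_speech
```

A video whose first words are spoken at the 28-second mark has only 2 seconds of speech inside the window, so it is flagged; in that case it is safer to set the original language manually.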