Video translation software works in four steps: it transcribes the speech in the video to text, translates that text into the target language, dubs the translated text, and finally embeds the dubbed audio and subtitles into the video.
Speech recognition is the first step, so its accuracy directly affects the quality of the translation and dubbing that follow.
openai-whisper (Local) Speech Recognition Channel
This mode uses the official open-source Whisper model from OpenAI. Compared to faster-whisper, it is slower but offers slightly higher accuracy.
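As a rough illustration of what this channel does under the hood, the sketch below transcribes a file with the openai-whisper Python package and renders the result as SRT subtitles. It assumes the package and ffmpeg are installed; the helper names (format_timestamp, segments_to_srt, transcribe_to_srt) are hypothetical and are not part of the software described here.

```python
from typing import Optional


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(media_path: str, model_name: str = "large-v3",
                      language: Optional[str] = None) -> str:
    """Transcribe a media file with openai-whisper and return SRT text."""
    # Imported lazily so the formatting helpers work without the package.
    import whisper

    model = whisper.load_model(model_name)  # downloaded on first use
    result = model.transcribe(media_path, language=language)
    return segments_to_srt(result["segments"])
```

Calling transcribe_to_srt("video.mp4", language="en") would return the subtitle text; passing the language explicitly mirrors the advice to specify the spoken language rather than rely on auto-detection.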

Model Selection
The first time you use a model, it is downloaded automatically from OpenAI-Whisper's official CDN.
tiny --> base --> small --> medium --> large-v3-turbo --> large-v1 --> large-v2 --> large-v3
Model size increases from left to right, and with it recognition accuracy and the demand on RAM and VRAM. Choose large-v3-turbo or a larger model; large-v3 offers the best overall performance.
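The ordering and the recommendation can be expressed as a small lookup. This is only a sketch: the constant and function names are hypothetical, and the list simply mirrors the order given above.

```python
# Model names in the order listed above, smallest to largest.
MODELS_BY_SIZE = [
    "tiny", "base", "small", "medium",
    "large-v3-turbo", "large-v1", "large-v2", "large-v3",
]

# The smallest model the text recommends.
RECOMMENDED_MINIMUM = "large-v3-turbo"


def meets_recommendation(model_name: str) -> bool:
    """Return True if the model is large-v3-turbo or larger.

    Raises ValueError for a name that is not in MODELS_BY_SIZE.
    """
    position = MODELS_BY_SIZE.index(model_name)
    return position >= MODELS_BY_SIZE.index(RECOMMENDED_MINIMUM)
```

For example, meets_recommendation("medium") is False, while meets_recommendation("large-v2") is True.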
Optimal Settings for This Channel
For the best speech recognition results, refer to the following settings:
- Choose the large-v3 model (ensure your computer has more than 16GB RAM or 10GB VRAM). If this is not feasible, try using large-v1 or large-v3-turbo.
- Clearly specify the spoken language to match the language used in the video's audio.
- In the Menu - Tools - Advanced Options - Speech Recognition Parameters section: set "Minimum speech duration in milliseconds" to 1000, and set "Maximum speech duration in seconds" to 5 or greater. Do not select "Whisper pre-segment audio".
Note: If you need dubbing and the dubbing role is clone (i.e., cloning the original voice timbre), it is highly recommended to set "Minimum speech duration in milliseconds" to 3000 and "Maximum speech duration in seconds" to 10. This is because voice cloning automatically uses the original audio segment corresponding to the subtitle duration as a reference, and most dubbing channels require this reference audio to be between 3 and 10 seconds long; otherwise, dubbing may fail. Additionally, you should select both "Whisper pre-segment audio" and "Merge short subtitles with adjacent ones" to ensure subtitle durations fall within the 3-10 second range.
- If the original speech is unclear or noisy, select "Noise reduction".
- If you are not using the clone role and prefer shorter subtitles (e.g., for vertical videos), you can reduce "Maximum speech duration in seconds", for example, to 3 or 2. If dubbing is involved, you may also select "Secondary recognition".
Secondary recognition: When dubbing is enabled and single subtitle embedding is chosen, selecting secondary recognition means that after dubbing is complete, the dubbed audio file will be re-transcribed to generate shorter subtitles embedded in the video, ensuring precise alignment between subtitles and dubbed audio.
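To make the 3-10 second requirement for voice cloning concrete, the sketch below merges short subtitle segments into their neighbours and reports any that still fall outside the range. It illustrates the idea only and is not the software's actual merging logic; the Segment class and both function names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def merge_short_segments(segments, min_s=3.0, max_s=10.0):
    """Greedily merge a too-short segment with the one that follows it."""
    merged = []
    for seg in segments:
        if merged:
            prev = merged[-1]
            combined = seg.end - prev.start
            # Merge only while the previous segment is still shorter than
            # min_s and the merged result would not exceed max_s.
            if prev.duration < min_s and combined <= max_s:
                merged[-1] = Segment(prev.start, seg.end,
                                     f"{prev.text} {seg.text}")
                continue
        merged.append(Segment(seg.start, seg.end, seg.text))
    return merged


def out_of_range(segments, min_s=3.0, max_s=10.0):
    """Return segments whose duration is outside [min_s, max_s]."""
    return [s for s in segments if not (min_s <= s.duration <= max_s)]
```

Note that this greedy pass only merges forward, so a short segment at the very end can remain under 3 seconds; out_of_range makes such cases visible before they are used as cloning reference audio.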
CUDA Acceleration
On Windows and Linux, if you have an NVIDIA GPU, you can install and configure the CUDA and cuDNN environment to enable CUDA acceleration, which significantly improves execution speed.
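A quick way to check whether CUDA is usable from Python is to ask PyTorch, which openai-whisper runs on. The sketch below assumes PyTorch may or may not be installed and falls back to the CPU in either failure case; pick_device and load_model_on_best_device are hypothetical helper names.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU path to use.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_model_on_best_device(model_name: str = "large-v3"):
    """Load a Whisper model on the GPU if available, else on the CPU."""
    # Imported lazily; requires the openai-whisper package.
    import whisper

    return whisper.load_model(model_name, device=pick_device())
```

If pick_device() returns "cpu" on a machine with an NVIDIA GPU, the CUDA and cuDNN setup (or a CPU-only PyTorch build) is the likely cause.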

