Video translation software works in four steps: recognize the speech in the video as text, translate that text into the target language, dub the translated text, and finally embed the dubbed audio and the subtitles into the video.

Speech recognition comes first, so its accuracy directly affects the translation and dubbing that follow.
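
The four steps can be sketched as a small Python pipeline. This is only an illustration of the stage order, not the software's actual code: the function `translate_video` and the four callables it accepts are hypothetical names, and the stand-ins in `demo` replace real recognition, translation, and dubbing engines.

```python
from typing import Callable, List, Tuple

# A recognized segment: (start_seconds, end_seconds, text).
Segment = Tuple[float, float, str]


def translate_video(
    video_path: str,
    recognize: Callable[[str], List[Segment]],
    translate: Callable[[str], str],
    dub: Callable[[str], bytes],
    embed: Callable[[str, List[Segment], List[bytes]], str],
) -> str:
    """Run the four stages in order (all callables are hypothetical)."""
    segments = recognize(video_path)                    # 1. speech -> text
    translated = [(start, end, translate(text))         # 2. translate the text
                  for start, end, text in segments]
    clips = [dub(text) for _, _, text in translated]    # 3. dub each subtitle
    return embed(video_path, translated, clips)         # 4. embed audio + text


def demo() -> str:
    """Exercise the pipeline with toy stand-ins instead of real engines."""
    recognize = lambda path: [(0.0, 2.5, "hello"), (2.5, 5.0, "world")]
    translate = lambda text: text.upper()
    dub = lambda text: text.encode("utf-8")
    embed = lambda path, subs, clips: (
        f"{path}: {len(subs)} subtitles, {len(clips)} clips"
    )
    return translate_video("input.mp4", recognize, translate, dub, embed)
```

Because every later stage consumes the output of `recognize`, any recognition error is carried into the translated subtitles and the dubbed audio.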

openai-whisper (Local) Speech Recognition Channel

This mode uses the official open-source Whisper model from OpenAI. Compared to faster-whisper, it is slower but offers slightly higher accuracy.

Model Selection

The first time you use a model, it is downloaded automatically from OpenAI-Whisper's official CDN.

tiny --> base --> small --> medium --> large-v3-turbo --> large-v1 --> large-v2 --> large-v3

Model size increases from left to right, and so do recognition accuracy and the demands on RAM and VRAM. Choose large-v3-turbo or a larger model; large-v3 gives the best overall results.
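
For readers who want to try the same models outside the software, here is a minimal sketch that uses the openai-whisper Python package directly. The helper names `models_at_least` and `transcribe` are made up for this example, and the sketch shows how the library is commonly called, not how this software calls it internally. It assumes a recent openai-whisper release that includes large-v3-turbo.

```python
# Model names in the order given above, smallest to largest.
MODEL_ORDER = [
    "tiny", "base", "small", "medium",
    "large-v3-turbo", "large-v1", "large-v2", "large-v3",
]


def models_at_least(minimum: str = "large-v3-turbo") -> list:
    """Return every model from `minimum` up to the largest one."""
    return MODEL_ORDER[MODEL_ORDER.index(minimum):]


def transcribe(audio_path: str, model_name: str = "large-v3",
               language: str = "en") -> list:
    """Transcribe a file with openai-whisper and return its segments.

    Requires `pip install openai-whisper`; the model weights are
    downloaded on first use.
    """
    if model_name not in MODEL_ORDER:
        raise ValueError(f"unknown model: {model_name}")
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language=language)
    return result["segments"]  # each segment has "start", "end", "text"
```

Calling `models_at_least()` lists the recommended choices, and `transcribe("speech.wav", "large-v3", "en")` runs recognition on a hypothetical local file.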

Optimal Settings for This Channel

For the best speech recognition results, refer to the following settings:

  1. Choose the large-v3 model (make sure your computer has more than 16 GB of RAM or 10 GB of VRAM). If that is not feasible, try large-v1 or large-v3-turbo.
  2. Set the spoken language explicitly so that it matches the language spoken in the video.
  3. Under Menu - Tools - Advanced Options - Speech Recognition Parameters, set Minimum speech duration in milliseconds to 1000 and Maximum speech duration in seconds to 5 or greater. Do not select Whisper pre-segment audio.

Note: If you need dubbing and the dubbing role is clone (that is, cloning the timbre of the original voice), it is strongly recommended to set Minimum speech duration in milliseconds to 3000 and Maximum speech duration in seconds to 10. Voice cloning automatically uses the original audio segment that matches each subtitle's duration as its reference, and most dubbing channels require this reference audio to be between 3 and 10 seconds long; otherwise, dubbing may fail. Also select both Whisper pre-segment audio and Merge short subtitles with adjacent ones so that subtitle durations fall within the 3 to 10 second range.

  4. If the original speech is not clear or contains noise, select Noise reduction.
  5. If you are not using the clone role and prefer shorter subtitles (e.g., for vertical videos), you can reduce the Maximum speech duration in seconds, for example, to 3 or 2. If dubbing is involved, you may also select Secondary recognition.
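
The recommended values above can be summarized in a short Python sketch. The class `RecognitionSettings` and both functions are hypothetical names used only to restate the settings; they are not part of the software. The value of `merge_short_subtitles` in the non-clone case is not stated above and is left off here as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionSettings:
    min_speech_ms: int           # Minimum speech duration in milliseconds
    max_speech_s: int            # Maximum speech duration in seconds
    whisper_pre_segment: bool    # Whisper pre-segment audio
    merge_short_subtitles: bool  # Merge short subtitles with adjacent ones


def recommended_settings(clone_voice: bool = False) -> RecognitionSettings:
    """Return the values suggested in the text for each case."""
    if clone_voice:
        # The clone role needs reference audio of roughly 3 to 10 seconds.
        return RecognitionSettings(3000, 10, True, True)
    # General case: 1000 ms minimum, 5 s (or more) maximum, no pre-segmenting.
    # merge_short_subtitles=False is an assumption, not stated in the text.
    return RecognitionSettings(1000, 5, False, False)


def fits_clone_reference(start_s: float, end_s: float,
                         low: float = 3.0, high: float = 10.0) -> bool:
    """Check whether a subtitle is long enough, and short enough, to serve
    as reference audio for voice cloning (3 to 10 seconds)."""
    return low <= (end_s - start_s) <= high
```

For example, a subtitle running from 0.0 to 2.5 seconds fails `fits_clone_reference`, which is the kind of segment that merging short subtitles is meant to avoid.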

Secondary recognition: When dubbing is enabled and single subtitle embedding is chosen, this option re-transcribes the dubbed audio file after dubbing is complete. The shorter subtitles it produces are embedded in the video, so that the subtitles align precisely with the dubbed audio.

CUDA Acceleration

On Windows and Linux, if you have an NVIDIA GPU, you can install and configure CUDA and cuDNN to enable CUDA Acceleration, which significantly speeds up execution.
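
To check whether CUDA is actually usable from Python, a minimal sketch with PyTorch (which openai-whisper depends on) might look like this. The function name `pick_device` is made up for the example, and the software itself may detect the GPU differently.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see an NVIDIA GPU, otherwise "cpu"."""
    try:
        import torch  # installed together with openai-whisper
    except (ImportError, OSError):
        # PyTorch is missing or failed to load, so fall back to the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Recognition would run on:", pick_device())
```

When using the openai-whisper package directly, the result can be passed on as `whisper.load_model("large-v3", device=pick_device())`.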

View CUDA and cuDNN installation guide