The core principle of video translation software is to recognize text from the spoken audio in a video, translate that text into the target language, dub the translated text, and finally embed the dubbed audio and subtitles into the video.
As you can see, the first step is to recognize text from the spoken audio in the video. The accuracy of this recognition directly affects the subsequent translation and dubbing.
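The four stages above can be sketched as a simple pipeline. This is a minimal illustration, not the software's actual implementation: every function body is a placeholder, and the function names are hypothetical. A real version would call a speech recognizer (e.g. Whisper), a translation API, a TTS engine, and ffmpeg.

```python
# Hypothetical sketch of the four-stage video-translation pipeline.
# All bodies are placeholders standing in for real services.

def recognize_speech(video_path: str) -> str:
    """Stage 1: speech-to-text. Placeholder for a Whisper call."""
    return "hello world"

def translate_text(text: str, target_lang: str) -> str:
    """Stage 2: translate the recognized text. Placeholder."""
    return f"[{target_lang}] {text}"

def synthesize_dub(text: str) -> bytes:
    """Stage 3: text-to-speech on the translated text. Placeholder."""
    return text.encode("utf-8")

def embed(video_path: str, audio: bytes, subtitles: str) -> str:
    """Stage 4: mux dubbed audio and subtitles into the video. Placeholder."""
    return video_path.replace(".mp4", ".translated.mp4")

def translate_video(video_path: str, target_lang: str) -> str:
    # Any recognition error in stage 1 propagates through every later stage,
    # which is why recognition accuracy matters so much.
    text = recognize_speech(video_path)
    translated = translate_text(text, target_lang)
    audio = synthesize_dub(translated)
    return embed(video_path, audio, translated)
```

The structure makes the dependency explicit: translation and dubbing consume whatever stage 1 produced, correct or not.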
OpenAI-Whisper Local Mode
This mode uses the official open-source OpenAI Whisper models. Compared to the faster mode, it is slower while delivering the same accuracy.
Model selection on the right works the same way: moving up from tiny to large-v3, each model consumes more computer resources and delivers correspondingly higher accuracy.
Note: Although the faster mode and OpenAI mode share many of the same model names, the models are not interchangeable. Please download the models specifically for the OpenAI mode from https://github.com/jianchang512/stt/releases/0.0.
large-v3-turbo Model
OpenAI has recently released a model optimized from large-v3, called large-v3-turbo. It offers recognition accuracy similar to large-v3 while being significantly smaller and lighter on resources, making it a suitable replacement.
How to Use:
- Update the software to version v2.67.
- In the speech recognition dropdown, select "openai-whisper local".
- Select "large-v3-turbo" in the model dropdown.
- Download the large-v3-turbo.pt file into the models folder within the software directory.
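Since a misplaced or misnamed model file is an easy mistake, a small sanity check can confirm the layout described above before launching. This is a hypothetical helper, assuming only the folder structure stated in the steps (a models folder inside the software directory); the function name is my own.

```python
from pathlib import Path

def model_ready(software_dir: str, name: str = "large-v3-turbo.pt") -> bool:
    """Return True if the model file sits in <software_dir>/models/."""
    return (Path(software_dir) / "models" / name).is_file()

# Example: check the current directory and report what is missing.
if __name__ == "__main__":
    if model_ready("."):
        print("large-v3-turbo.pt found; the model dropdown should list it")
    else:
        print("place large-v3-turbo.pt in ./models/ first")
```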