Speech recognition, which converts the human speech in audio and video into text, is the first step in video translation, and its quality is crucial for the subsequent dubbing and subtitles. The software currently supports two local, offline recognition modes: faster-whisper local and openai-whisper local.


The two are closely related: faster-whisper is essentially a streamlined, optimized derivative of openai-whisper. Their recognition accuracy is virtually identical, but faster-whisper is faster; the trade-off is that it has stricter environment requirements when using CUDA acceleration.

faster-whisper Local Recognition Mode

The software defaults to and recommends this mode, as it's faster and more efficient.

The model sizes in this mode range from smallest to largest: tiny -> base -> small -> medium -> large-v1 -> large-v2 -> large-v3


From the first to the last, the model size grows from roughly 60 MB to 2.7 GB, and the required RAM, VRAM, and CPU/GPU load grow with it. If you have less than 10 GB of available VRAM, large-v3 is not recommended, as it may crash or freeze.
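For reference, the model size is just a name passed at load time. Below is a minimal sketch, assuming the faster-whisper Python package is installed and that input.wav (a placeholder name) is the audio track extracted from your video; compute_type="float16" roughly halves VRAM use compared with the float32 default:

```python
from faster_whisper import WhisperModel

# The model is downloaded on first use; larger names need more VRAM.
# float16 roughly halves VRAM use compared with the float32 default.
model = WhisperModel("medium", device="cuda", compute_type="float16")

segments, info = model.transcribe("input.wav")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```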

From tiny to large-v3, as the size and resource consumption increase, the recognition accuracy also increases. tiny/base/small are small models; they are very fast and consume few resources, but their accuracy is low.

medium is a medium-sized model. For videos with Chinese speech, use medium or larger; smaller models give poor results.

If your CPU is powerful enough and you have plenty of RAM, you can choose large-v1/v2 even without CUDA acceleration. Accuracy improves noticeably over the smaller models, though recognition is slower.
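On a CPU-only machine, int8 quantization cuts memory use and speeds up inference at a small accuracy cost; a hedged sketch along the same lines:

```python
from faster_whisper import WhisperModel

# CPU-only: int8 quantization makes large-v2 feasible without a GPU,
# though transcription will still be noticeably slower than on CUDA.
model = WhisperModel("large-v2", device="cpu", compute_type="int8",
                     cpu_threads=8)  # tune to your core count

segments, _ = model.transcribe("input.wav")
print("".join(segment.text for segment in segments))
```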

large-v3 consumes significant resources and is not recommended unless your computer is powerful enough; use large-v3-turbo instead. Accuracy is essentially the same, but large-v3-turbo is faster and consumes fewer resources.

Models ending with .en and those starting with distil can only be used for videos with English pronunciation. Do not use them for videos in other languages.
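In code, these variants are selected purely by name. A sketch of the alternatives (recent faster-whisper releases resolve the large-v3-turbo name directly; on older releases you may need to pass a Hugging Face repo path instead, so treat the bare name as an assumption):

```python
from faster_whisper import WhisperModel

# Near large-v3 accuracy, but faster and lighter on VRAM.
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# English-only variants -- use them ONLY for English-language audio:
# model = WhisperModel("medium.en", device="cuda", compute_type="float16")
# model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
```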

openai-whisper Local Recognition Mode

The models in this mode are basically the same as in faster-whisper, from smallest to largest: tiny -> base -> small -> medium -> large-v1 -> large-v2 -> large-v3. The usage notes above apply equally: tiny/base/small are small models, and large-v1/v2/v3 are large models.
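For comparison, the equivalent call with the openai-whisper package; a minimal sketch, assuming ffmpeg is on your PATH (transcribe() extracts the audio itself, so a video file works directly):

```python
import whisper

# Model names match faster-whisper: tiny / base / small / medium / large-v3 ...
model = whisper.load_model("medium")

# language can be omitted to auto-detect; "zh" is shown for Chinese speech.
result = model.transcribe("input.mp4", language="zh")
print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text']}")
```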

Summary of Selection Methods

  1. It is recommended to prioritize the faster-whisper local mode. If you want CUDA acceleration but keep running into environment errors, fall back to the openai-whisper local mode.
  2. Regardless of the mode, for videos with Chinese speech use at least the medium model; for videos with English speech use at least small. If computer resources allow, large-v3-turbo is recommended (see the sketch after this list).
  3. Models ending with .en and those starting with distil can only be used for videos with English pronunciation.
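
These rules condense into a small helper; pick_model and its 6 GB threshold are illustrative assumptions, not part of the software:

```python
def pick_model(language: str, vram_gb: float | None) -> str:
    """Rough model choice following the guidelines above.

    language -- ISO code of the spoken language, e.g. "zh" or "en"
    vram_gb  -- available GPU memory in GB, or None for CPU-only
    """
    if vram_gb is not None and vram_gb >= 6:  # assumed threshold
        return "large-v3-turbo"  # resources are sufficient
    if language == "zh":
        return "medium"          # Chinese speech needs medium or larger
    return "small"               # minimum for English speech

print(pick_model("zh", vram_gb=4))  # -> medium
```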