Speech Recognition Channels – Transcribe Speech into Subtitles/Text | pyVideoTrans Official - Open Source Free Video Translation & Dubbing Software (website: pyvideotrans.com, GitHub: github.com/jianchang512/pyvideotrans)

Speech recognition channels transcribe the spoken audio in a video into text and organize that text into subtitles with precise timestamps.
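To make "subtitles with timestamps" concrete, here is a minimal Python sketch that renders recognized segments as SRT text. It is an illustration only, not pyVideoTrans's own code: the `Segment` tuple and the `format_timestamp` and `to_srt` helpers are hypothetical names, and the only assumption is that a channel yields a start time, an end time (both in seconds), and the recognized text for each segment.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """Hypothetical container for one recognized phrase (times in seconds)."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [Segment(0.0, 2.5, "Hello, world."), Segment(2.5, 5.0, "Welcome back.")]
    print(to_srt(demo))
```

Each cue consists of a running index, a `start --> end` line, and the text, which is the layout most subtitle players expect from an `.srt` file.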

Speech Recognition Channels

Friendly Tip: Some channels, such as OpenAI and ByteDance's Volcano Engine, require you to pre-configure the API address or secret key (SK) before use. Don't worry, it's simple! Just click the "Speech Recognition Settings" menu at the top of the software and fill in the corresponding information.

Configure API Keys and other info in "Speech Recognition Settings"

Currently Supported Speech Recognition Channels (Over Ten Types)

To meet the needs of different users, we offer a variety of choices, covering local offline models and online cloud services.

Click on a channel name to view its detailed usage instructions.

💻 Local Offline Recognition (No Internet Required, Privacy Protected)

These channels require downloading model files to your computer the first time you use them, after which they can run completely offline.

faster-whisper (Local Mode): A very popular local recognition solution. It is known for its speed and low resource usage, while supporting recognition for dozens of languages. It is currently one of the preferred options for local recognition.
openai-whisper (Local Mode): The official open-source model from OpenAI, known for high recognition accuracy and support for a wide range of languages.
Alibaba FunASR (Chinese Recognition): An open-source model launched by Alibaba's DAMO Academy, specifically optimized for Chinese scenarios. It recognizes Chinese pronunciation and segments sentences accurately.
Huggingface_ASR Speech Recognition Channel: Supports 8 models from Moonshine and one English model from NVIDIA.
faster-whisper-xxl.exe: This is an extra-large model version specifically designed for Windows users, offering better recognition results. You need to download the faster-whisper-xxl.exe file yourself to use it.
whisper.c: This is a recognition channel that uses whisper.cpp as the backend. You need to set up whisper.cpp yourself to use it.
Parakeet-tdt Speech Recognition: An open-source recognition model from NVIDIA. This requires you to deploy the service yourself, then enter your API address in the software's settings menu.
STT Speech Recognition API: Another open-source project that requires you to deploy it yourself. After deployment, enter the API address into the software to use it.
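As an illustration of what a local channel does under the hood, the sketch below calls the faster-whisper Python library directly. It is not pyVideoTrans's own code: the model size `"small"`, the CPU/int8 settings, and the helper names `collect_rows` and `transcribe_locally` are assumptions chosen for the example. The model files are downloaded on first use, after which the call runs offline.

```python
from typing import Iterable, List, Tuple


def collect_rows(segments: Iterable) -> List[Tuple[float, float, str]]:
    """Turn segment objects (with .start, .end, .text) into plain tuples."""
    return [(round(s.start, 3), round(s.end, 3), s.text.strip()) for s in segments]


def transcribe_locally(audio_path: str, model_size: str = "small"):
    """Transcribe one file with faster-whisper (needs `pip install faster-whisper`)."""
    # Imported lazily so collect_rows() works without the library installed.
    from faster_whisper import WhisperModel

    # Assumed settings: CPU inference with int8 quantization to limit memory use.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus detection info;
    # the audio is only processed as the generator is consumed.
    segments, info = model.transcribe(audio_path, vad_filter=True)
    return info.language, collect_rows(segments)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        language, rows = transcribe_locally(sys.argv[1])
        print("Detected language:", language)
        for start, end, text in rows:
            print(f"[{start:7.2f} -> {end:7.2f}] {text}")
```

Run it as `python script.py your_audio.wav`; each printed row carries the start and end time that a subtitle line would use.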

☁️ Online Recognition (Cloud Processing, Powerful Features)

These channels upload audio files to cloud servers for processing, typically offering excellent results, but some services require payment or have usage limits.

Free or with a free tier:

Ali Bailian Qwen3-ASR: Based on Alibaba's "Tongyi Qianwen" large model. You need to go to the Ali Bailian platform to activate the service and create an API Key.
Elevenlabs.io Speech Recognition: A service provided by a company specializing in AI audio technology. You need to register on their official website to get a free API Key, with limited quotas for the free version.
deepgram.com Speech Recognition: A well-known speech recognition service provider, known for high accuracy and speed. You need to register and apply for an API Key on their official website deepgram.com.
Gemini Large Model Recognition: A powerful model launched by Google, with outstanding ability to recognize less common languages. It requires a Gemini API key, and access from within China requires a VPN.
Google Speech Recognition: A free online recognition service provided by Google. The performance is decent, but using it within China requires a VPN.

Requires payment or application for an API Key:

302.AI Speech Recognition: Apply for an app key on the official website, 302.ai, to use this channel.
ByteDance Volcano Subtitle Generation: A professional speech technology service provided by ByteDance's Volcano Engine. Its Chinese recognition is exceptionally effective, especially suitable for audio with accents or background noise. You need to activate the service on the Volcano Engine official website.
OpenAI Speech Recognition: Uses the official OpenAI API for recognition, with results comparable to the local Whisper version, but requires you to have an OpenAI API key (SK).
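For readers curious how a key-based online channel works, here is a hedged Python sketch that sends one audio file to OpenAI's transcription endpoint using the official `openai` package. It does not reflect how pyVideoTrans stores or reads its settings: the helper names `load_settings` and `transcribe_online`, the use of environment variables, and the `whisper-1` model choice are assumptions made for this example.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the API key and optional API address from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set")
    # Fall back to the public endpoint when no custom API address is given.
    base_url = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1").strip()
    return {"api_key": api_key, "base_url": base_url}


def transcribe_online(audio_path: str, settings: dict):
    """Upload one audio file and return the service's SRT-formatted response."""
    # Imported lazily; needs `pip install openai` only when a request is made.
    from openai import OpenAI

    client = OpenAI(api_key=settings["api_key"], base_url=settings["base_url"])
    with open(audio_path, "rb") as audio_file:
        # "whisper-1" is an assumed model name; use whichever your account offers.
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="srt",
        )


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(transcribe_online(sys.argv[1], load_settings()))
```

Keeping the key in an environment variable rather than in the script avoids committing it to version control by accident; uploads to the service are billed according to the provider's pricing.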

🔧 Advanced Customization Options (Suitable for Developers)

If you have some technical background, you can also try the following more flexible solutions:

Custom Speech Recognition API: If you have programming skills, you can write your own speech recognition API based on the data format standards we provide, giving you the highest degree of customization.
