The speech recognition channels transcribe the spoken audio in a video into text and organize it into subtitles with precise timestamps.
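To make "subtitles with precise timestamps" concrete, here is a minimal sketch of how recognized segments could be rendered in the common SRT subtitle format. The `Segment` class and both function names are hypothetical illustrations, not part of the software; it assumes each channel yields a start time, an end time (both in seconds), and the recognized text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized utterance (hypothetical structure); times are in seconds."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{seg.text.strip()}"
        )
    return "\n\n".join(cues) + "\n"


if __name__ == "__main__":
    demo = [Segment(0.0, 1.25, "Hello"), Segment(1.25, 3.0, "World")]
    print(segments_to_srt(demo))
```

Whichever channel you choose, the result has this general shape: a numbered cue, a start and end time, and the recognized text.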

Tip: Some channels, such as OpenAI and ByteDance Volcano Engine, require you to set an API address or secret key (SK) before use. To do so, open the "Speech Recognition Settings" menu at the top of the software and enter the required information.

15 Currently Supported Speech Recognition Channels

To suit different needs, the software offers both local offline models and online cloud services.

Click on the channel name to view detailed usage instructions for that channel.

💻 Local Offline Recognition (No Internet Required, Privacy Protected)

These channels require downloading model files to your computer the first time you use them, after which they can run completely offline.

  • faster-whisper (Local Mode): A very popular local recognition solution. Known for its speed and low resource usage, it supports recognition in dozens of languages and is one of the preferred choices for local recognition.
  • openai-whisper (Local Mode): An open-source model from OpenAI, offering high recognition accuracy and support for a wide range of languages.
  • Alibaba FunASR (Chinese Recognition): An open-source model from Alibaba DAMO Academy, optimized for Chinese, giving accurate transcription and sentence segmentation for Chinese speech.
  • faster-whisper-xxl.exe: A large model version specifically designed for Windows users, offering better recognition results. You need to download the faster-whisper-xxl.exe file separately to use it.
  • Parakeet-tdt Speech Recognition: A recognition model open-sourced by NVIDIA. You need to deploy the service yourself and then enter its API address in the software's settings menu.
  • STT Speech Recognition API: Another open-source project that requires self-deployment. After deploying it, enter its API address in the software to use it.

☁️ Online Recognition (Cloud Processing, Powerful Features)

These channels upload audio files to cloud servers for processing. Results are typically excellent, but some services require payment or have usage limits.

Free or with Free Quotas:

  • Google Speech Recognition: A free online recognition service provided by Google. It performs reasonably well, but requires a VPN for use in China.
  • Elevenlabs.io Speech Recognition: A service from a company specializing in AI audio technology. You need to register on their official website and obtain an API key; the free tier has a limited quota.
  • deepgram.com Speech Recognition: A speech recognition service provider known for high accuracy and speed. You need to register on the official website, deepgram.com, and apply for an API key.
  • Gemini Large Model Recognition: A powerful model from Google that excels at recognizing less common languages. It requires a Gemini API key, and a VPN is needed for access in China.
  • Alibaba Bailian Qwen3-ASR: Based on Alibaba's "Tongyi Qianwen" large model. You need to activate the service on the Alibaba Bailian platform and create an API Key.

Requires Payment or API Key Application:

  • 302.AI Speech Recognition: Visit the official 302.ai website and apply for an app key to use the service.
  • ByteDance Volcano Subtitle Generation: A professional speech technology service from ByteDance's Volcano Engine. It offers excellent Chinese recognition and is especially suitable for audio with accents or background noise. You need to activate the service on the official Volcano Engine website.
  • OpenAI Speech Recognition: Uses the official OpenAI API for recognition, with results comparable to the local Whisper version. It requires an OpenAI API key (SK).

🔧 Advanced Customization Options (Suitable for Developers)

If you have some technical background, you can also try the following more flexible solutions:

  • Custom Speech Recognition API: If you have programming skills, you can write your own speech recognition API that follows the data format standards we provide.