Feature: Batch Audio-to-Subtitle

Supported video formats:
mp4/mov/avi/mkv/webm/mpeg/ogg/mts/ts
Supported audio formats:
wav/mp3/m4a/flac/aac
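As a rough illustration of how the two format lists above could be used to sort dropped files, here is a minimal Python sketch. The function name classify_media and the returned labels are hypothetical and not part of the tool; only the extension lists come from this page.

```python
from pathlib import Path

# Extension lists copied from the supported formats above.
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm", "mpeg", "ogg", "mts", "ts"}
AUDIO_EXTS = {"wav", "mp3", "m4a", "flac", "aac"}


def classify_media(path):
    """Return "video", "audio", or "unsupported" based on the file extension.

    Hypothetical helper: the comparison is case-insensitive and looks only
    at the final extension, not at the file contents.
    """
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in VIDEO_EXTS:
        return "video"
    if ext in AUDIO_EXTS:
        return "audio"
    return "unsupported"
```

For example, classify_media("talk.MP4") is treated as video and classify_media("notes.flac") as audio, while a file such as "readme.pdf" falls through to "unsupported".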
This is a dedicated panel for transcribing audio/video files into text or subtitles. If you don't need to translate a video and only want to generate subtitles from audio/video files in batches, this is the feature to use.
To batch transcribe video or audio files into subtitles or txt, drag and drop the files, set the original language (spoken language) and the recognition model, and start. Advanced features such as sentence re-segmentation, noise reduction, and speaker identification are also supported.
The large button at the top: Click it, or drag and drop files, to add one or more audio/video files to transcribe.
Enable CUDA: If you have an NVIDIA graphics card with CUDA configured on Windows or Linux, check this to speed up transcription.
Original Language: The spoken language in the audio/video. Select it correctly, otherwise transcription will fail. If unsure, choose auto at the bottom of the dropdown.
Speech Recognition: Choose the method for speech transcription. For general needs, select faster-whisper.
- faster-whisper (Local): A local model (download required on first run) with good speed and quality; the recommended choice for general needs. It offers over ten models of varying sizes. The smallest, fastest, and most resource-efficient model is tiny, but its accuracy is low and it is not recommended. The best performers are large-v2/large-v3, and these are the recommended choices. Models ending with .en or starting with distil- only support English videos.
- openai-whisper (Local): Similar to faster-whisper, but slower. Accuracy might be slightly higher. Again, large-v2/large-v3 are recommended.
- Alibaba FunASR (Local): Alibaba's local recognition model. It works well with Chinese, so try it if your original video is in Chinese. It also requires an online download on the first run.
- Also supports various online APIs and local models, such as ByteDance Volcano Subtitle Generation, OpenAI Speech Recognition, Gemini Speech Recognition, and Alibaba Qwen3-ASR.
- Click here for details on all speech recognition channels
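The naming rule for English-only models (names ending with .en or starting with distil-) can be checked mechanically. The following is a minimal sketch, not the tool's own code: the function names and the use of "en" as the language code for English are assumptions made for illustration.

```python
def is_english_only(model_name):
    """Apply the naming rule from the text: names ending with ".en" or
    starting with "distil-" are English-only."""
    return model_name.endswith(".en") or model_name.startswith("distil-")


def check_model(model_name, language):
    """Return a warning string if the model cannot handle the language,
    otherwise None.

    Assumption: "en" is the code used for English; the real tool may
    represent languages differently.
    """
    if is_english_only(model_name) and language != "en":
        return f"{model_name} only supports English, but the language is {language}"
    return None
```

With this rule, a name such as tiny.en paired with a Chinese video would produce a warning, while large-v3 would pass for any language.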
Select Model: Larger models are more accurate but slower and consume more resources.
Noise Reduction: If selected, noise in the audio is reduced before speech recognition, improving accuracy.
Speaker Identification: If selected, it will attempt to identify and differentiate speakers after speech recognition (accuracy is limited). The number below is a preset for how many speakers to identify. Setting it in advance can increase accuracy; the default is no limit.
Insert Speaker Tag: If selected, a speaker identifier (e.g., [spk0]) will be inserted at the beginning of each subtitle text.
Default Segmentation | Local Re-segmentation | LLM Re-segmentation: Choose between default segmentation, local re-segmentation (a local algorithm based on punctuation and duration), and LLM re-segmentation (a large language model performs intelligent sentence segmentation and punctuation optimization).
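The tool's actual local re-segmentation algorithm is not described here. As a hypothetical illustration of what splitting "based on punctuation and duration" could look like, the sketch below groups timed words into subtitle cues, closing a cue at sentence-ending punctuation or when it would exceed a maximum duration. The Word type, the resegment function, the punctuation set, and the 6-second default are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


# Assumed set of sentence-ending marks (Latin and CJK).
SENTENCE_END = ".?!。？！"


def resegment(words, max_duration=6.0):
    """Group words into (start, end, text) cues.

    A cue is closed right after sentence-ending punctuation, or before a
    word that would stretch the cue beyond max_duration seconds.
    """
    cues = []
    current = []
    for word in words:
        # Close the running cue first if this word would make it too long.
        if current and word.end - current[0].start > max_duration:
            cues.append(current)
            current = []
        current.append(word)
        # Close the cue right after sentence-ending punctuation.
        if word.text and word.text[-1] in SENTENCE_END:
            cues.append(current)
            current = []
    if current:
        cues.append(current)
    return [(c[0].start, c[-1].end, " ".join(w.text for w in c)) for c in cues]
```

Joining words with a space suits space-delimited languages; for Chinese or Japanese text a real implementation would likely join without spaces.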
Output Format: Default is srt subtitle format. Options include txt, vtt, and ass.
Overall Recognition vs. Batch Inference: Overall Recognition uses built-in VAD (Voice Activity Detection) for better sentence segmentation. Batch Inference splits the audio based on a set maximum speech duration, then recognizes 16 segments simultaneously, which is faster but segmentation is slightly less precise.
Output Subtitles to Original Location: If selected, transcription results will be saved in the same folder as the original audio/video file.
Open Output Directory: Click this button to open the directory where transcription results are saved. Files are saved with the same name as the original audio/video file.
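The naming and location rules above (same base name as the source file, extension set by the output format, optionally saved next to the original) can be sketched as follows. This is an illustration only: the function name output_path and the default folder name "output" are hypothetical, since the page does not state where the default output directory is.

```python
from pathlib import Path


def output_path(source, fmt="srt", to_original_location=False, output_dir="output"):
    """Build the result file path for one transcribed file.

    The result keeps the source file's base name and takes the chosen
    format as its extension. If to_original_location is set, it is placed
    in the source file's folder; otherwise in output_dir (a placeholder
    for the tool's real output directory).
    """
    source = Path(source)
    folder = source.parent if to_original_location else Path(output_dir)
    return folder / f"{source.stem}.{fmt}"
```

For a source file /videos/talk.mp4 with the vtt format and the original-location option enabled, this yields /videos/talk.vtt.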
Feature: Multi-Role Voice Dubbing / Speech Synthesis per Subtitle
Supported subtitle or text format for dubbing:
srt
This is similar to the Batch Subtitle Dubbing feature. The difference is that it lets you assign a different speaker to each subtitle line, which enables multi-role dubbing.
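Conceptually, per-line speaker assignment is a mapping from subtitle line numbers to voices, with a fallback for lines that have no explicit choice. The sketch below is a hypothetical illustration of that idea; the function name, the voice names, and the 1-based line numbering are assumptions and do not reflect the tool's internals.

```python
def assign_voices(lines, overrides, default_voice):
    """Pair each subtitle line with a voice.

    lines: subtitle texts in order.
    overrides: dict mapping a 1-based line number to a voice name.
    default_voice: voice used for lines without an override.
    Returns a list of (voice, text) tuples in subtitle order.
    """
    return [
        (overrides.get(number, default_voice), text)
        for number, text in enumerate(lines, start=1)
    ]
```

For example, assign_voices(["Hi", "Hello", "Bye"], {2: "voice_b"}, "voice_a") gives the second line to voice_b and leaves the others on voice_a.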


