Feature: Batch Audio-to-Subtitle

Supported video formats: mp4/mov/avi/mkv/webm/mpeg/ogg/mts/ts

Supported audio formats: wav/mp3/m4a/flac/aac

This is a dedicated panel for transcribing audio/video files into text or subtitles. If you don't need to translate a video and only want to generate subtitles from audio/video files in batches, this is the feature to use.

Batch transcribe video or audio files into subtitle or txt files. Drag and drop the files, set the original (spoken) language and the recognition model, and start. Advanced options include sentence re-segmentation, noise reduction, and speaker identification.

The large button at the top: click it, or drag and drop onto it, to add one or more audio/video files to transcribe.
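When preparing a folder of files before dropping them onto the panel, it can help to filter out unsupported extensions first. The following is a minimal sketch that only mirrors the two format lists at the top of this page; the function names `classify_media` and `collect_supported` are hypothetical and are not part of the application.

```python
from pathlib import Path

# Extensions copied from the supported-format lists on this page.
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm", "mpeg", "ogg", "mts", "ts"}
AUDIO_EXTS = {"wav", "mp3", "m4a", "flac", "aac"}


def classify_media(path):
    """Return "video", "audio", or None for an unsupported extension.

    Hypothetical helper: it checks the file extension only, not the content.
    """
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in VIDEO_EXTS:
        return "video"
    if ext in AUDIO_EXTS:
        return "audio"
    return None


def collect_supported(paths):
    """Keep only the paths whose extension appears in one of the two lists."""
    return [p for p in paths if classify_media(p) is not None]
```

For example, `collect_supported(["a.mkv", "b.doc", "c.wav"])` keeps the first and last entries and drops `b.doc`.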

  • Enable CUDA: If you have an NVIDIA graphics card with CUDA configured on Windows or Linux, check this to speed up transcription.

  • Original Language: The spoken language in the audio/video. Select the correct language; with the wrong one, transcription will fail. If unsure, choose auto at the bottom of the dropdown.

  • Speech Recognition: Choose the method for speech transcription. For general needs, select faster-whisper.

    • faster-whisper (Local): A local model (downloaded on first run) with good speed and quality; the recommended choice for general needs. It offers more than ten models of varying sizes. tiny is the smallest, fastest, and least resource-intensive model, but its accuracy is low and it is not recommended. large-v2 and large-v3 perform best and are the recommended choices. Models ending with .en or starting with distil- only support English audio/video.
    • openai-whisper (Local): Similar to faster-whisper but slower, with possibly slightly higher accuracy. Again, large-v2/large-v3 are recommended.
    • Alibaba FunASR (Local): Alibaba's local recognition model. Works well with Chinese. If your original video is in Chinese, try this. Also requires an online download on the first run.
    • Various other online APIs and local models are also supported, such as ByteDance Volcano Subtitle Generation, OpenAI Speech Recognition, Gemini Speech Recognition, and Alibaba Qwen3-ASR.
    • Click here for details on all speech recognition channels
  • Select Model: Larger models are more accurate but slower and consume more resources.

  • Noise Reduction: If selected, noise in the audio is reduced before speech recognition, improving accuracy.

  • Speaker Identification: If selected, the tool attempts to identify and distinguish speakers after speech recognition (accuracy is limited). The number below it presets how many speakers to identify; setting it in advance can improve accuracy. The default is no limit.

  • Insert Speaker Tag: If selected, a speaker identifier (e.g., [spk0]) will be inserted at the beginning of each subtitle text.

  • Default Segmentation | Local Re-segmentation | LLM Re-segmentation: Choose between the default segmentation, a local algorithm that re-segments based on punctuation and duration, or a large language model that performs intelligent sentence segmentation and punctuation optimization.

  • Output Format: The default is the srt subtitle format; txt, vtt, and ass are also available.

  • Overall Recognition vs. Batch Inference: Overall Recognition uses built-in VAD (Voice Activity Detection) for better sentence segmentation. Batch Inference splits the audio based on a set maximum speech duration and then recognizes 16 segments simultaneously, which is faster but gives slightly less precise segmentation.

  • Output Subtitles to Original Location: If selected, transcription results will be saved in the same folder as the original audio/video file.

  • Open Output Directory: Click this button to open the directory where transcription results are saved. Files are saved with the same name as the original audio/video file.
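To make the Insert Speaker Tag and Output Format options more concrete, here is a minimal sketch of how tagged cues could be rendered as srt or vtt text. It is an illustration only, not the application's code: the cue tuples, the function names, and the single space after the tag are assumptions. The one format fact relied on is that srt timestamps use a comma before the milliseconds, while vtt uses a period and starts with a `WEBVTT` header.

```python
def format_timestamp(seconds, sep):
    """Format seconds as HH:MM:SS<sep>mmm (sep is "," for srt, "." for vtt)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def insert_speaker_tags(cues, speakers):
    """Prefix each cue's text with a tag such as [spk0].

    cues: list of (start_seconds, end_seconds, text); speakers: one index per cue.
    The space after the tag is an assumption made for readability.
    """
    return [
        (start, end, f"[spk{spk}] {text}")
        for (start, end, text), spk in zip(cues, speakers)
    ]


def to_srt(cues):
    """Render cues as numbered srt blocks separated by blank lines."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        times = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{i}\n{times}\n{text}\n")
    return "\n".join(blocks)


def to_vtt(cues):
    """Render cues as vtt: a WEBVTT header followed by unnumbered blocks."""
    blocks = ["WEBVTT\n"]
    for start, end, text in cues:
        times = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        blocks.append(f"{times}\n{text}\n")
    return "\n".join(blocks)
```

A typical flow would be `to_srt(insert_speaker_tags(cues, speakers))` when the speaker tag option is on, and `to_srt(cues)` otherwise.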
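The Local Re-segmentation option is described only as an algorithm based on punctuation and duration. One way such a rule could work is sketched below, assuming word-level timestamps are available. The function name `resegment`, the 6-second `max_duration` default, and joining words with spaces (which would not suit Chinese text) are all assumptions for illustration, not the application's actual algorithm.

```python
# Sentence-ending punctuation, including full-width forms.
SENTENCE_END = (".", "?", "!", "。", "？", "！")


def resegment(words, max_duration=6.0):
    """Group timed words into subtitle segments.

    words: list of (text, start_seconds, end_seconds).
    A segment is closed when a word ends with sentence-ending punctuation,
    or when the segment has lasted at least max_duration seconds.
    Returns a list of (start_seconds, end_seconds, text).
    """
    segments = []
    current = []
    for text, start, end in words:
        current.append((text, start, end))
        seg_start = current[0][1]
        too_long = end - seg_start >= max_duration
        if text.endswith(SENTENCE_END) or too_long:
            segments.append((seg_start, end, " ".join(w[0] for w in current)))
            current = []
    if current:
        # Flush trailing words that never reached a closing condition.
        segments.append(
            (current[0][1], current[-1][2], " ".join(w[0] for w in current))
        )
    return segments
```

Punctuation gives natural sentence boundaries, while the duration cap keeps a single subtitle from staying on screen too long when the recognizer emits no punctuation.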
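The Batch Inference idea (cut the audio by a maximum duration, then recognize up to 16 pieces at once) can be sketched roughly as follows. This is a simplified stand-in, not the application's implementation: it uses fixed-length windows and a thread pool in place of real model batching, and `recognize` is a hypothetical callable supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def split_by_max_duration(total_seconds, max_chunk_seconds):
    """Cut [0, total_seconds] into consecutive windows of at most max_chunk_seconds."""
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


def recognize_in_batches(chunks, recognize, batch_size=16):
    """Run recognize(chunk) on up to batch_size chunks concurrently.

    Results come back in the same order as chunks, because
    ThreadPoolExecutor.map preserves input order.
    """
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return list(pool.map(recognize, chunks))
```

Because each window is cut without regard to where sentences begin or end, a cut can land mid-sentence, which is consistent with the note above that this mode is faster but segments slightly less precisely.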

Feature: Multi-Role Voice Dubbing / Speech Synthesis per Subtitle

Supported subtitle or text format for dubbing: srt

This is similar to the Batch Subtitle Dubbing feature, except that it lets you assign a different speaker to each subtitle line, which enables multi-role dubbing.