An ideal translated video has accurate subtitles of appropriate length, a voiceover whose tone matches the original, and subtitles, audio, and video that are fully synchronized. The software's default settings favor ease of use over output quality, so they aren't optimal. Follow the steps below to tune each stage for the best results.

Step 1: Speech Recognition

  • Goal: Convert the speech in a video into a subtitle file in the corresponding language.

  • Corresponding Control Element: The "Speech Recognition" row

  • Best Configuration for Non-Chinese:

    • Free: openai-whisper (local) large-v3 model | faster-whisper (local) large-v3 model
    • Paid: OpenAI Speech Recognition API
  • Best Configuration for Chinese:

    • Free: Qwen-ASR (local)
    • Paid: Doubao Speech Recognition Large Model (Speed Version) | Alibaba Bailian ASR
  • Best Configuration for Japanese:

    • Free: openai-whisper (local) large-v3 model | Huggingface_ASR -> reazon-research/japanese-wav2vec2-large-rs35kh
    • Paid: OpenAI Speech Recognition API
  • Best Configuration for Less Common Languages:

    • Free: openai-whisper (local) large-v3 model
    • Paid: Gemini Large Model Recognition | OpenAI Speech Recognition API
  • Note: Without an NVIDIA GPU, or if CUDA acceleration isn't configured, local models run very slowly. They may also crash if GPU memory is insufficient.
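
The note above can be checked before starting a long local job. The following is a minimal sketch, not part of the software: it assumes the NVIDIA driver's `nvidia-smi` tool is installed alongside the driver, and the function name `nvidia_gpu_visible` is hypothetical. It only shows that the driver can see a GPU; it does not prove that CUDA acceleration is configured for the software itself.

```python
import shutil
import subprocess


def nvidia_gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not on PATH: no NVIDIA driver, or it is not installed system-wide.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    if nvidia_gpu_visible():
        print("NVIDIA GPU detected; local models can use it if CUDA is configured.")
    else:
        print("No NVIDIA GPU detected; consider a paid online channel instead.")
```

If the check fails, the paid online channels listed above avoid the local hardware requirement.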

Step 2: Subtitle Translation

  • Goal: Translate the subtitle file generated in Step 1 into the target language.

  • Corresponding Control Element: The "Translation Channel" row

  • Best Configuration:

    • Preferred AI Channel (Paid): DeepSeek, OpenAI ChatGPT (latest model), Gemini (latest model)
    • Select Send Full Subtitles

Step 3: Voiceover

  • Goal: Generate a voiceover based on the translated subtitle file.
  • Corresponding Control Element: The "Voiceover Channel" row
  • Best Configuration:
    • Free (all languages): Edge-TTS
    • Free (Chinese, English, Japanese, Korean): Qwen-TTS (local), F5-TTS/Index-TTS/GPT-SoVITS/CosyVoice (local)
    • Paid: Doubao Speech Synthesis 2.0 / Qwen-TTS (bailian) / 302.AI / Minimaxi / OpenAI-TTS
    • Voice Cloning: OmniVoice-TTS (local), Qwen-TTS (local), GPT-SoVITS, CosyVoice, F5-TTS, Index-TTS, Chatterbox

Step 4: Synchronization of Subtitles, Voiceover, and Video

  • Goal: Synchronize the subtitles, voiceover, and video.

  • Corresponding Control Element: The "Synchronization Alignment" row

  • Best Configuration:

    • Select Secondary Recognition. Once the voiceover file has been created, speech recognition runs on it again to generate subtitles with precise timestamps.
    • When translating from Chinese to English, you can set the Voiceover Speed value (e.g., 10 or 15) to speed up the voiceover, because the English translation is often longer than the Chinese original.
    • Select both Voiceover Speed Up and Video Slow Down to force subtitles, audio, and video into alignment. You can also select only one of the two.
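
The trade-off between the two options can be illustrated with some arithmetic. This is a sketch under stated assumptions, not the software's actual algorithm: the function name `plan_alignment` is hypothetical, and the "meet in the middle" rule used when both options are enabled is an assumption chosen for illustration.

```python
def plan_alignment(slot_ms, dub_ms, speed_up_audio=True, slow_down_video=True):
    """Return (audio_rate, video_rate) for one subtitle segment.

    slot_ms: duration of the original video segment.
    dub_ms:  duration of the generated voiceover for that segment.
    audio_rate > 1 means the voiceover is played faster.
    video_rate < 1 means the video is played slower.
    """
    # Nothing to do if the voiceover already fits or no option is enabled.
    if dub_ms <= slot_ms or not (speed_up_audio or slow_down_video):
        return 1.0, 1.0
    if speed_up_audio and slow_down_video:
        # Assumption: share the overshoot evenly between audio and video.
        target_ms = (slot_ms + dub_ms) / 2
    elif speed_up_audio:
        target_ms = slot_ms  # audio absorbs the whole overshoot
    else:
        target_ms = dub_ms   # video absorbs the whole overshoot
    return dub_ms / target_ms, slot_ms / target_ms


if __name__ == "__main__":
    # A 2 s segment whose voiceover came out at 3 s.
    print(plan_alignment(2000, 3000))
    print(plan_alignment(2000, 3000, slow_down_video=False))
    print(plan_alignment(2000, 3000, speed_up_audio=False))
```

Under this assumption, enabling both options keeps each individual change smaller than it would be with only one option enabled, which is why selecting both tends to sound and look more natural.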

Step 5: Other Options to Improve Quality

  1. Select Send Full Subtitles. Also select Menu -> Tools -> Advanced Options -> AI Translation with Full Original Subtitles, and set Number of subtitle lines per batch for AI translation channel to 100 or higher. This improves translation quality, but it requires an online large AI model with a very large context window, such as GPT-5.5+/Gemini-3.1-pro+/DeepSeek-v4.
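
To make the batch-size setting concrete, here is a minimal sketch of what batching subtitle lines means. It is an illustration only: the function name `batch_subtitles` is hypothetical, and the way the software actually builds its requests is not shown in this guide.

```python
def batch_subtitles(lines, batch_size=100):
    """Split subtitle lines into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]


if __name__ == "__main__":
    subtitles = [f"line {n}" for n in range(1, 251)]
    for batch in batch_subtitles(subtitles, batch_size=100):
        # Each batch would be sent to the model in a single request.
        print(len(batch), batch[0], "...", batch[-1])
```

Larger batches give the model more surrounding context per request, which is why a model with a large context window is needed.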

When using the clone role to reproduce the original speaker's voice in the voiceover:

  1. If you use CosyVoice, GPT-SoVITS, F5-TTS, or a similar voiceover channel, open Menu -> Tools -> Advanced Options -> Speech Recognition Parameters and set Minimum voice duration (ms) to 3000 and Maximum voice duration (seconds) to 10. Voice cloning automatically uses the original audio segment that matches each subtitle's duration as its reference audio, and most voiceover channels require that reference to be 3 to 10 seconds long; otherwise the voiceover is likely to fail. Also select Whisper Pre-Split Audio and Merge Overly Short Subtitles with Adjacent Ones so that subtitle durations fall within the 3 to 10 second range.
  2. If many of your subtitles are shorter than 3 seconds, it is recommended to use the OmniVoice-TTS voiceover channel. It can avoid errors with short reference audio.
  3. Use an AI engine for the translation channel, such as DeepSeek or OpenAI ChatGPT, and select Send Full Subtitles.
  4. For the speech recognition channel, use Doubao Speech Recognition Large Model (Speed Version), Qwen-ASR, or Alibaba Bailian ASR for Chinese. For English, use faster-whisper with the large-v3 model and select Default Sentence Segmentation.
  5. To keep the original video's background sound in the output, click Set More Parameters and select Separate Voice and Background Sound. If you don't need the background sound, select Noise Reduction instead.
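
The 3 to 10 second reference-audio window from item 1 can be illustrated with a small merging routine. This is a sketch of the general idea only, not the software's implementation of Merge Overly Short Subtitles with Adjacent Ones: the `Cue` class and `merge_short_cues` function are hypothetical names, and the greedy forward-merge rule is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


def merge_short_cues(cues, min_ms=3000, max_ms=10_000):
    """Greedily merge a too-short cue with the cue that follows it.

    A merge happens only if the combined span stays within max_ms.
    """
    merged = []
    for cue in cues:
        if merged:
            prev = merged[-1]
            prev_is_short = prev.end_ms - prev.start_ms < min_ms
            combined_fits = cue.end_ms - prev.start_ms <= max_ms
            if prev_is_short and combined_fits:
                merged[-1] = Cue(prev.start_ms, cue.end_ms, f"{prev.text} {cue.text}")
                continue
        merged.append(Cue(cue.start_ms, cue.end_ms, cue.text))
    return merged


if __name__ == "__main__":
    cues = [
        Cue(0, 1000, "a"),
        Cue(1000, 2500, "b"),
        Cue(2500, 4000, "c"),
        Cue(4000, 9000, "d"),
    ]
    for cue in merge_short_cues(cues):
        print(cue.start_ms, cue.end_ms, cue.text)
```

Because this sketch only merges forward, a short cue at the very end of the file stays short. Segments like that are the case where a channel that tolerates short reference audio, such as OmniVoice-TTS in item 2, is the safer choice.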

View Advanced Options - More Fine-Tuning