An ideal translated video should have these features: accurate subtitles, appropriate length, voiceover tone matching the original, and perfect synchronization of subtitles, audio, and video. The software's default settings favor ease of use over output quality, so they aren't optimal. Follow the instructions below to adjust each step for the best configuration.
Step 1: Speech Recognition
Goal: Convert the speech in a video into a subtitle file in the corresponding language.
Corresponding Control Element: The "Speech Recognition" row

Best Configuration for Non-Chinese:
- Free: open-whisper (local) with the large-v3 model | faster-whisper (local) with the large-v3 model
- Paid: OpenAI Speech Recognition API
Best Configuration for Chinese:
- Free: Qwen-ASR (local)
- Paid: Doubao Speech Recognition Large Model (Speed Version) | Alibaba Bailian ASR
Best Configuration for Japanese:
- Free: open-whisper (local) with the large-v3 model, Huggingface_ASR -> reazon-research/japanese-wav2vec2-large-rs35kh
- Paid: OpenAI Speech Recognition API
Best Configuration for Less Common Languages:
- Free: open-whisper (local) with the large-v3 model
- Paid: Gemini Large Model Recognition | OpenAI Speech Recognition API
Note: If you don't have an NVIDIA GPU, or CUDA acceleration isn't configured, processing with local models will be very slow. Processing may also crash if your GPU memory is insufficient.
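The recommendations above can be collected into a small lookup table. The following Python sketch is illustrative only: the category keys, the `ASR_RECOMMENDATIONS` table, and the `recommend_asr` function are hypothetical names and are not part of the software. The values are simply the channel names listed above.

```python
# Hypothetical lookup table; the values are the channel names recommended above.
ASR_RECOMMENDATIONS = {
    "non-chinese": {
        "free": ["open-whisper (local) large-v3", "faster-whisper (local) large-v3"],
        "paid": ["OpenAI Speech Recognition API"],
    },
    "chinese": {
        "free": ["Qwen-ASR (local)"],
        "paid": [
            "Doubao Speech Recognition Large Model (Speed Version)",
            "Alibaba Bailian ASR",
        ],
    },
    "japanese": {
        "free": [
            "open-whisper (local) large-v3",
            "Huggingface_ASR -> reazon-research/japanese-wav2vec2-large-rs35kh",
        ],
        "paid": ["OpenAI Speech Recognition API"],
    },
    "less-common": {
        "free": ["open-whisper (local) large-v3"],
        "paid": ["Gemini Large Model Recognition", "OpenAI Speech Recognition API"],
    },
}


def recommend_asr(category: str, paid: bool = False) -> list[str]:
    """Return the recommended channels for a language category and pricing tier."""
    tier = "paid" if paid else "free"
    return ASR_RECOMMENDATIONS[category.lower()][tier]


if __name__ == "__main__":
    print(recommend_asr("chinese"))
    print(recommend_asr("less-common", paid=True))
```

An unknown category raises `KeyError`, which is deliberate in this sketch: it is better to fail loudly than to fall back silently to a channel that handles the language poorly.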
Step 2: Subtitle Translation
Goal: Translate the subtitle file generated in step one into the target language.
Corresponding Control Element: The "Translation Channel" row

Best Configuration:
- Preferred AI Channel (Paid): DeepSeek, OpenAI ChatGPT (latest model), Gemini (latest model)
- Select "Send Full Subtitles".
Step 3: Voiceover
Goal: Generate a voiceover based on the translated subtitle file.
Corresponding Control Element: The "Voiceover Channel" row

Best Configuration:
- Free: Edge-TTS, which is free and supports all languages.
- Free (Chinese, English, Japanese, Korean): Qwen-TTS (local), F5-TTS / Index-TTS / GPT-SOVITS / CosyVoice (local)
- Paid: Doubao Speech Synthesis 2.0 / Qwen-TTS (bailian) / 302.AI / Minimaxi / OpenAI-TTS
- Voice Cloning: OmniVoice-TTS (local), Qwen-TTS (local), GPT-SOVITS, CosyVoice, F5-TTS, Index-TTS, Chatterbox
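A voiceover channel works through the translated subtitles one cue at a time. As a rough illustration of that input, the Python sketch below parses subtitle text into cues with start and end times in milliseconds. It assumes the subtitle file is in SRT format, which the text above does not state, and `Cue` and `parse_srt` are hypothetical names rather than part of the software.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,200".
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                groups = match.groups()
                # Everything after the timing line is the cue text.
                text = "\n".join(lines[i + 1:]).strip()
                cues.append(Cue(_to_ms(*groups[:4]), _to_ms(*groups[4:]), text))
                break
    return cues
```

Each cue's duration (`end_ms - start_ms`) is the time slot the generated audio has to fit into, which is what the synchronization step later works with.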
Step 4: Synchronization of Subtitles, Voiceover, and Video
Goal: Synchronize the subtitles, voiceover, and video.
Corresponding Control Element: The "Synchronization Alignment" row

Best Configuration:
- Select "Secondary Recognition". This runs speech recognition again on the voiceover file after it's created to generate subtitles with precise timestamps.
- When translating from Chinese to English, you can set the "Voiceover Speed" value (e.g., 10 or 15) to speed up the voiceover, as English sentences are often longer.
- Select both the "Voiceover Speed Up" and "Video Slow Down" options to force alignment of subtitles, audio, and video. You can also choose only one of them.
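To make the trade-off between speeding up the voiceover and slowing down the video concrete, here is a minimal Python sketch of one way an overlong voiceover could be fitted to its subtitle slot. It is an assumption for illustration, not the software's actual algorithm: `AlignmentPlan` and `plan_alignment` are hypothetical names, and splitting the overflow evenly when both options are enabled is a design choice of this sketch.

```python
from dataclasses import dataclass


@dataclass
class AlignmentPlan:
    audio_rate: float     # playback-rate multiplier for the voiceover (>1 means faster)
    video_stretch: float  # duration multiplier for the video segment (>1 means slower)
    target_ms: float      # segment duration targeted after adjustment


def plan_alignment(
    slot_ms: float,
    dub_ms: float,
    speed_up_audio: bool = True,
    slow_down_video: bool = True,
) -> AlignmentPlan:
    """Decide how to fit a voiceover of dub_ms into a subtitle slot of slot_ms."""
    if slot_ms <= 0 or dub_ms <= 0:
        raise ValueError("durations must be positive")
    if dub_ms <= slot_ms or not (speed_up_audio or slow_down_video):
        # The voiceover already fits, or no adjustment is allowed: change nothing.
        return AlignmentPlan(1.0, 1.0, float(slot_ms))
    if speed_up_audio and slow_down_video:
        # Assumed policy: meet in the middle, so each track absorbs half the overflow.
        target_ms = (slot_ms + dub_ms) / 2
    elif speed_up_audio:
        target_ms = float(slot_ms)  # only the audio changes
    else:
        target_ms = float(dub_ms)   # only the video changes
    return AlignmentPlan(dub_ms / target_ms, target_ms / slot_ms, target_ms)


if __name__ == "__main__":
    print(plan_alignment(4000, 6000))
    print(plan_alignment(4000, 6000, slow_down_video=False))
```

Enabling both options keeps each individual change smaller: in this sketch a 6-second voiceover in a 4-second slot needs a 1.5x audio speed-up on its own, but only 1.2x when the video is also stretched.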
Step 5: Other Options to Improve Quality
- Select "Send Full Subtitles". Also select "Menu -> Tools -> Advanced Options -> AI Translation with Full Original Subtitles", and set "Number of subtitle lines per batch for AI translation channel" to 100 or higher. This results in better translation quality. Note, however, that it requires an online large AI model with a very large context window, such as GPT-5.5+/Gemini-3.1-pro+/DeepSeek-v4.
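The batch-size setting controls how many subtitle lines are sent to the model in a single request. The Python sketch below only illustrates the batching idea. `batch_lines` is a hypothetical helper and does not reflect how the software actually builds its requests.

```python
def batch_lines(lines: list[str], batch_size: int = 100) -> list[list[str]]:
    """Split subtitle lines into consecutive batches of at most batch_size lines."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]


if __name__ == "__main__":
    subtitles = [f"line {n}" for n in range(1, 251)]
    for batch in batch_lines(subtitles):
        print(len(batch), batch[0], "...", batch[-1])
```

Larger batches give the model more surrounding context per request, which is why a model with a large context window is needed.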
When using the "clone" role to clone the original voice tone for the voiceover:
- If using CosyVoice, GPT-SoVITS, F5-TTS, or similar voiceover channels, open "Menu -> Tools -> Advanced Settings -> Speech Recognition Parameters". It is recommended to set "Minimum voice duration (ms)" to 3000 and "Maximum voice duration (seconds)" to 10. Voice cloning automatically uses the original audio segment corresponding to each subtitle's duration as the reference audio, and most voiceover channels require this reference audio to be 3 to 10 seconds long; otherwise the voiceover is likely to fail. Also select "Whisper Pre-Split Audio" and "Merge Overly Short Subtitles with Adjacent Ones" to keep subtitle durations within the 3-10 second range.
- If many of your subtitles are shorter than 3 seconds, it is recommended to use the "OmniVoice-TTS" voiceover channel, which can avoid errors with short reference audio.
- Use an AI engine for the translation channel, such as DeepSeek or OpenAI ChatGPT, and select "Send Full Subtitles".
- For the speech recognition channel, use Doubao Speech Large Model (Speed Version), Qwen-ASR, Alibaba Bailian, or similar for Chinese. For English, use "Faster-whisper" with the "large-v3" model, and select "Default Sentence Segmentation".
- If you need to re-embed the original video's background sound, click "Set More Parameters" and select "Separate Voice and Background Sound". If not needed, select "Noise Reduction".
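The 3-10 second constraint on reference audio can be illustrated with a small Python sketch that merges overly short subtitles into the following one. This is a simplified greedy approach assumed for illustration. It is not the software's implementation of the merge option, and `Segment`, `merge_short_segments`, and `usable_as_reference` are hypothetical names.

```python
from dataclasses import dataclass

MIN_MS = 3000   # minimum reference duration recommended above
MAX_MS = 10000  # maximum reference duration recommended above


@dataclass
class Segment:
    start_ms: int
    end_ms: int
    text: str

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def merge_short_segments(segments, min_ms=MIN_MS, max_ms=MAX_MS):
    """Greedily merge a too-short segment with the next one.

    A merge happens only if the combined span, including the gap between
    the two segments, does not exceed max_ms.
    """
    merged = []
    for seg in segments:
        if merged:
            prev = merged[-1]
            if prev.duration_ms < min_ms and seg.end_ms - prev.start_ms <= max_ms:
                merged[-1] = Segment(prev.start_ms, seg.end_ms, f"{prev.text} {seg.text}")
                continue
        merged.append(Segment(seg.start_ms, seg.end_ms, seg.text))
    return merged


def usable_as_reference(seg, min_ms=MIN_MS, max_ms=MAX_MS):
    """Check whether a segment's duration falls inside the reference-audio range."""
    return min_ms <= seg.duration_ms <= max_ms


if __name__ == "__main__":
    cues = [
        Segment(0, 1000, "a"),
        Segment(1200, 2500, "b"),
        Segment(2600, 6000, "c"),
        Segment(6500, 7000, "d"),
        Segment(20000, 24000, "e"),
    ]
    for seg in merge_short_segments(cues):
        print(seg, usable_as_reference(seg))
```

A short segment that is far from its neighbors, like `d` in the usage example, cannot be merged without exceeding the maximum and stays too short. That is the situation in which a channel that tolerates short reference audio is the safer choice.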
