Translate Video & Audio

When you open the software, it starts in the Translate Video & Audio workspace, which is its core feature.

The basic workflow is: Select the source video to translate -> Choose the speech recognition model -> Select the source language and target language -> Choose the text translation service -> Select the dubbing service and voice role -> Start processing.

The following steps will guide you through a complete video and audio translation task.

Step 1: Select the Video to Translate

Supported video formats: mp4/mov/avi/mkv/webm/mpeg/ogg/mts/ts

Supported audio formats: wav/mp3/m4a/flac/aac

  • Select Audio/Video: Click this button to select one or more audio/video files for translation (hold Ctrl for multi-select).

  • Folder: Check this box to batch process all videos within an entire folder.

  • Clean Generated: Check this box to reprocess the same video (instead of using cached results).

  • Output to...: By default, translated files are saved to the _video_out folder in the original video's directory. Click this button to set a custom output directory for the translated videos.

  • Shutdown after completion: Automatically shuts down the computer after processing all tasks, ideal for large batch or long-duration tasks.
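The file selection rules above can be sketched in a few lines of Python. This is only an illustration, not the software's own code: the function names are hypothetical, and scanning only the top level of the folder is an assumption (the guide does not say whether subfolders are included).

```python
from pathlib import Path

# Extensions taken from the supported-format lists above.
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm", "mpeg", "ogg", "mts", "ts"}
AUDIO_EXTS = {"wav", "mp3", "m4a", "flac", "aac"}
SUPPORTED_EXTS = VIDEO_EXTS | AUDIO_EXTS


def collect_media(folder):
    """Return supported audio/video files directly inside `folder`, sorted by path.

    Assumption: only the top level of the folder is scanned.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower().lstrip(".") in SUPPORTED_EXTS
    )


def default_output_dir(source):
    """Default output location: a `_video_out` folder next to the source file."""
    return Path(source).parent / "_video_out"
```

For a source file at /videos/talk.mp4, `default_output_dir` returns /videos/_video_out, matching the default described for Output to....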

Step 2: Speech Recognition Service

  • Speech Recognition: Used to transcribe speech from audio or video into subtitle files. The quality of this step directly determines the final outcome. Supports over a dozen different recognition methods.

  • faster-whisper (Local): A local model (downloaded from the internet on first run) that offers good speed and quality. It's a good choice if you have no special requirements. It comes in about a dozen different sizes. The smallest, fastest, and most resource-efficient model is tiny, but its accuracy is very low, so it is not recommended. The best quality model is large-v3. Models ending in .en or starting with distil- are for English-only audio.

  • openai-whisper (Local): Similar to the model above but slower. It might offer slightly higher accuracy. Again, the large-v3 model is recommended.

  • qwen-asr (Local): Alibaba's local recognition model, which works well for Chinese. If your source video is in Chinese, try this one. It also requires an online download on first run.

  • Noise Reduction: If selected, the software downloads Alibaba's denoising model from modelscope.cn and denoises the audio before speech recognition, which can improve recognition accuracy.

  • Secondary Recognition: Available when dubbing is selected and single (not dual) embedded subtitles are chosen. After dubbing, the software runs speech recognition again on the dubbed audio to generate shorter subtitles for embedding in the video, so that the subtitles align precisely with the dub.

  • Default Sentence Segmentation vs. LLM Re-segmentation: LLM re-segmentation sends the transcribed subtitle text to a large language model, which corrects misspellings and awkward sentence breaks to produce smoother, more natural results. It requires a configured DeepSeek or OpenAI ChatGPT translation service (Menu -> Tools -> Advanced Options -> General Settings). LLM re-segmentation can sometimes make results worse, because its effectiveness depends on the capability of the model. When cloning the original voice (i.e., the dubbing role is clone), it is not recommended; stick with the default.

  • Other services: Various online APIs and local models are also supported, such as ByteDance Volcano Subtitle Generation, OpenAI Speech Recognition, Gemini Speech Recognition, and Alibaba Qwen3-ASR.

Click here to view all supported speech recognition services
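Whichever recognition service is used, the result of this step is a subtitle file. As a rough illustration of what that means, the sketch below turns timed text segments into SRT text. The segment tuples and function names are hypothetical, and this is not the software's own implementation.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render (start_seconds, end_seconds, text) tuples as SRT text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Blocks are separated by one blank line, as the SRT format expects.
    return "\n".join(blocks)
```

Each subtitle entry consists of a line number, a time range, and the recognized text. These are the same pieces of information that later steps translate and dub.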

Step 3: Translation Service

Translation Service: Used to translate the transcribed source language subtitle file into the target language subtitle file. A dozen different translation services are available.

  • Free Traditional Translation: Google Translate (requires proxy), Microsoft Translate (no proxy needed), M2M100 Local Translation, DeepLX (requires self-deployment).
  • Paid Traditional Translation: Baidu Translate, Tencent Translate, Alibaba Machine Translation, DeepL.
  • AI Intelligent Translation: OpenAI ChatGPT, Gemini, DeepSeek, Claude, Zhipu AI, SiliconFlow, 302.AI, etc. You need to provide your own API keys and enter them under Menu -> Translation Settings -> Corresponding Service Settings Panel.
  • Compatible AI/Local Model: Self-hosted local large models are also supported. Select the Compatible AI/Local Model service and enter the API address under Menu -> Translation Settings -> Local Model Settings.
  • Source Language: The language spoken by people in the source video. Make sure to select the correct one. If unsure, choose auto.
  • Target Language: The language you want the audio/video translated into.
  • Translation Glossary: Terms sent to the AI during AI translation.
  • Send Full Subtitles: Sends subtitles along with line numbers and timestamps to the AI during AI translation. It is recommended to select this option when using AI translation services.

Click here to view all supported translation services
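To make the Send Full Subtitles option more concrete, the sketch below contrasts the two kinds of text that could be sent to an AI translation service: full subtitle entries (line number, timestamps, text) versus text only. The exact format the software sends is not documented here, so the SRT-like layout, the dictionary keys, and the function names are all assumptions for illustration.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_payload(cues, send_full):
    """Build the text for one translation request.

    `cues` is a list of dicts with hypothetical keys: index, start, end, text.
    """
    if send_full:
        # Full subtitles: keep line numbers and timestamps with each entry.
        return "\n\n".join(
            f"{c['index']}\n{c['start']} --> {c['end']}\n{c['text']}" for c in cues
        )
    # Text only: one subtitle line per row.
    return "\n".join(c["text"] for c in cues)
```

Keeping line numbers and timestamps in the request gives the AI a fixed structure to return, which is one plausible reason the option is recommended for AI translation services.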

Step 4: Dubbing Service

Dubbing Service: The translated subtitle file will be voiced using the service specified here. Supports online TTS APIs like Qwen-TTS/Edge-TTS/Elevenlabs/Minimaxi, as well as self-hosted open-source TTS models. Edge-TTS is a free dubbing service, ready to use out of the box. Some services require configuration via Menu -> TTS Settings -> Corresponding Service Panel.

  • Dubbing Role: Each dubbing service usually offers multiple speakers. First select the target language, then you can choose a dubbing role.
  • Preview Dub: After selecting a dubbing role, you can click to preview the voice effect.

Selecting clone as the dubbing role means the software will attempt to use the original video's voice for dubbing.

Click here to view all supported dubbing services

Step 5: Synchronization & Subtitles

Root Cause of Desynchronization After Video Translation

When translating one language into another and dubbing it, the duration of the spoken audio inevitably changes due to different syllable counts and grammatical structures. This leads to desynchronization between subtitles, audio, and video, which is normal.

This manifests as: the person in the original video stops speaking, but the dub has only played half; or the original person is still speaking, but the dub finishes early.

To address this, you can speed up the dubbed audio, slow down the video, or both, within reasonable limits.

Adjustment is applied only when a dubbed segment is longer than the original segment, to prevent overlapping audio. No adjustment is made if the dub is shorter.

  • Audio Speedup: If a dubbed segment is longer than the original audio segment, speed up the dub to match the original duration.
  • Video Slowdown: Similarly, if a dubbed segment is longer than the video segment, slow down the video playback speed for that segment to match the dub duration. (If selected, processing will be time-consuming and will generate many intermediate segments. To minimize quality loss, the overall file size might increase several times compared to the original video.)
  • No Subtitle Embedding: Only replace the audio; no subtitles are added.
  • Embed Hard Subtitles: Permanently "burn" subtitles into the video frame. They cannot be turned off and will always be displayed regardless of the player.
  • Embed Soft Subtitles: Embeds subtitles as a separate track in the video. The player can choose to turn them on or off. Note: soft subtitles usually cannot be displayed when playing in a web browser.
  • (Dual Subtitles): Each subtitle line consists of two rows: the original language subtitle and the target language subtitle.
  • Network Proxy: Users in mainland China need a proxy to access foreign services such as Google, Gemini, or OpenAI. If you run a VPN or proxy client and know its local address and port, enter the proxy URL here, e.g., http://127.0.0.1:7860.

Click here to view the principles of dubbing, subtitle, and video synchronization in video translation
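The arithmetic behind Audio Speedup and Video Slowdown can be sketched as follows. This is a minimal illustration of the rule described above (adjust only when the dub is longer than the original segment). The function names are hypothetical, and the software's actual algorithm may apply additional limits.

```python
def audio_speedup_factor(original_s, dub_s):
    """Tempo multiplier for the dub so it fits the original segment.

    1.0 means unchanged; values above 1.0 mean the dub is played faster.
    """
    if original_s <= 0 or dub_s <= original_s:
        return 1.0  # the dub already fits, so no adjustment is made
    return dub_s / original_s


def video_playback_speed(original_s, dub_s):
    """Playback speed for the video segment so it lasts as long as the dub.

    1.0 means unchanged; values below 1.0 mean the video is slowed down.
    """
    if dub_s <= 0 or dub_s <= original_s:
        return 1.0
    return original_s / dub_s
```

For example, if the original segment lasts 4 seconds and the dub lasts 5 seconds, the dub would need to play at 1.25x speed, or the video segment at 0.8x speed, for the two to match.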

Step 6: Start Execution

  • CUDA Acceleration: On Windows and Linux, if you have an NVIDIA GPU with CUDA correctly installed, make sure to check this box. It can increase speech recognition speed by several times or even dozens of times.

If you have multiple NVIDIA GPUs, go to Menu -> Tools -> Advanced Options -> General Settings and select Multi-GPU Mode. The software will attempt to process tasks using multiple GPUs in parallel.

Click here to configure the CUDA acceleration environment
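Before checking CUDA Acceleration, it can help to confirm that the NVIDIA driver can see a GPU at all. The sketch below does this by running `nvidia-smi -L`, which lists detected GPUs. It is only a quick driver-level check under the assumption that `nvidia-smi` is on the PATH. It does not verify that the CUDA libraries the software needs are installed, and the function name is hypothetical.

```python
import shutil
import subprocess


def nvidia_gpu_visible(timeout_s=10):
    """Return True if `nvidia-smi -L` succeeds and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # driver tools not found on PATH
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("NVIDIA GPU visible:", nvidia_gpu_visible())
```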

Once all settings are done, click the Start Execution button.

Executing

If you select multiple audio/video files for translation at once, they are processed concurrently, without pausing between stages.

When only one video is selected, a dedicated subtitle editing window pops up after speech transcription is complete. You can modify the subtitles here to make subsequent processing more accurate. Click to view details.

  • 1st Editing Chance: After the speech recognition phase, the subtitle editing window pops up.

  • 2nd Editing Chance: After the subtitle translation phase is complete, the subtitle editing and voice role editing window pops up.

  • 3rd Editing Chance: After dubbing is complete, you can review or re-dub individual subtitle lines.

  • 4th Editing Chance: If you selected Secondary Recognition and have dubbing enabled, after secondary recognition, the subtitle editing window will pop up again, allowing you to fix typos, etc.

Step 7: Progress Bar

After the task is complete, click the progress bar area at the bottom to open the output folder. It contains the final MP4 file, as well as intermediate files such as SRT subtitles and dubbed audio.

Step 8: Set More Parameters

If you need finer control over parameters like speech speed, volume, maximum characters per subtitle line, noise reduction, or speaker diarization, click Set More Parameters.... A dialog opens with the following options:

  • Recognize Speakers: If selected, the software will attempt to identify and differentiate speakers after speech recognition (accuracy is limited). The number next to it represents a preset number of speakers to identify. Setting it in advance can improve accuracy. Default is no limit. The speaker model can be changed in Advanced Options (built-in, Alibaba cam++, pyannote, etc.).

  • Dubbing Speed: Default is 0. A value of 50 increases speed by 50%, while -50 decreases speed by 50%.

  • Volume+: Default is 0. A value of 50 increases volume by 50%, while -50 decreases volume by 50%.

  • Pitch+: Default is 0. A value of 20 raises the pitch by 20Hz (making the voice higher), while -20 lowers it by 20Hz (making the voice deeper).

  • Voice Activity Threshold: The minimum probability for an audio segment to be considered speech. VAD calculates a speech probability for each segment. Parts exceeding this threshold are treated as speech, others as silence/noise. Default is 0.5. Lower values are more sensitive but may misclassify noise as speech.

  • Min Speech Duration (ms): Limits the minimum duration of a speech segment. If you have selected voice cloning, keep this value >= 3000.

  • Max Speech Duration (s): Limits the maximum length of a single speech segment. Segments longer than this are forcibly split. Unit: seconds. Default is 6 seconds; do not exceed 30 seconds.

  • Silence Cut Duration (ms): The required duration of silence after speech ends before a speech segment is cut. Unit: ms. Default is 500ms. Speech is only split at silence gaps longer than this value.

  • Batch Size for Traditional Translation Services: Number of subtitle lines sent per batch to traditional translation services.

  • Batch Size for AI Translation Services: Number of subtitle lines sent per batch to AI translation services.

  • Send Full Subtitles: Whether to send the complete subtitle format content when using AI translation services.

  • Pause After Translation (s): Pause duration after each translation batch, used to limit request frequency.

  • Pause After Dubbing (s): Pause duration after each dubbing request, used to limit request frequency.

  • Max Line Length (CJK): Maximum number of characters per subtitle line for CJK (Chinese, Japanese, Korean) languages when embedding subtitles in the video.

  • Max Line Length (Other): Maximum number of characters per subtitle line for non-CJK languages when embedding subtitles in the video.

  • Edit Hard Subtitle Style: Click to open a dedicated hard subtitle style editor.

  • Separate Vocals/Background: If selected, the software will separate the background music/noise from the speech in the video. (This is a pure CPU operation and can be slow.)

  • Embed Background Audio: When this is selected, the separated background audio will be mixed back into the final dubbed video.

  • Loop Background Audio: If the background audio duration is shorter than the final video duration, selecting this will loop the background audio; otherwise, silence will fill the remaining part.

  • Background Audio Volume: Sets the volume level for the re-embedded background audio. Default is 0.8, meaning the volume is reduced to 0.8 times its original level.

  • Add Extra Background Audio: You can also choose a local audio file to use as a new background track.

  • Restore Punctuation: If selected, the software will attempt to add punctuation marks after recognition.
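The four voice activity settings above (Voice Activity Threshold, Min Speech Duration, Max Speech Duration, Silence Cut Duration) interact with each other. The sketch below shows one simplified way they could combine, given a list of per-frame speech probabilities. It is an illustration only, not the VAD implementation the software uses. The function name, the frame-based input, and the `min_speech_ms` default are assumptions.

```python
def vad_segments(probs, frame_ms, threshold=0.5, min_speech_ms=250,
                 max_speech_s=6.0, min_silence_ms=500):
    """Return (start_ms, end_ms) speech segments from per-frame probabilities.

    `min_speech_ms=250` is an illustrative default, not taken from the guide.
    """
    segments = []
    start = None      # start of the current speech segment, in ms
    silence_ms = 0    # length of the current run of non-speech frames
    max_ms = int(max_speech_s * 1000)

    def close(end):
        # Keep the segment only if it meets the minimum speech duration.
        if end - start >= min_speech_ms:
            segments.append((start, end))

    for i, p in enumerate(probs):
        t = i * frame_ms
        if p > threshold:                       # frame counts as speech
            if start is None:
                start = t
            silence_ms = 0
            if t + frame_ms - start >= max_ms:  # forced split at max duration
                close(t + frame_ms)
                start = None
        elif start is not None:                 # silence inside a segment
            silence_ms += frame_ms
            if silence_ms > min_silence_ms:     # gap long enough to cut
                close(t + frame_ms - silence_ms)
                start = None
                silence_ms = 0
    if start is not None:                       # audio ended mid-segment
        close(len(probs) * frame_ms - silence_ms)
    return segments
```

In this sketch, lowering `threshold` makes more frames count as speech, a larger `min_silence_ms` merges phrases separated by short pauses into one segment, and `max_speech_s` caps how long any single segment can grow before it is split.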

Click here for usage instructions on various parameters in Menu -> Tools -> Advanced Options
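The Dubbing Speed, Volume+, and Pitch+ values in Step 8 are signed offsets from a default of 0. The sketch below shows how such offsets map to multipliers and to signed strings such as "+50%" or "-20Hz". The helper names are hypothetical, and the signed-string form is an assumption modeled on how Edge-TTS-style engines express rate, volume, and pitch. How the software passes these values to each dubbing service is not covered here.

```python
def percent_to_multiplier(percent):
    """Convert a signed percentage offset to a multiplier (50 -> 1.5, -50 -> 0.5)."""
    return 1 + percent / 100


def signed_offset(value, unit):
    """Format an integer offset with an explicit sign, e.g. '+50%' or '-20Hz'."""
    return f"{value:+d}{unit}"
```

With these helpers, a Dubbing Speed of 50 corresponds to a multiplier of 1.5 and the string "+50%", and a Pitch+ of -20 corresponds to "-20Hz".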