An ideal translated video has accurate subtitles of appropriate length, a voice-over whose tone matches the original audio, and subtitles, audio, and video that stay precisely synchronized.

This guide walks through the four steps of video translation and recommends the optimal configuration for each.

Step 1: Speech Recognition

  • Goal: Convert the speech in the video into a subtitle file in the corresponding language.

  • Corresponding Control Element: the "Speech Recognition" row

  • Optimal Configuration:

    • Select faster-whisper(local)
    • For the model, choose large-v2, large-v3, or large-v3-turbo
    • For the speech segmentation mode, choose Whole recognition
    • Select Noise reduction (time-consuming)
    • Select Retain original background sound (time-consuming)
    • If the video is in Chinese, also select Chinese re-segmentation
  • Note: Without an NVIDIA GPU, or if the CUDA environment is not configured or CUDA acceleration is not enabled, processing will be extremely slow. Insufficient VRAM may cause crashes.
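
As an illustration of the goal of this step, the sketch below turns a list of recognized segments into SRT subtitle text. It is a minimal example rather than pyvideotrans's own code: the function names and the `(start, end, text)` tuple layout are assumptions made for this sketch.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from (start_sec, end_sec, text) tuples (hypothetical layout)."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each SRT cue is: index, time range, text, then a blank line.
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 1.25, " Hello "), (1.25, 3.0, "World")]
    print(segments_to_srt(demo))
```

Any recognizer that reports start time, end time, and text per segment could feed a function like this.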

Step 2: Subtitle Translation

  • Goal: Translate the subtitle file generated in step one into the target language.

  • Corresponding Control Element: the "Translation Channel" row

  • Optimal Configuration:

    • Preferred choice: If you have a VPN and know how to configure the proxy, use the gemini-1.5-flash model (Gemini AI channel), set under Menu - Translation Settings - Gemini pro.
    • Second choice: If you don't have a VPN or don't know how to configure a proxy, select OpenAI ChatGPT in "Translation Channel" and use a chatgpt-4o series model, set under Menu - Translation Settings - OpenAI ChatGPT (requires a third-party relay).
    • Alternative: If you can't find a suitable third-party relay, use a Chinese AI service such as Moonshot AI (Kimi) or DeepSeek.
    • In Menu - Tools/Options - Advanced Options, select the two items shown in the following figure: image.png

    Gemini AI usage instructions: https://pyvideotrans.com/gemini.html

Step 3: Voice-over

  • Goal: Generate a voice-over based on the translated subtitle file.

  • Corresponding Control Element: the "Voice-over Channel" row

  • Optimal Configuration:

    • Chinese or English: F5-TTS(local), with the voice role set to clone
    • Japanese and Korean: CosyVoice(local), with the voice role set to clone
    • Other languages: clone-voice(local), with the voice role set to clone
    • All three channels preserve as much of the original video's emotional tone as possible, with F5-TTS giving the best results.

    Each channel requires installing the corresponding F5-TTS/CosyVoice/clone-voice integration package separately; see the documentation: https://pyvideotrans.com/f5tts.html

Step 4: Subtitle, Voice-over, and Video Synchronization

  • Goal: Synchronize the subtitles, voice-over, and video.
  • Corresponding Control Element: the "Synchronization" row
  • Optimal Configuration:
    • When translating from Chinese to English, set the Voice-over speed value (e.g., 10 or 15) to speed up the voice-over, since English sentences are usually longer.
    • Select the Video extension, Voice-over acceleration, and Video slowdown options to force alignment of subtitles, audio, and video.
    • In Menu - Tools/Options - Advanced Options - Subtitle Audio Video Alignment Area, make the following settings: image.png
    • Maximum audio acceleration factor and Video slowdown factor can be adjusted according to the actual situation (default is 3).

    Fine-tune which options are selected, and their values, according to the actual speaking speed in the video.
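
To make the interplay of these limits concrete, here is a simplified sketch of fitting one dubbed segment into its subtitle slot. It is not pyvideotrans's actual alignment algorithm. The function name, the order of operations (speed up the audio first, then slow the video, then extend), and the returned keys are all assumptions for illustration. Only the default cap of 3 comes from the settings above.

```python
def plan_alignment(slot_sec: float, dub_sec: float,
                   max_audio_speedup: float = 3.0,
                   max_video_slowdown: float = 3.0) -> dict:
    """Plan how to fit a dubbed clip of dub_sec into a subtitle slot of slot_sec.

    Hypothetical strategy: speed up the audio up to its cap, then slow the
    video up to its cap; whatever still does not fit is reported as overflow,
    which would have to be covered by extending the video.
    """
    if slot_sec <= 0 or dub_sec <= 0:
        raise ValueError("durations must be positive")

    # The dub already fits: nothing to change.
    if dub_sec <= slot_sec:
        return {"audio_speedup": 1.0, "video_slowdown": 1.0, "overflow_sec": 0.0}

    # 1) Speed up the voice-over, capped at the maximum acceleration factor.
    audio_speedup = min(dub_sec / slot_sec, max_audio_speedup)
    dub_after = dub_sec / audio_speedup

    # 2) Slow down the video for this slot, capped at the slowdown factor.
    video_slowdown = min(dub_after / slot_sec, max_video_slowdown)
    slot_after = slot_sec * video_slowdown

    # 3) Any remainder must be absorbed by extending the video.
    overflow_sec = max(0.0, dub_after - slot_after)
    return {
        "audio_speedup": audio_speedup,
        "video_slowdown": video_slowdown,
        "overflow_sec": overflow_sec,
    }


if __name__ == "__main__":
    print(plan_alignment(2.0, 3.0))   # audio alone absorbs the difference
    print(plan_alignment(1.0, 12.0))  # both caps are hit, some overflow remains
```

Lower caps keep speech and motion more natural but leave more overflow for video extension to cover, which is the trade-off to weigh when tuning these values.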

Output Video Quality Control

  • The default output uses lossy compression. For lossless output, go to Menu - Tools/Options - Advanced Options - Video Output Control Area and set Video transcoding loss control to 0: image.png
  • Note: If the original video is not in mp4 format, or hard subtitles are embedded, the video must be re-encoded, which causes some loss, though the loss is usually negligible. Raising video quality significantly reduces processing speed and increases the size of the output video.
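
For readers who want to reproduce a comparable quality setting outside the GUI, the sketch below builds an ffmpeg command line. It assumes, without confirmation from this guide, that "Video transcoding loss control" behaves like the x264 CRF value, where 0 is lossless and larger values compress more. The function name and its defaults are hypothetical.

```python
def build_transcode_command(src: str, dst: str, crf: int = 0,
                            preset: str = "medium") -> list:
    """Return an ffmpeg argument list that re-encodes src to dst with libx264.

    Assumption: the GUI's loss-control value is treated here as an x264 CRF
    (0 = lossless, 51 = lowest quality for 8-bit x264).
    """
    if not 0 <= crf <= 51:
        raise ValueError("crf must be between 0 and 51")
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", src,               # input video
        "-c:v", "libx264",       # H.264 software encoder
        "-crf", str(crf),        # quality level; 0 requests lossless encoding
        "-preset", preset,       # speed/size trade-off of the encoder
        "-c:a", "copy",          # copy audio unchanged (codec must suit the container)
        dst,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to subprocess.run to execute.
    print(" ".join(build_transcode_command("input.mkv", "output.mp4")))
```

Under this assumption, a CRF of 0 produces much larger files and slower encodes than a typical lossy setting, consistent with the note above.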