This is an open-source and free project. Due to limited resources, the documentation may lag behind the actual software interface. Please refer to the software for the most accurate information.

This is a powerful, open-source video translation / speech transcription / speech synthesis tool. It seamlessly converts a video from one language into a video with dubbing and subtitles in another language.

Core Features at a Glance

  • Fully Automatic Video & Audio Translation: Intelligently recognizes and transcribes speech from your audio/video, generates source language subtitles, translates them into target language subtitles, creates dubbing, and merges everything back into the original video.
  • Speech-to-Text / Audio/Video to Subtitles: Accurately transcribes speech from batches of video or audio files into SRT subtitle files with timestamps.
  • Text-to-Speech (TTS) / Dubbing: Utilizes multiple advanced TTS engines to generate high-quality, natural-sounding dubbing for your text or SRT subtitle files.
  • SRT Subtitle File Translation: Supports batch translation of SRT subtitle files while preserving original timecodes and formatting, offering various bilingual subtitle styles.
  • Script Alignment & Timing: Aligns existing text scripts with audio/video to create precisely timed SRT subtitles.
  • Real-time Speech-to-Text: Supports real-time microphone monitoring to transcribe spoken words into text.

How the Software Works

Before you begin, it's crucial to understand the core workflow of this software.

The software first uses a Speech Recognition engine to transcribe human speech from an audio or video file into a subtitle file. Then, it uses a Translation engine to translate this subtitle file into your desired target language. Next, it uses the selected TTS/Dubbing engine to create audio from the translated subtitles. Finally, it embeds and synchronizes the new subtitles, audio, and original video to complete the video translation process.

A complete video translation goes through 4 stages: [Speech-to-Text] -> [Subtitle Translation] -> [Dubbing] -> [Synthesis]

  • Speech-to-Text: Transcribes speech from the video into subtitle text. This is done by the Speech Recognition engine.
  • Subtitle Translation: Translates the transcribed subtitle text into the target language. This is done by the Translation engine.
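  • Dubbing: Converts the translated subtitle text into audio. This is done by the Dubbing engine.
  • Synthesis: Merges the new subtitles, the dubbed audio, and the original video into the final translated video.

  • Handles: Any audio/video containing human speech, even without embedded subtitles.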
  • Cannot Handle: Videos with only background music and hardcoded subtitles (burned into the video frames) where no one is speaking. The software cannot extract hardcoded subtitles from video frames. The software cannot remove existing hardcoded subtitles from the original video.
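The four stages above can be pictured as a simple chain in which each stage consumes the output of the previous one. The following is only a minimal sketch of that idea, not the software's actual code: the `Subtitle` class, the `translate_video` function, and the four stage callables are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtitle:
    """One subtitle line (hypothetical structure, for illustration only)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def translate_video(
    video_path: str,
    recognize: Callable[[str], List[Subtitle]],
    translate: Callable[[List[Subtitle]], List[Subtitle]],
    dub: Callable[[List[Subtitle]], str],
    synthesize: Callable[[str, List[Subtitle], str], str],
) -> str:
    """Run the four stages in order and return the path of the final video."""
    source_subs = recognize(video_path)      # 1. Speech-to-Text
    target_subs = translate(source_subs)     # 2. Subtitle Translation
    dubbed_audio = dub(target_subs)          # 3. Dubbing
    # 4. Synthesis: combine original video, new subtitles, and new audio
    return synthesize(video_path, target_subs, dubbed_audio)
```

Because every later stage depends on the transcribed subtitles, a poor result in the Speech-to-Text stage carries through to translation, dubbing, and the final video.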

Download & Installation

1.0 macOS / Linux Users (Source Code)

For macOS and Linux users, please deploy using the source code.

1.1 Windows Users (Pre-packaged EXE)

We offer a ready-to-use pre-packaged version for Windows 10/11 users. No complex configuration is needed; just download, extract, and run.

Click to download the Windows pre-packaged version, ready to use after extraction

Do NOT double-click sp.exe while it's still inside the compressed archive. This will certainly cause errors.

Please note the following before use to avoid most errors.

  1. Do not extract to system folders requiring special permissions, such as C:/Program Files or C:/Windows.
  2. It is recommended to create a folder whose path contains only English letters and numbers, e.g., D:/videotrans, and extract the archive into it. The path should not be too deep.
  3. Keep the filenames of the videos you want to translate short (e.g., under 30 characters). Very long filenames (hundreds of characters), combined with the path and other commands, might exceed the system limit on Windows and cause errors.

Also, ensure filenames do not contain special symbols like ":?*. This is especially important for videos downloaded from YouTube, which often have extremely long filenames containing various special characters. Using them directly without modification is likely to cause various errors on Windows. It is recommended to rename them to short names and remove any special characters.

  4. On Windows, please enable "Show file name extensions" (they are hidden by default). This can prevent some errors, especially when dealing with reference audio files.

Open any folder, click View -> Show -> File name extensions and check it. Once checked, an .mp4 video will show .mp4 at the end of its name, and a .wav audio will show .wav.
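The filename advice above (short names, no special symbols) can be applied in bulk before importing videos. The helper below is a hedged sketch that is not part of the software: the function name `safe_media_name`, the 30-character limit taken from the example above, and the fallback name `video` are all assumptions for illustration.

```python
import re

# Characters Windows does not allow in filenames, plus control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_media_name(filename: str, max_stem_len: int = 30) -> str:
    """Return a short filename with Windows-invalid characters removed."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension present
        stem, ext = filename, ""
    stem = _INVALID.sub("", stem)
    stem = re.sub(r"\s+", "_", stem.strip())   # collapse whitespace to "_"
    stem = stem[:max_stem_len].rstrip("._") or "video"
    ext = _INVALID.sub("", ext).lower()
    return f"{stem}.{ext}" if ext else stem


print(safe_media_name('What is AI? "Full: Guide" *2024*.MP4'))
# What_is_AI_Full_Guide_2024.mp4
```

Renaming the file on disk (for example with `os.rename`) is left to the reader, since the right moment to rename depends on where the download is stored.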

Starting the Software

After extraction, enter the folder and double-click the sp.exe file to run it.

The software needs to load many modules upon the first launch. This may take tens of seconds or even 2-3 minutes. Please be patient.

Software Interface

After launching the software, you will see the main interface below. Click Set More Parameters to reveal detailed configuration options.

What does "Free / Local API (External) / Local Built-in" mean?
  • Free: For example, Google Translate, Microsoft Translator, and Edge-TTS dubbing. These are free online services that work out-of-the-box (no API key needed for some). Be aware of rate limits; high-frequency use might cause errors.
  • Local API (External): Many open-source models can be deployed locally on your machine. After deployment, enter the API or WebUI address in pyVideoTrans's settings. The software can then call your deployed model service (e.g., GPT-SoVITS, CosyVoice, F5-TTS).
  • Local Built-in: Some models are integrated directly into pyVideoTrans, eliminating the need for separate deployment. These are ready to use out-of-the-box (e.g., VITS, Piper, Qwen3-TTS, Qwen3-ASR, SuperionTTS, ChatterBox). However, to prevent the software installer from being too large, the model files are not included. You need to download them online the first time you use them. Updates may be checked on subsequent uses, requiring an internet connection. For a pure offline experience, you would need to deploy the software from source code.

Model download sources are huggingface.co (blocked in some regions) / hf-mirror.com (mirror in China) / modelscope.cn (Alibaba ModelScope in China) / github.com (Microsoft's repository).

Core Feature: Translate Video & Audio

When you open the software, the default screen is the "Translate Video and Audio" workspace. This is the software's most essential feature.

The basic workflow is: Select the original video -> Choose the model -> Select the source language and target language -> Choose the text translation engine -> Choose the dubbing engine and role -> Start execution.

The following steps will guide you through a complete video/audio translation task.

Row 1: Select the Video to Translate

Supported video formats: mp4/mov/avi/mkv/webm/mpeg/ogg/mts/ts

Supported audio formats: wav/mp3/m4a/flac/aac

  • Select Audio or Video: Click this button to select one or more video/audio files for translation (hold Ctrl for multiple selection).

  • Folder: Check this box to batch process all files within an entire folder.

  • Clean Generated Files: If you need to reprocess the same video (instead of using cached results), check this box.

  • Output to...: By default, the translated file is saved in the _video_out subfolder within the original video's directory. Click this button to set a custom output directory.

  • Shutdown After Completion: Automatically shuts down your computer after processing all tasks. Ideal for large, lengthy batch jobs.
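To check in advance which files in a folder fall under the supported formats listed above, a small script can filter by extension. This is a sketch only: the names `media_kind` and `collect_media` are hypothetical, and scanning only the top level of the folder (no subfolders) is an assumption, since the behavior of the Folder option for nested directories is not stated here.

```python
from pathlib import Path
from typing import List, Optional

# Extensions copied from the supported-format lists above.
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm", "mpeg", "ogg", "mts", "ts"}
AUDIO_EXTS = {"wav", "mp3", "m4a", "flac", "aac"}


def media_kind(path) -> Optional[str]:
    """Return "video", "audio", or None based on the file extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in VIDEO_EXTS:
        return "video"
    if ext in AUDIO_EXTS:
        return "audio"
    return None


def collect_media(folder) -> List[Path]:
    """List supported media files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and media_kind(p) is not None
    )
```

The check is based on the extension alone, so a mislabeled file would still pass; it is meant as a quick pre-flight filter, not as format validation.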

Row 2: Speech Recognition Engine

  • Speech Recognition: This engine transcribes speech from audio or video into subtitle files. The quality of this step directly impacts the final result. Over a dozen different recognition options are supported.

  • faster-whisper (Local): A local model (requires an online download on first run). It offers a good balance of speed and quality. If you have no special requirements, choose this. It offers various model sizes. The smallest and fastest is tiny, but it has low accuracy. The best quality is large-v3. Models ending with .en or starting with distil- only support English speech.

  • openai-whisper (Local): Similar to the one above, but generally slower. Accuracy might be slightly higher. Again, large-v3 is recommended.

  • qwen-asr (Local): Alibaba's local recognition model. It performs well for Chinese speech. If your original video is in Chinese, you can try this. It also requires an online download on first use.

  • Noise Reduction: If checked, the software will download Alibaba's model from modelscope.cn before recognition to remove noise from the audio. This can improve recognition accuracy.

  • Second Recognition: When dubbing is enabled and single-language subtitles are embedded, checking this option transcribes the dubbed audio again after dubbing is complete. This generates shorter subtitles for embedding, so that subtitles and dubbing stay precisely aligned.

  • Default Segmentation vs LLM Re-segmentation: LLM Re-segmentation sends the transcribed text to an AI large model (like DeepSeek or OpenAI ChatGPT) to correct errors and adjust awkward sentence breaks for smoother results. You need to configure the chosen AI engine in the menu Tools -> Advanced Options -> General Settings. Be aware that LLM re-segmentation can sometimes produce worse results, depending on the AI model's intelligence. It is not recommended when using voice cloning (i.e., the dubbing role is clone). In that case, stick with Default Segmentation.

  • The software also supports other online APIs and local models like ByteDance Volcano (Bitable), OpenAI Speech-to-Text, Gemini Speech-to-Text, and Alibaba Qwen3-ASR.

Click to view all supported Speech Recognition engines

Row 3: Translation Engine

Translation Engine: This engine translates the transcribed original language subtitle file into your target language. The software includes dozens of built-in translation engines.

  • Free Traditional Translation: Google Translate (requires VPN/proxy), Microsoft Translator (no proxy needed, for now), M2M100 (local), DeepLX (requires self-deployment).
  • Paid Traditional Translation: Baidu Translate, Tencent Translate, Alibaba Machine Translation, DeepL.
  • AI Smart Translation: OpenAI ChatGPT, Gemini, DeepSeek, Claude, Zhipu AI, SiliconFlow, 302.AI, etc. You need to provide your own API keys in Menu -> Translation Settings -> Corresponding Engine Settings.
  • Compatible AI / Local Models: Also supports self-deployed local LLMs. Choose the Compatible AI / Local Model engine and enter the API address in Menu -> Translation Settings -> Local LLM Settings.
  • Source Language: The language spoken in the original video. This must be selected correctly. If you are unsure, choose auto.
  • Target Language: The language you want to translate the video/audio into.
  • Translation Glossary: A glossary sent to the AI during translation to ensure specific terms are translated correctly.
  • Send Full Subtitles: When using AI translation engines, checking this option sends the full subtitle content (including line numbers and timestamps) to the AI for better context.
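To make the Send Full Subtitles option more concrete, the sketch below contrasts the two kinds of text that could be handed to an AI engine: full SRT blocks (line number, timestamps, text) versus the text alone. The exact payload the software sends is not documented here, so treat this as an assumption; `srt_timestamp`, `full_srt`, and `text_only` are hypothetical helper names.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def full_srt(cues: List[Cue]) -> str:
    """Full subtitle content: line numbers, timestamps, and text."""
    blocks = [
        f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        for i, (start, end, text) in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


def text_only(cues: List[Cue]) -> str:
    """Only the subtitle text, one line per cue."""
    return "\n".join(text for _, _, text in cues)


cues = [(0.0, 1.25, "Hello"), (1.5, 3.0, "World")]
print(full_srt(cues))
print(text_only(cues))
```

The full form gives the model more context about ordering and timing, at the cost of a longer request and the risk that the model alters a timestamp or line number in its reply.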

Click to view all supported Translation engines

Row 4: Dubbing Engine

Dubbing Engine: The translated subtitle file will be dubbed using the engine selected here.

It supports online TTS APIs (e.g., Qwen-TTS, Edge-TTS, Elevenlabs, Minimaxi) and locally deployed open-source TTS models. Edge-TTS is a free TTS engine that works out-of-the-box. Some engines require configuration via Menu -> TTS Settings -> Corresponding Engine Panel.

  • Dubbing Role: Each engine typically offers multiple voices. First, select the target language, and then you can choose a dubbing role.
  • Preview Dubbing: After selecting a dubbing role, click this button to listen to a sample of the voice.

Selecting clone as the dubbing role means the software will attempt to clone the original speaker's voice from the source video for dubbing.

Click to view all supported Dubbing engines

Row 5: Synchronization & Alignment and Subtitles

The Root Cause of Desynchronization After Translation

When a language is translated into another and dubbed, the dubbing duration will inevitably change due to differences in syllable count and grammatical structure. This naturally leads to desynchronization between subtitles, dubbing, and video frames.

Example: The original speaker has finished speaking, but the dubbing has only played halfway; or the original speaker is still talking, but the dubbing has already ended.

To address this, you can use the settings below to speed up the audio or slow down the video for a degree of adjustment.

Adjustments are primarily made when the dubbing duration is longer than the original segment's duration, to prevent overlapping speech. Cases where the dubbing is shorter are generally not adjusted.

  • Audio Speedup: If a dubbed segment is longer than the original audio, the dubbing will be sped up to match the original duration.
  • Video Slowdown: Similarly, if a dubbed segment is longer, the video playback speed for that segment will be slowed down to match the dubbing duration. (If selected, this process can be time-consuming and generate many intermediate segments, potentially resulting in a file significantly larger than the original.)
  • No Subtitles: Only replaces the audio, without adding any subtitles.
  • Burn Hard Subtitles: Permanently "burns" the subtitles onto the video frames. They cannot be turned off and will always be displayed.
  • Embed Soft Subtitles: Embeds subtitles as a separate track. Playback software can toggle them on/off, but they may not display in web players.
  • (Dual/Bilingual): Each subtitle line consists of two rows: the original language text and the target language text.
  • Network Proxy: Users in mainland China need a proxy to access foreign services such as Google, Gemini, or OpenAI. If you have a VPN or similar service and know its proxy port number, enter the proxy address here in the format http://127.0.0.1:7860, replacing 7860 with your own port.
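The arithmetic behind the Audio Speedup and Video Slowdown options described above can be sketched as follows. This is an illustration under stated assumptions, not the software's implementation: `plan_segment` is a hypothetical name, only one of the two modes is applied at a time, and any upper limit the software may place on the speed change is not modeled.

```python
def plan_segment(original_ms: int, dubbed_ms: int, mode: str) -> dict:
    """Return playback-rate multipliers for one subtitle segment.

    A rate of 1.5 means 1.5x faster; a rate of 0.5 means half speed.
    Segments whose dubbing is not longer than the original are left unchanged.
    """
    if original_ms <= 0 or dubbed_ms <= 0:
        raise ValueError("durations must be positive")
    ratio = max(1.0, dubbed_ms / original_ms)  # > 1 only when dubbing overruns
    if mode == "audio_speedup":
        # Play the dubbing faster so it fits the original time slot.
        return {"audio_rate": ratio, "video_rate": 1.0}
    if mode == "video_slowdown":
        # Play the video slower so it lasts as long as the dubbing.
        return {"audio_rate": 1.0, "video_rate": 1.0 / ratio}
    raise ValueError(f"unknown mode: {mode}")


# A 2-second original segment whose dubbing lasts 3 seconds:
print(plan_segment(2000, 3000, "audio_speedup"))
# {'audio_rate': 1.5, 'video_rate': 1.0}
```

In the same example, Video Slowdown would play the segment at roughly 0.67x speed, which stretches 2 seconds of video to 3 seconds and explains why the output can end up longer and larger than the original.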

Click to learn more about syncing and aligning dubbing, subtitles, and video

Row 6: Start Execution

  • CUDA Acceleration: On Windows and Linux, if you have an NVIDIA GPU with CUDA properly installed, make sure to check this box. It can speed up speech recognition by several times or even tens of times.

If you have multiple NVIDIA GPUs, go to Menu -> Tools -> Advanced Options -> General Settings and check Multi-GPU Mode. The software will attempt to use them in parallel.

Click here for a guide on setting up the CUDA acceleration environment

Once everything is set, click the Start Execution button.

Processing

If you select multiple audio/video files for translation, they are processed concurrently, and the task does not pause partway for manual subtitle editing.

If you select only one video, the software gives you several opportunities to review and correct intermediate results. Click for details.

  • Opportunity 1: After the speech recognition phase, a subtitle editing window appears. You can modify the subtitles here for better accuracy in subsequent steps.

  • Opportunity 2: After the subtitle translation phase, a window for editing subtitles and modifying dubbing roles appears. You can assign different dubbing roles to each speaker or even specify a role for each individual subtitle line.

  • Opportunity 3: After dubbing is complete, you can review and re-dub each subtitle line if needed.

  • Opportunity 4: If you enabled "Second Recognition" and dubbing was performed, a subtitle editing window will appear again after the second recognition, allowing you to fix any typos, etc.

Row 7: Progress Bar

After the task is complete, click on the progress bar area at the bottom to open the output folder. You will find the final MP4 file, along with intermediate files like SRT subtitles and dubbing audio.

Row 8: Set More Parameters

If you need finer control over the process (such as speech rate, volume, maximum characters per subtitle line, noise reduction, or speaker diarization), click Set More Parameters. The expanded panel offers the following options:

  • Identify Speakers (Speaker Diarization): If checked, the software will attempt to distinguish between different speakers after recognition (accuracy is limited). The number indicates the preset number of expected speakers. Specifying it can improve accuracy; the default is no limit. In advanced options, you can switch between speaker models (built-in, Alibaba cam++, pyannote, etc.).

  • Dubbing Speed: Default is 0. A value of 50 means 50% faster, -50 means 50% slower.

  • Volume +: Default is 0. 50 means 50% louder, -50 means 50% quieter.

  • Pitch +: Default is 0. 20 means the pitch is raised by 20 Hz (sharper), -20 means lowered by 20 Hz (deeper).

  • Voice Activity Detection (VAD) Threshold: The minimum probability that an audio segment is considered speech. VAD calculates a probability for each segment. Segments above this threshold are treated as speech; those below are treated as silence or noise. Default is 0.5. A lower value is more sensitive but might mistake noise for speech.

  • Min Duration of Speech Segment (ms): The minimum duration for a speech segment. If you have selected voice cloning (clone), keep this value >= 3000.

  • Max Duration of Speech Segment (s): The maximum duration, in seconds, of a single speech segment. Segments exceeding this length are forcibly split. Default is 6. Do not set it higher than 30.

  • Silence Duration for Segment Splitting (ms): The duration of silence, in milliseconds, required at the end of a speech segment before it is split. Default is 500. Segments are only split at silence gaps longer than this value.

  • Batch Size for Traditional Translation Engines: Number of subtitle lines sent per batch to traditional translation services.

  • Batch Size for AI Translation Engines: Number of subtitle lines sent per batch to AI translation services.

  • Send Full Subtitles: Whether to send the complete subtitle format (including timestamps) when using AI translation engines.

  • Pause After Translation (s): Pause duration after each batch translation request. Used to limit request frequency for API calls.

  • Pause After Dubbing (s): Pause duration after each dubbing request. Used to limit request frequency.

  • Max Characters per Line (CJK): The maximum number of characters per subtitle line for Chinese, Japanese, and Korean languages when embedding subtitles into the video.

  • Max Characters per Line (Non-CJK): The maximum number of characters per subtitle line for languages other than CJK.

  • Modify Hard Subtitle Style: Click to open a dedicated hard subtitle style editor.

  • Separate Vocals and Background: If checked, the software will attempt to separate the background music/accompaniment from the speech. (This is CPU-intensive and relatively slow.)

  • Embed Background Audio: If checked, the separated background audio will be re-embedded into the final video during the merge step.

  • Loop Background Audio: If checked and the background audio is shorter than the final video, the background audio will loop. Otherwise, the remaining duration will be filled with silence.

  • Background Volume: Sets the volume for the re-embedded background audio. Default is 0.8 (i.e., 80% of the original volume).

  • Add Extra Background Audio: You can also select a local audio file to use as a new background track.

  • Restore Punctuation: If checked, the software will attempt to add punctuation (e.g., periods, commas) after recognition.
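The four segmentation parameters above (VAD threshold, minimum and maximum speech duration, and silence duration for splitting) interact in a way that is easier to see in code. The sketch below is a simplified illustration, not the VAD implementation used by the software: the function name `segment_speech`, the per-frame probability input, the 100 ms frame length, and the 1000 ms default for the minimum duration are all assumptions.

```python
from typing import List, Sequence, Tuple


def segment_speech(
    probs: Sequence[float],
    frame_ms: int = 100,          # assumed frame length
    threshold: float = 0.5,       # VAD threshold
    min_speech_ms: int = 1000,    # min duration of a speech segment (assumed default)
    max_speech_s: float = 6.0,    # max duration of a speech segment
    min_silence_ms: int = 500,    # silence needed before splitting
) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments."""
    segments: List[Tuple[int, int]] = []
    start = None       # start of the current segment in ms, or None
    silence_ms = 0     # trailing silence inside the current segment
    max_ms = max_speech_s * 1000
    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= threshold:            # frame counts as speech
            if start is None:
                start = t
            silence_ms = 0
        elif start is not None:       # silence inside an open segment
            silence_ms += frame_ms
        if start is None:
            continue
        end = t + frame_ms
        if silence_ms >= min_silence_ms:
            # Long enough pause: close the segment where speech stopped.
            speech_end = end - silence_ms
            if speech_end - start >= min_speech_ms:
                segments.append((start, speech_end))
            start, silence_ms = None, 0
        elif end - start >= max_ms:
            # Segment too long: force a split even without a pause.
            segments.append((start, end))
            start, silence_ms = None, 0
    if start is not None:             # flush the segment still open at the end
        speech_end = len(probs) * frame_ms - silence_ms
        if speech_end - start >= min_speech_ms:
            segments.append((start, speech_end))
    return segments
```

In this sketch, lowering `threshold` turns more frames into speech, raising `min_silence_ms` merges phrases separated by short pauses into longer segments, and `max_speech_s` caps how long such a merged segment can grow before it is cut.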

Click to read the guide for Menu -> Tools -> Advanced Options parameters