
This is a powerful open-source video translation/speech transcription/speech synthesis software, dedicated to seamlessly converting videos from one language to another with dubbed audio and subtitles.
Core Features Overview
- Fully Automated Video Translation, Audio Translation: Intelligently recognizes and transcribes speech in audio/video, generates source language subtitle files, translates them into target language subtitle files, then performs dubbing, and finally merges the new audio and subtitles with the original video in one seamless process.
- Speech Transcription/Audio-Video to Subtitles: Accurately transcribes human speech from video or audio files into SRT subtitle files with timestamps in batch.
- Speech Synthesis/Text-to-Speech (TTS): Utilizes multiple advanced TTS channels to generate high-quality, natural-sounding voiceovers for your text or SRT subtitle files.
- SRT Subtitle File Translation: Supports batch translation of SRT subtitle files, preserving original timestamps and formats, and offers various bilingual subtitle styles.
How the Software Works
Before starting, it's essential to understand the core workflow:
First, the human speech in the audio or video is transcribed into subtitle files via a [Speech Recognition Channel]. Then, these subtitle files are translated into the specified target language subtitles via a [Translation Channel]. Next, the translated subtitles are used to generate dubbed audio using the selected [Dubbing Channel]. Finally, the subtitles, audio, and original video are embedded and synchronized to complete the video translation process.
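The pipeline can be pictured as four functions composed in sequence. Below is a minimal Python sketch; the function names are hypothetical stand-ins for whichever channels you select in the UI, not the software's actual internals:

```python
# Hypothetical sketch of the four-stage workflow described above.
# Each stub stands in for a configurable channel, not pyVideoTrans code.

def transcribe(video: str, lang: str) -> str:
    """Speech Recognition Channel: returns the source-language .srt path."""
    ...

def translate_srt(srt_path: str, target_lang: str) -> str:
    """Translation Channel: returns the target-language .srt path."""
    ...

def synthesize(srt_path: str) -> str:
    """Dubbing Channel: returns the dubbed audio path."""
    ...

def mux(video: str, audio: str, srt_path: str) -> str:
    """Embed and synchronize subtitles + audio + original video."""
    ...

def translate_video(video: str, source_lang: str, target_lang: str) -> str:
    source_srt = transcribe(video, source_lang)
    target_srt = translate_srt(source_srt, target_lang)
    dubbed = synthesize(target_srt)
    return mux(video, dubbed, target_srt)
```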
- Can Process: Any audio/video containing human speech, regardless of whether it has embedded subtitles.
- Cannot Process: Videos containing only background music and hardcoded subtitles but no human speech. This software also cannot directly extract hardcoded subtitles from the video frames.
Download & Installation
1.1 Windows Users (Pre-packaged Version)
We provide a ready-to-use pre-packaged version for Windows 10/11 users, requiring no complex configuration—just download, extract, and use.
Click here to download the Windows pre-packaged version, extract and use
Extraction Notes
Do NOT double-click sp.exe directly inside the compressed archive; this is guaranteed to cause errors. Incorrect extraction is the most common reason for the software failing to start. Please strictly follow these rules:
- Avoid Administrator Permission Paths: Do NOT extract to system folders like C:/Program Files, C:/Windows, etc.
- Use Simple Paths with English/Numbers: The extraction path must not contain any Chinese characters, spaces, or special symbols, and should not be nested too deep.
- Recommended Practice: Create a new folder named with only English letters or numbers (e.g., D:/videotrans) on a non-system drive like D: or E:, then extract the compressed package into this folder.
Launching the Software
After extraction, navigate into the folder, find the sp.exe file, and double-click to run it. 
The first launch requires loading many modules and may take several dozen seconds; please be patient.
1.2 MacOS / Linux Users (Source Code Deployment)
For MacOS and Linux users, deployment via source code is required.
- Source Code Repository: https://github.com/jianchang512/pyvideotrans
- Detailed Deployment Tutorials:
Software Interface & Core Functions
After launching the software, you will see the main interface as shown below.

Top Function Area: Switch between the main functional modules of the software, such as Translate Video & Audio, Transcribe & Translate Subtitles, Audio/Video to Subtitles, Batch Translate SRT Subtitles, Batch Dubbing for Subtitles, Multi-Role Subtitle Dubbing, Batch Convert Subtitle Formats, Merge Audio/Video/Subtitles, etc.

Top Menu Bar: For global configuration.

Translation Settings: Configure API Keys and related parameters for various translation channels (e.g., OpenAI, Azure, DeepSeek).

TTS Settings: Configure API Keys and related parameters for various dubbing channels (e.g., OpenAI TTS, Azure TTS).

Speech Recognition Settings: Configure API Keys and parameters for speech recognition channels (e.g., OpenAI API, Alibaba ASR).

Tools/Options: Contains various advanced options and auxiliary tools, such as subtitle format adjustment, video merging, voice separation, etc.

Help/About: View software version information, documentation, and community links.

Function: Translate Video & Audio
The default workspace shown when opening the software is the Translate Video & Audio workspace, which is also the core function. We will guide you step-by-step through a complete video/audio translation task.

Step 1: Select the Video to Translate
Supported Video Formats: mp4/mov/avi/mkv/webm/mpeg/ogg/mts/ts
Supported Audio Formats: wav/mp3/m4a/flac/aac

- Select Audio or Video: Click this button to select one or more video/audio files for translation (hold Ctrl to select multiple).
- Folder Checkbox: Check this option to batch process all videos within an entire folder.
- Delete Generated Checkbox: Check this option if you need to reprocess the same video (instead of using cached results).
- Save to..: By default, translated files are saved to the _video_out folder in the original video's directory. Click this button to set a separate output directory for translated videos.
- Save Video Only Checkbox: When checked, only the final MP4 video is retained after processing; intermediate files like subtitles and audio are automatically deleted.
- Shutdown After Completion: Automatically shuts down the computer after all tasks are processed, suitable for large-scale, long-duration tasks.
Step 2: Select Translation Channel, Dubbing Channel, Speech Recognition Channel
Translation Channel

Translation Channel: The translation channel is used to translate the transcribed original language subtitle file into the target language subtitle file. Over a dozen built-in translation channels are available.
Free Traditional Translation: Google Translate (requires proxy), Microsoft Translator (no proxy needed), DeepLX (requires self-deployment)
Paid Traditional Translation: Baidu Translate, Tencent Translate, Alibaba Machine Translation, DeepL
AI Smart Translation: OpenAI ChatGPT, Gemini, DeepSeek, Claude, Zhipu AI, Silicon Flow, 302.AI, etc. These require your own API key (SK), filled in under Menu - Translation Settings - the corresponding channel's settings panel.
Compatible AI/Local Models: Self-deployed local large models are also supported. Select the Compatible AI/Local Model channel and fill in the API address under Menu - Translation Settings - Local Large Model Settings.
Source Language: The language spoken by people in the original video. It must be selected correctly; if unsure, choose auto.
Target Language: The language you want the audio/video translated into.
Glossary: Click to configure a glossary for replacing specific terms during subtitle translation, ensuring accuracy of professional vocabulary.
Send Complete Subtitles: Only effective with AI translation channels. When selected, the complete subtitle file (with formatting) is sent to the AI, yielding better translation quality but requiring a more capable model, such as an online commercial one. With local small models, selecting this may cause format errors, prompt text leaking into the results, or excessive blank lines.
Network Proxy: Services like Google Translate, OpenAI, and Gemini are not directly accessible from mainland China; you must route requests through a proxy. Enter your proxy address and port here (e.g., http://127.0.0.1:10808).
Dubbing Channel

Dubbing Channel: The translated subtitle file will be dubbed using the channel specified here. Supports online dubbing APIs like OpenAI TTS / Alibaba Qwen-TTS / Edge-TTS / Elevenlabs / ByteDance Volcano Voice Synthesis / Azure-TTS / Minimaxi, etc., and also supports locally deployed open-source TTS models like IndexTTS2 / F5-TTS / CosyVoice / ChatterBox / VoxCPM, etc. Among these, Edge-TTS is a free dubbing channel, ready to use out-of-the-box. For channels requiring configuration, fill in the relevant information in Menu -- TTS Settings -- Corresponding Channel Panel.
- Dubbing Role: Each dubbing channel generally offers multiple speakers to choose from. After selecting the Target Language, you can then select a dubbing role.
- Preview Dubbing: After selecting a dubbing role, click to preview the voice effect of the current role.
- Dubbing Speed+/Volume+/Pitch+: Adjust as needed. The values represent percentage increases or decreases from the default.
Speech Recognition Channel

Speech Recognition: Used to transcribe speech in audio or video files into subtitle files. The quality of this step directly determines all subsequent results. Over ten recognition methods are supported (a minimal transcription sketch follows at the end of this step):
- faster-whisper (local): A local model (downloaded online on first run) offering good speed and quality; choose this if you have no special requirements. It comes in over ten sizes. The smallest, fastest, and most resource-efficient model is tiny, but its accuracy is very low and not recommended. The best performing are large-v2/large-v3, which are recommended. Models ending in .en or starting with distil- only support English speech.
- openai-whisper (local): Broadly similar to the above but somewhat slower, with possibly slightly higher accuracy. The large-v2/large-v3 models are likewise recommended.
- Alibaba FunASR (local): Alibaba's local recognition model, with better support for Chinese. Try it if your original video contains Chinese speech. It also downloads its model online on first run.
- Also supported: ByteDance Volcano Subtitle Generation, OpenAI Speech Recognition, Gemini Speech Recognition, Alibaba Qwen3-ASR Speech Recognition, and various other online APIs and local models.
Speech Segmentation: Use the default Overall Recognition unless you have specific needs. To split subtitles into segments of equal duration, choose Equal Segmentation; the segment duration can be set under Menu -- Tools -- Advanced Options -- Equal Segmentation Segment Duration (default 5 s).
LLM Re-segmentation: When checked, uses a large language model to intelligently re-segment the recognized text and optimize punctuation, significantly improving subtitle readability. Only available for the faster-whisper and openai-whisper channels.
Local Re-segmentation: When checked, performs segmentation based on punctuation and duration for the recognized text. Only available for faster-whisper and openai-whisper channels.
Noise Reduction: When checked, uses Alibaba's noise reduction model to process the audio, improving speech recognition accuracy in noisy environments. Limited by the model's performance, noise reduction may not guarantee better recognition results.
If you already have an original language SRT subtitle file locally and want to skip recognition, click Import Original Language Subtitles in the lower right corner.
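For readers curious what the local channel does under the hood, here is a minimal, hedged sketch using the faster-whisper Python package directly (pip install faster-whisper; the input file name speech.wav is an assumption). pyVideoTrans wires all of this up for you; the sketch only illustrates the transcription step:

```python
from faster_whisper import WhisperModel

# Use device="cpu" and compute_type="int8" if you have no NVIDIA GPU.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# language=None triggers auto-detection, like choosing "auto" in the UI.
segments, info = model.transcribe("speech.wav", language=None, vad_filter=True)

def fmt(t: float) -> str:
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    h, rem = divmod(int(t * 1000), 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("speech.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, 1):
        f.write(f"{i}\n{fmt(seg.start)} --> {fmt(seg.end)}\n{seg.text.strip()}\n\n")
```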
Step 3: Set Synchronization & Subtitles

Because speech rates differ across languages, the translated dubbing's duration may not match the original video; the settings here compensate for that. They primarily handle cases where the dubbed audio runs longer than the original, to avoid overlapping sound or the video ending before the audio does; they do not handle cases where the dubbing comes out shorter.
Alignment Control:
- Speed Up Dubbing: If a dubbed segment is longer than the original sound segment, speed up the dubbing to match the original duration.
- Slow Down Video: Similarly, when a dubbed segment is longer than the video segment, slow down that video segment's playback to match the dubbing duration.
Subtitle Embedding:
- Do Not Embed Subtitles: Only replaces the audio; no subtitles are added.
- Embed Hard Subtitles: Permanently "burns" subtitles into the video frames; they cannot be turned off and will display wherever the video is played.
- Embed Soft Subtitles: Encapsulates subtitles as an independent track in the video; players can toggle them on/off, but they cannot be displayed when playing in web pages.
- (Bilingual): Each subtitle entry consists of two rows: the original language subtitle and the target language subtitle.
The difference between hard and soft embedding is illustrated at the ffmpeg level below.
CJK Single Line Character Count: When embedding hard subtitles, Chinese, Japanese, and Korean (CJK) text will be forced to wrap at the specified number of characters to avoid excessive length. Finer control is available in Menu - Tools - Advanced Options - Hard Subtitle Style.
Other Languages: Sets the wrap character count for hard subtitles in languages other than CJK.
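pyVideoTrans drives ffmpeg internally, and the exact flags it uses may differ, so treat this as a sketch with assumed file names (in.mp4, subs.srt):

```python
import subprocess

# Hard subtitles: burned into the frames, cannot be switched off.
subprocess.run(["ffmpeg", "-y", "-i", "in.mp4",
                "-vf", "subtitles=subs.srt", "hard.mp4"], check=True)

# Soft subtitles: a separate track players can toggle (mov_text for MP4).
subprocess.run(["ffmpeg", "-y", "-i", "in.mp4", "-i", "subs.srt",
                "-c", "copy", "-c:s", "mov_text", "soft.mp4"], check=True)
```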
Step 4: Process Background Sound

- Keep Original Background Sound: Check this option, and the software will attempt to separate the original video's voice and background sound, retaining the background sound in the final video. Note: This function significantly increases processing time but greatly improves the final product quality.
- Add Extra Background Audio: The built-in separation above is error-prone and slow. Alternatively, separate the voice and background sound with an external tool, then select that background sound file here to add it as new background music.
- Background Volume: Adjust the volume of the background sound. Less than 1 decreases, greater than 1 increases. Default is 0.8 times the original volume.
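As a rough illustration of what "Background Volume: 0.8" means, here is an ffmpeg filtergraph that mixes a dubbed voice track with a background track at 0.8x volume. File names are assumptions and the software's actual command may differ:

```python
import subprocess

# Mix dubbing.wav (full volume) with background.wav scaled to 0.8x.
subprocess.run([
    "ffmpeg", "-y", "-i", "dubbing.wav", "-i", "background.wav",
    "-filter_complex",
    "[1:a]volume=0.8[bg];[0:a][bg]amix=inputs=2:duration=first[out]",
    "-map", "[out]", "mixed.wav",
], check=True)
```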
Step 5: Start Execution

- CUDA Acceleration: If you have an NVIDIA graphics card and have correctly installed the CUDA environment, be sure to check this option. It can increase speech recognition speed by several times or even dozens of times.
After all settings are complete, click the Start button.
If processing a single audio or video file, the task pauses after subtitle generation and again after subtitle translation, letting you proofread and edit the subtitles in the right-hand text box; click continue to resume once you have confirmed them.
If processing multiple audio/video files at once, the tasks run concurrently and interleaved, without pausing.
Step 6: View Results
After the task is completed, click the progress bar area at the bottom to open the output folder. You will see the final MP4 file and materials generated during the process, such as SRT subtitles and dubbed audio files.

In addition to the core video/audio translation, pyVideoTrans also provides several other powerful independent functions.
Function: Transcribe & Translate Subtitles
Supported Video Formats: mp4/mov/avi/mkv/webm/mpeg/ogg/mts/ts
Supported Audio Formats: wav/mp3/m4a/flac/aac

This function is essentially the first half of the video translation process: it transcribes audio/video to generate SRT subtitle files, translates those into the specified target language, and then stops. Choose this function if you only want to generate subtitles from audio/video.
Function: Audio/Video to Subtitles / Speech Transcription
Supported Video Formats: mp4/mov/avi/mkv/webm/mpeg/ogg/mts/ts
Supported Audio Formats: wav/mp3/m4a/flac/aac
This is a dedicated function panel for transcribing audio/video files into text or subtitles. Sometimes you might not want to translate the video but only batch generate subtitles from audio/video; this function is perfect for that.

Batch transcribe video or audio files into subtitles or TXT. Simply drag in files, set the source language (spoken language) and recognition model, then start. Supports advanced features like LLM Re-segmentation and Noise Reduction.
Function: Batch Translate SRT Subtitles / Subtitle Translation
Supported subtitle formats for translation: srt
If you already have SRT subtitle files, this function can help you quickly translate them into other languages while keeping the timestamps unchanged. Also supports selecting Monolingual Subtitles, Target Language on Top (Bilingual), Target Language on Bottom (Bilingual), and other output formats.
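Conceptually, timestamp-preserving subtitle translation only rewrites the text payload of each entry. Here is a minimal sketch using the third-party srt package (pip install srt); translate_line is a hypothetical placeholder for whichever translation channel you configure:

```python
import srt

def translate_line(text: str, target_lang: str) -> str:
    return text  # placeholder: call your translation channel here

with open("input.srt", encoding="utf-8") as f:
    subs = list(srt.parse(f.read()))

for sub in subs:
    translated = translate_line(sub.content, "en")
    # Bilingual style: original on top, target language below.
    # Timestamps (sub.start / sub.end) are never touched.
    sub.content = f"{sub.content}\n{translated}"

with open("output.srt", "w", encoding="utf-8") as f:
    f.write(srt.compose(subs))
```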

Function: Batch Dubbing for Subtitles / Speech Synthesis
Supported subtitle or text formats for dubbing: srt/txt
If you have many subtitle files or TXT files and want to batch create voiceovers for them, you can choose this function.
Batch synthesize your SRT files or plain text into dubbed audio files (like WAV or MP3) using the selected TTS engine. Supports fine-tuning of speed, volume, and pitch.
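Since Edge-TTS is the free, zero-configuration channel, here is a minimal dubbing sketch using the edge-tts package directly (pip install edge-tts). The voice name is just one example, and this is not the software's internal code:

```python
import asyncio
import edge_tts

async def dub(text: str, voice: str, out_path: str) -> None:
    # rate/volume/pitch accept percentage offsets, e.g. "+10%".
    communicate = edge_tts.Communicate(text, voice, rate="+10%")
    await communicate.save(out_path)

# List available voices with: edge-tts --list-voices
asyncio.run(dub("Hello, this is a test line.", "en-US-AriaNeural", "line1.mp3"))
```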

Function: Multi-Role Subtitle Dubbing / Speech Synthesis
Supported subtitle formats for dubbing: srt
Similar to the Batch Dubbing for Subtitles function, the difference is: This function supports assigning a separate speaker for each subtitle line, enabling multi-role dubbing.

Function: Merge Audio/Video/Subtitles
This is a practical post-production tool. When you have separate Video, Dubbed Audio, and Subtitle files, you can use it to perfectly merge the three into a final video file, supporting custom subtitle styles.
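At the ffmpeg level, the merge amounts to mapping the video stream from one input, the audio from another, and attaching the subtitles as a track. A hedged sketch with assumed file names:

```python
import subprocess

subprocess.run([
    "ffmpeg", "-y", "-i", "video.mp4", "-i", "dubbing.wav", "-i", "subs.srt",
    "-map", "0:v", "-map", "1:a", "-map", "2:s",  # video, audio, subs
    "-c:v", "copy", "-c:a", "aac", "-c:s", "mov_text",
    "merged.mp4",
], check=True)
```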

Function: Batch Convert Subtitle Formats
Can convert subtitles between different formats, e.g., srt/vtt/ass/txt.
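Outside the software, the same conversions can be done with the pysubs2 package (pip install pysubs2), which infers the output format from the file extension; a minimal sketch:

```python
import pysubs2

subs = pysubs2.load("input.srt", encoding="utf-8")
subs.save("output.ass")  # format inferred from the extension
subs.save("output.vtt")
```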

Under Menu -- Tools/Options, more functions are available as needed.

As the workflow above makes clear, the three channels matter most: the Speech Recognition Channel, the Translation Channel, and the Dubbing Channel.
Speech Recognition Channel Introduction
This channel's function is to convert human speech in audio/video into SRT subtitle files. Supports the following 15 speech recognition channels:
- faster-whisper Local Mode
- openai-whisper Local Mode
- Alibaba FunASR Chinese Recognition
- Google Speech Recognition
- ByteDance Volcano Subtitle Generation
- OpenAI Speech Recognition
- Elevenlabs.io Speech Recognition
- Parakeet-tdt Speech Recognition
- STT Speech Recognition API
- Custom Speech Recognition API
- Gemini Large Model Recognition
- Alibaba Bailian Qwen3-ASR
- deepgram.com Speech Recognition
- faster-whisper-xxl.exe Speech Recognition
- 302.AI Speech Recognition
When using faster-whisper and openai-whisper, the following settings can achieve better recognition results. LLM Re-segmentation and Local Re-segmentation are used to assist with subtitle segmentation.
- Use the large-v2/large-v3 models.
- In Menu - Tools - Advanced Options - faster/openai Speech Recognition Adjustment, set: Speech Threshold to 0.5, Min Speech Duration / ms to 0, Max Speech Duration / s to 5, Silence Split ms to 140, and Speech Padding to 0.
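For reference, these recommended values correspond to faster-whisper's VAD options. A hedged sketch of passing them programmatically (the mapping from the advanced-option names to faster-whisper's parameter names is our assumption):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3")
segments, info = model.transcribe(
    "speech.wav",
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.5,                # Speech Threshold
        min_speech_duration_ms=0,     # Min Speech Duration / ms
        max_speech_duration_s=5,      # Max Speech Duration / s
        min_silence_duration_ms=140,  # Silence Split ms
        speech_pad_ms=0,              # Speech Padding
    ),
)
```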
Model Download Always Fails
- Method 1: Download directly from the Hugging Face official website. In the software's "Network Proxy" text box, enter your proxy address and port (format: proxy_address:port), or set a system-level proxy so the entire computer can access the internet. Then create an empty file named huggingface.lock in the same directory as sp.exe. After this setup, the software will prioritize downloading models from the official website, which is faster and more reliable.
- Method 2: Use a dedicated download tool, Click to Download
- Method 3: Manually download the compressed package from GitHub
Translation Channel Introduction
The translation channel translates the original subtitles generated by the Speech Recognition Channel into target language subtitles, e.g., translating Chinese subtitles to English or vice versa. 23 translation channels are supported.
Translation results have blank lines or are missing many lines
Cause Analysis: When using traditional translation channels like Baidu Translate, Tencent Translate, etc., or when using AI translation but Send Complete Subtitles is not selected, the subtitle text is sent line by line to the translation engine, expecting the same number of lines in return. If the translation engine returns a different number of lines than sent, blank lines will appear.
Solution: Avoid local small models (7b, 14b, 32b, etc.). If you must use one, deselect Send Complete Subtitles and reduce the number of subtitles translated per request to 1: open Menu -- Tools -- Advanced Options and change trans_thread to 1. Note that this is slower and discards context, so translation quality suffers.
Better still, use smarter online AI large models, e.g., the gemini/deepseek online APIs.
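The line-count failure mode described above suggests a simple guard: send N lines, require N lines back, and fall back to per-line requests on mismatch. A hypothetical sketch (translate_batch stands in for the configured channel):

```python
def translate_batch(lines: list[str], target_lang: str) -> list[str]:
    return lines  # placeholder: call your translation channel here

def translate_checked(lines: list[str], target_lang: str) -> list[str]:
    out = translate_batch(lines, target_lang)
    if len(out) == len(lines):
        return out
    # Mismatch: the engine merged or dropped lines. Fall back to one
    # line per request (slower, loses context, but keeps alignment).
    return [translate_batch([line], target_lang)[0] for line in lines]
```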
- AI translation output contains prompt text
When using AI translation channels, prompt instructions sometimes leak into the translation output. This is most common with locally deployed small models (e.g., 14b, 32b); the root cause is that the model is too small to strictly follow instructions.
Dubbing Channel Introduction
Used for dubbing line by line based on subtitle files. The supported dubbing channels are those listed under Step 2 above.
How to perform voice cloning?
You can select F5-TTS / index-tts / clone-voice / CosyVoice / GPT-SOVITS / Chatterbox, etc., in the dubbing channels, and select the clone role. This will use the original voice as a reference audio for dubbing, resulting in a voiceover with the original timbre.
Note: The reference audio generally requires 3-10 seconds duration, and should be free of background noise with clear pronunciation; otherwise, the cloning effect will be poor.
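If you need to prepare such a reference clip yourself, ffmpeg can cut one. A hedged sketch extracting an 8-second mono 16 kHz clip starting at 00:00:05 (adjust the offset to a clean, noise-free passage; some models expect other sample rates):

```python
import subprocess

subprocess.run([
    "ffmpeg", "-y", "-ss", "00:00:05", "-t", "8", "-i", "input.mp4",
    "-vn", "-ac", "1", "-ar", "16000", "reference.wav",
], check=True)
```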
Advanced Options Explanation
Menu - Tools - Advanced Options contains more fine-grained controls for personalized adjustments.

General Settings
- Software UI Language: Set the software interface language. Requires a restart after modification.
- Pause Countdown / s: Countdown in seconds when pausing during single-video translation.
- BGM Separation Clip Duration / s: Segment duration used when separating background sound, to prevent freezes with long videos. Default 300 s.
- Set Output Directory: Home directory used to save results such as video separation, subtitle dubbing, and subtitle translation. Defaults to the user home directory.
- LLM Re-segmentation Characters/Words per Batch: Number of characters or words sent per batch during LLM re-segmentation. Larger values yield better segmentation; sending all subtitles at once is best, but this is limited by the model's output token limit, and overly long input may cause failures.
- AI Channel for LLM Re-segmentation: The AI channel used for LLM re-segmentation. Currently supports the openai or deepseek channels.
- Audio Slices per Send for Gemini Speech Recognition: Number of audio slices sent at once when using Gemini for speech recognition. Larger values may yield better results but increase the failure rate.
- Disable Desktop Notifications: Do not show desktop notifications after task completion or failure.
Video Output Control
- Video Transcoding Loss Control: 0 = lowest loss, 51 = highest loss. Default 13.
- Output Video Quality/Compression Rate Control: Balances encoding speed against quality. Options: ultrafast, superfast, veryfast, faster, fast, medium, slow, slower, veryslow — encoding speed from fast to slow, compression ratio from low to high, file size from large to small.
- Custom ffmpeg Command Parameters: Custom ffmpeg parameters, inserted in the second-to-last position, e.g., -bf 7 -b_ref_mode middle.
- Force Soft Encoding of Video: Use libx264/libx265 to encode the video.
- 264 or 265 Video Encoding: Choose libx264 or libx265 encoding. 264 has better compatibility; 265 has a higher compression ratio and clarity.
Hard Subtitle Style
- Hard Subtitle Font Pixel Size: Font pixel size for hard subtitles.
- Hard Subtitle Font Name: Font name for hard subtitles.
- Hard Subtitle Text Color: Font color. Note the 6 hex characters after &H are in BGR order (2 chars Blue / 2 Green / 2 Red), the reverse of the common RGB order; see the helper below.
- Hard Subtitle Text Border/Outline Color: Border/outline color (in outline mode), in the same &H BGR order.
- Hard Subtitle Background Block or Shadow Color: Background color in background block mode, or possibly the shadow color in outline mode; behavior may vary by player.
- Hard Subtitle Position: Position of the subtitles. Default bottom.
- Subtitle Vertical Margin / Left Margin / Right Margin: Subtitle margins.
- Subtitle Shadow Size: Subtitle shadow size.
- Subtitle Outline Thickness: Subtitle outline thickness.
- Outline Mode or Background Block Mode: In outline mode, subtitles have a text outline and shadow but no background block; background block style is the opposite.
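Because the &H fields are BGR rather than RGB, it is easy to get colors backwards. A small helper that converts a familiar RGB triple into the &HBBGGRR form:

```python
def rgb_to_ass_color(r: int, g: int, b: int) -> str:
    """Convert an RGB triple to the &HBBGGRR form used by these fields."""
    return f"&H{b:02X}{g:02X}{r:02X}"

print(rgb_to_ass_color(255, 255, 0))  # yellow -> &H00FFFF
```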
Subtitle Translation Adjustment
- Traditional Translation Subtitles per Send: Number of subtitle lines sent per request for traditional translation.
- AI Translation Subtitles per Send: Number of subtitle lines sent per request for AI translation.
- Translation Error Retry Count: Number of retries on translation error.
- Pause After Translation / s: Pause in seconds after each translation request, used to limit request frequency.
- Send Complete Subtitle Content When Using AI Translation: Whether to send the complete subtitle format when using AI/Google translation.
Dubbing Adjustment
- Simultaneous Dubbing Subtitles Count: Number of subtitle lines dubbed simultaneously.
- Pause After Dubbing / s: Pause in seconds after each dubbing request, used to limit request frequency.
- Keep Dubbing File for Each Subtitle Line: Keep the dubbed audio file for each subtitle line.
- AzureTTS Batch Line Count: Number of lines dubbed at once by AzureTTS.
- ChatTTS Voice Value: ChatTTS voice value.
Subtitle Audio-Video Alignment
- Remove Dubbing End Silence: Whether to remove silence at the end of dubbed audio.
faster/openai Speech Recognition Adjustment
- Enable VAD: Whether to enable VAD during faster-whisper Overall Recognition mode.
- Speech Threshold: Probability threshold for speech. VAD outputs a speech probability for each audio chunk; probabilities above this value count as speech, below as silence or background noise. Default 0.5, suitable for most cases. If there are too many false positives, try raising it to 0.6 or 0.7; if too many speech segments are lost, lower it to 0.3 or 0.4.
- Min Speech Duration / ms: Minimum speech duration in milliseconds. Detected speech segments shorter than this are discarded, removing brief non-speech sounds or noise. If short utterances are being misjudged as noise, increase this value, e.g., to 1000 ms.
- Max Speech Duration / s: Maximum length of a single speech segment in seconds. Segments exceeding this are split, preferably at a silence; if no silence is found, they are split forcibly before this duration to avoid overly long continuous segments. Set according to your needs, e.g., 10 s or 30 s for dialogue or segmented output. 0 means unlimited.
- Silence Split ms: Minimum silence duration in milliseconds. After speech is detected, a segment is only split once the silence lasts longer than this value.
- Speech Padding ms: Padding in milliseconds added before and after each detected speech segment so the cut is not too tight and edge speech is not lost. If cut segments are missing their edges, increase this value, e.g., to 500 or 800 ms; if segments are too long or contain too much irrelevant content, decrease it.
- Google Recognition API Silence Segment / ms: Silence segment length in ms for the Google Recognition API.
- Equal Segmentation Segment Duration / s: Segment duration in seconds for Equal Segmentation mode.
- faster and openai Model List: Comma-separated list of model names for the faster and openai modes.
- CUDA Data Type: CUDA data type for faster mode. int8 = lower resource consumption, faster, lower precision; float32 = higher resource consumption, slower, higher precision; int8_float16 = auto-select per device.
