
Original Voice Cloning and Multi-Character Dubbing

Part 1: Video-Based Voice Cloning

Voice cloning means dubbing a video in the original speaker's own voice. For example, after Chinese speech is translated into English, the dubbed result sounds as if the same person were speaking English instead of Chinese.

In the software, any dubbing channel that offers a clone option in its Dubbing Character list supports voice cloning. Selecting clone tells the software to clone the voice during dubbing.

Cloning principle: The software extracts the subtitles to be dubbed and loops through them. For each subtitle, it uses the subtitle's timing to clip the corresponding audio segment from the original video as reference audio, then sends the reference audio and the subtitle text together to the dubbing channel.
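The loop described above can be sketched in Python. This is only an illustration, not the software's actual implementation. The names Subtitle, parse_srt_time, build_clip_command, and plan_reference_clips are hypothetical. The sketch assumes ffmpeg is installed and that each clip runs from the subtitle's start time to its end time. The request to the dubbing channel is left out because every channel has its own API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Subtitle:
    start_ms: int
    end_ms: int
    text: str


def parse_srt_time(stamp: str) -> int:
    """Convert an SRT timestamp such as '00:01:02,500' to milliseconds."""
    hms, millis = stamp.strip().split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def build_clip_command(video: str, sub: Subtitle, out_wav: str) -> list[str]:
    """Build an ffmpeg command that cuts the subtitle's span out as mono audio."""
    start = sub.start_ms / 1000
    duration = (sub.end_ms - sub.start_ms) / 1000
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",    # seek to the subtitle start
        "-t", f"{duration:.3f}",  # keep only the subtitle's duration
        "-i", video,
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        out_wav,
    ]


def plan_reference_clips(video: str, subtitles: list[Subtitle]):
    """Pair each subtitle text with its reference file and extraction command."""
    jobs = []
    for index, sub in enumerate(subtitles, start=1):
        ref_wav = f"ref_{index:04d}.wav"
        jobs.append((sub.text, ref_wav, build_clip_command(video, sub, ref_wav)))
    return jobs
```

Each command could then be run with subprocess.run(command, check=True), after which the reference file and the subtitle text would be sent to the selected dubbing channel.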

Channels Supporting Voice Cloning

  • OmniVoice (local API): Supports all languages (recommended)
  • Qwen-TTS (built-in locally): Supports over 10 common languages like Chinese, English, Japanese, Korean (recommended)
  • GPT-SoVITS (local API): Supports Chinese, English, Japanese, Korean (recommended)
  • F5-TTS (local API): Supports Chinese and English (recommended)
  • VoxCPM-TTS (local API): Supports over 10 languages (recommended)
  • Chatterbox (built-in locally): Supports over 10 languages (recommended)
  • Index-TTS (local API): Supports Chinese and English (recommended)
  • CosyVoice (local API): Supports over 10 common languages like Chinese, English, Japanese, Korean
  • Spark-TTS (local API): Supports English
  • Dia-TTS (local API): Supports English
  • clone-voice (local API): Supports over 10 languages (no longer maintained, not recommended)

How to Use

Because cloning needs the original video, this feature is only available in the Translate Video and Audio function.

  1. First, select the target language for dubbing from the Target Language dropdown.
  2. Choose a dubbing channel from the Dubbing Channel list. For channels marked with (local API), you must deploy the corresponding service locally on your computer. Refer to the respective documentation for deployment methods. After deployment, enter the API or WebUI address in Software - TTS Settings - Corresponding Channel Settings - URL.
  3. Then, select the clone option from the Dubbing Character dropdown.

Optimal Cloning Configuration

To ensure the best cloning results, it is recommended to follow these settings:

  1. Avoid using LLM Re-segmentation. It re-divides the timeline, which can cause mismatches when reference audio is clipped from the original video.
  2. Keep each subtitle between 3 and 10 seconds long. Reference audio shorter than 3s may produce noise, and reference audio longer than 10s may cause errors in some channels. To constrain subtitle length, open Menu - Tools/Options - Advanced Options - Speech Recognition Parameters, then set Maximum Voice Duration to 6-10 and Minimum Voice Duration (in milliseconds) to 3000-4000. Also select Merge Overly Short Subtitles so the program automatically merges short subtitles with adjacent ones.
  3. Use an AI engine for translation, such as DeepSeek or OpenAI ChatGPT, and select Send Complete Subtitles.
  4. For speech recognition, use Qwen-ASR, Doubao Voice Large Model - Speed Version, Ali Bailian, or a similar channel for Chinese, and Faster-whisper with the large-v3 model for English.
  5. Click Set More Parameters and select Separate Vocals and Background Noise to obtain clean vocals without background noise, thereby improving cloning quality.

If many of your subtitles are shorter than 3s, it is recommended to use the OmniVoice-TTS dubbing channel, which avoids errors with short reference audio.
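To judge whether your subtitles fit the 3-10s range before dubbing, you can count how many fall outside it. The sketch below is a hypothetical helper, not part of the software. It assumes you already have each subtitle's start and end time in milliseconds, for example parsed from an SRT file.

```python
MIN_REF_MS = 3_000   # below this, reference audio may produce noise
MAX_REF_MS = 10_000  # above this, some channels may return errors


def classify_durations(spans):
    """Count (start_ms, end_ms) spans that are too short, ok, or too long."""
    counts = {"too_short": 0, "ok": 0, "too_long": 0}
    for start_ms, end_ms in spans:
        duration = end_ms - start_ms
        if duration < MIN_REF_MS:
            counts["too_short"] += 1
        elif duration > MAX_REF_MS:
            counts["too_long"] += 1
        else:
            counts["ok"] += 1
    return counts
```

If the too_short count is high, adjusting the speech recognition parameters described above or switching channels is likely to help more than editing subtitles by hand.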

Using Reference Audio

Sometimes you may want to use a voice from a local audio file, or even your own voice, instead of cloning the voice in the original video.

  1. First, record or otherwise obtain a 5-10s audio file in WAV format. Make sure it contains a single, clear voice with no background noise and no extra silence at the beginning or end. For example, you can use a tool like CapCut to extract a 10s speech segment from a longer audio or video file as reference audio.
  2. Give the WAV file a short name such as myaudio1.wav and copy it to the software/f5-tts folder. Then open Software Menu - TTS Settings - Set Reference Audio, start a new line in the text box, enter the file name, a # character, and the text spoken in the audio (myaudio1.wav#the text spoken in this audio), and save. For example:
myaudio1.wav#You say all is empty, yet you keep your eyes closed. If you opened them to look at me, I don't believe you would see nothing.

Note: For GPT-SoVITS dubbing, reference audio should be placed in the root directory of the GPT-SoVITS software, not in the f5-tts folder.

  3. After saving, return to the main interface and select myaudio1.wav from the Dubbing Character dropdown to use it.
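Before adding an entry, you can check the file and the line format with a short script. The sketch below uses only Python's standard wave module. The function names are hypothetical, and splitting the line at the first # is an assumption about how the entry is read, not a description of the software's parser.

```python
import wave


def parse_reference_line(line):
    """Split 'name.wav#spoken text' into (filename, text)."""
    filename, sep, text = line.strip().partition("#")
    filename, text = filename.strip(), text.strip()
    if not sep or not filename.lower().endswith(".wav") or not text:
        raise ValueError(f"expected 'name.wav#text', got: {line!r}")
    return filename, text


def wav_duration_seconds(path):
    """Return the length of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(path, low=5.0, high=10.0):
    """Check that the clip falls in the 5-10s range recommended above."""
    return low <= wav_duration_seconds(path) <= high
```

The wave module reads only uncompressed PCM files, so an error from wave.open may mean the file is not a plain WAV even though its name ends in .wav.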

WAV audio files have the .wav extension. If you cannot see file extensions, open any folder and check View - File Name Extensions in the folder's navigation bar. In Windows 11, the path is View - Show - File Name Extensions.


Part 2: Multi-Character Subtitle-Based Dubbing

The Multi-Character Subtitle Dubbing feature is available since v3.74. Click the Multi-Character Subtitle Dubbing button on the left toolbar. In the pop-up window, import the SRT subtitle file to be dubbed, then assign a character to each subtitle to achieve multi-voice dubbing.
