Voice Cloning & Multi-Role Dubbing

This guide explains how to use the original speaker's voice for dubbing (voice cloning), and how to assign different dubbing voices to different characters.

1. Voice Cloning

What is Voice Cloning?

Voice cloning uses the original speaker's voice from the source video to generate dubbing in the target language. For example, a Chinese video can be translated into English while the English dubbing still sounds like the original Chinese speaker.

How It Works

Extract subtitle data for dubbing
Cut the corresponding audio segments from the original video at the subtitle timestamps; these segments serve as the reference audio
Send the reference audio along with translated subtitles to a TTS engine that supports voice cloning
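
The second step above can be sketched in Python as follows. This is only an illustration of cutting reference clips by subtitle timestamps, not the software's actual implementation: the function names are hypothetical, and it assumes the subtitles are in SRT format and that the ffmpeg command-line tool is used for extraction.

```python
import re

# One SRT timing line, e.g. "00:00:01,500 --> 00:00:04,250"
TIME_RANGE = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_time_range(line):
    """Return (start_ms, end_ms) for one SRT timing line."""
    m = TIME_RANGE.search(line)
    if m is None:
        raise ValueError(f"not an SRT timing line: {line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in m.groups())
    start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start, end


def ffmpeg_cut_command(video_path, start_ms, end_ms, out_wav):
    """Build an ffmpeg argument list that extracts one mono WAV clip.

    The list can be passed to subprocess.run(); it is not executed here.
    """
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-ss", f"{start_ms / 1000:.3f}",  # clip start, in seconds
        "-to", f"{end_ms / 1000:.3f}",    # clip end, in seconds
        "-vn",                            # drop the video stream
        "-ac", "1",                       # downmix to mono
        out_wav,
    ]


start, end = parse_time_range("00:00:01,500 --> 00:00:04,250")
print(start, end)  # 1500 4250
print(ffmpeg_cut_command("input.mp4", start, end, "ref_0001.wav"))
```

Because each clip is cut directly from the subtitle timestamps, any later change to those timestamps would make the clip and the subtitle text drift apart.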

Supported Channels

Channel	Type	Languages	Rating
OmniVoice-TTS	Local API	All languages	⭐⭐⭐ Recommended
Qwen-TTS	Local built-in	Chinese, English, Japanese, Korean + 10 more	⭐⭐⭐ Recommended
GPT-SoVITS	Local API	Chinese, English, Japanese, Korean	⭐⭐⭐ Recommended
Confucius-TTS	Local API	14 languages	⭐⭐⭐
F5-TTS	Local API	Chinese, English	⭐⭐⭐ Recommended
Index-TTS	Local API	Chinese, English	⭐⭐⭐ Recommended
VoxCPM-TTS	Local API	10+ languages	⭐⭐⭐ Recommended
ChatterBox	Local built-in	10+ languages	⭐⭐ Recommended
CosyVoice	Local API	Chinese, English, Japanese, Korean + 10 more	⭐⭐
Spark-TTS	Local API	English	⭐⭐
Dia-TTS	Local API	English	⭐⭐

Best Cloning Configuration

For optimal cloning results, configure the following:

Disable LLM re-segmentation: re-segmenting shifts the subtitle timeline, so the reference audio is cut from the wrong spans of the original video
Control subtitle duration:
- Menu → Tools → Advanced Options → Speech Recognition Parameters
- Max speech duration: 6-10 seconds
- Min speech duration: 3000-4000 ms (3-4 seconds)
- Enable "Merge short subtitles to adjacent"
Translation channel: Use DeepSeek or OpenAI with "Send SRT" enabled
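Vocal/BGM separation: Enable "Separate Vocal/BGM" in main settings, so that reference audio is taken from the isolated vocals rather than from speech mixed with background music; this greatly improves cloning quality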
Speech recognition:
- Chinese: Volcengine STT / Qwen-ASR (Local)
- English: faster-whisper (Local) + large-v3 model
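
To illustrate why the duration settings matter, here is a minimal sketch of merging short subtitles into their neighbours so that each reference clip is long enough to clone from. It is an assumption about how such merging could work, not the behaviour of the "Merge short subtitles to adjacent" option itself; the function name and the cue format (start in ms, end in ms, text) are hypothetical, and the default thresholds are taken from the recommended ranges above.

```python
def merge_short_cues(cues, min_ms=3000, max_ms=10000):
    """Merge cues shorter than min_ms into the adjacent cue.

    cues: list of (start_ms, end_ms, text) tuples, sorted by start time.
    A merge is skipped if the combined cue would exceed max_ms.
    """
    merged = []
    for start, end, text in cues:
        if merged:
            p_start, p_end, p_text = merged[-1]
            prev_short = (p_end - p_start) < min_ms
            cur_short = (end - start) < min_ms
            if (prev_short or cur_short) and (end - p_start) <= max_ms:
                # Extend the previous cue to cover the current one.
                merged[-1] = (p_start, end, p_text + " " + text)
                continue
        merged.append((start, end, text))
    return merged


cues = [
    (0, 1000, "Hi."),
    (1200, 2500, "Welcome back."),
    (2600, 9000, "Today we look at voice cloning."),
    (9500, 16000, "Let's get started with the setup."),
]
for cue in merge_short_cues(cues):
    print(cue)
```

In this sketch the original start and end timestamps are preserved, so the merged cue still maps onto a contiguous span of the source audio.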

Using Local Reference Audio

Sometimes you may want to use a local audio file's voice instead of cloning from the original video.

Steps:

Prepare a 5-10 second WAV audio file that meets these requirements:
- Clear speech from a single speaker
- No background noise
- No excessive silence at the beginning or end
Copy the audio to the f5-tts folder in the software directory
Open TTS Settings → Set Reference Audio and fill in:
```
myaudio1.wav#你说四大皆空，却为何紧闭双眼
```
(Format: filename.wav#text spoken in the audio file)
After saving, select myaudio1.wav from the voice role dropdown in the main interface

Note: GPT-SoVITS reference audio must be placed in the GPT-SoVITS software root directory, not in the f5-tts folder.
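
Before filling in the reference audio setting, the entry and the file length can be checked with a short script. The sketch below is not part of the software; the helper names are hypothetical, and it relies only on Python's standard `wave` module, so it works for uncompressed PCM WAV files.

```python
import wave


def parse_reference_entry(entry):
    """Split 'filename.wav#spoken text' into (filename, text)."""
    filename, sep, text = entry.partition("#")
    filename, text = filename.strip(), text.strip()
    if not sep or not filename or not text:
        raise ValueError("expected 'filename.wav#spoken text'")
    return filename, text


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(path, low=5.0, high=10.0):
    """Check the 5-10 second guideline for reference audio."""
    return low <= wav_duration_seconds(path) <= high


filename, text = parse_reference_entry("myaudio1.wav#你说四大皆空，却为何紧闭双眼")
print(filename)  # myaudio1.wav
print(text)
```

Calling `is_recommended_length("myaudio1.wav")` on the prepared file then reports whether it falls inside the 5-10 second range.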

2. Multi-Role Dubbing

Feature Description

Multi-role dubbing allows you to assign different AI dubbing voices to different speakers in a video. For example:

Male characters use a male voice
Female characters use a female voice
Different characters use different voice tones

How to Use

Select a dubbing channel in the main interface
Enable "Identify Speakers" during speech recognition
After translation completes, in the speaker role assignment dialog, select a different dubbing voice for each speaker
Click "Confirm" to continue processing
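
Conceptually, the role assignment in the third step is a lookup from speaker label to dubbing voice. The sketch below only illustrates that idea; the speaker labels, voice names, and cue format are placeholders, not values produced or accepted by the software.

```python
def assign_voices(cues, voice_by_speaker, default_voice):
    """Pair each subtitle line with the voice chosen for its speaker.

    cues: list of dicts with 'speaker' and 'text' keys (hypothetical format).
    Speakers without an explicit choice fall back to default_voice.
    """
    return [
        (voice_by_speaker.get(cue["speaker"], default_voice), cue["text"])
        for cue in cues
    ]


# Placeholder speaker labels and voice names, for illustration only.
voice_by_speaker = {"spk0": "male-voice-1", "spk1": "female-voice-1"}
cues = [
    {"speaker": "spk0", "text": "Where are we going?"},
    {"speaker": "spk1", "text": "To the station."},
    {"speaker": "spk2", "text": "Wait for me!"},
]
for voice, text in assign_voices(cues, voice_by_speaker, "narrator-voice"):
    print(voice, "->", text)
```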
