
ChatTTS has become very popular, but its documentation is vague, especially on how to control tone, intonation, and speaker. After repeated testing and troubleshooting, I have worked out some of it, and I'm recording it here.

UI code open-source address: https://github.com/jianchang512/chattts-ui

Control Symbols Available in Text

Control symbols can be inserted directly into the text to be synthesized. Two types are currently supported: laughter and pauses.

[laugh] represents laughter

[uv_break] represents a pause

Example text:

text="你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"

In actual synthesis, [laugh] will be replaced by laughter, and a pause will be added at [uv_break].
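
Before synthesizing, it can help to check which control symbols a text actually contains and where. The following is a minimal sketch in plain Python, not part of ChatTTS; the helper names find_control_symbols and strip_control_symbols are my own, and the pattern only knows the two symbols described above.

```python
import re

# Only the two in-text control symbols named in this article.
_CONTROL_RE = re.compile(r"\[(?:laugh|uv_break)\]")


def find_control_symbols(text):
    """Return (symbol, character offset) pairs in order of appearance."""
    return [(m.group(0), m.start()) for m in _CONTROL_RE.finditer(text)]


def strip_control_symbols(text):
    """Return the text with every control symbol removed."""
    return _CONTROL_RE.sub("", text)


text = "你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"
symbols = find_control_symbols(text)
print(symbols)                      # symbols with their offsets
print(strip_control_symbols(text))  # the plain sentence without symbols
```

The stripped version is handy for displaying the sentence to a user while the annotated version is sent to synthesis.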

The intensity of laughter and pauses can be adjusted by passing a prompt in the params_refine_text parameter.

laugh_(0-2) Optional values: laugh_0 laugh_1 laugh_2. Higher values presumably make the laughter stronger, but I have not been able to confirm this.

break_(0-7) Optional values: break_0 break_1 break_2 break_3 break_4 break_5 break_6 break_7. Higher values presumably make the pauses more noticeable, but again this is unconfirmed.

Code


chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_2][break_4]'})

However, actual testing shows that the difference between [break_0] and [break_7] is not significant, and the same is true for [laugh_0] to [laugh_2].
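
To avoid typos in the prompt string, it can be assembled by a small helper. This is a sketch under my own assumptions: build_refine_prompt is a hypothetical name, the laugh and break ranges are the ones listed above, and the oral value is only checked for being a non-negative integer because this article does not give its range.

```python
def build_refine_prompt(oral=2, laugh=0, brk=6):
    """Build the prompt string passed in params_refine_text."""
    for name, value in (("oral", oral), ("laugh", laugh), ("brk", brk)):
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    if laugh > 2:
        raise ValueError("laugh must be in 0-2")
    if brk > 7:
        raise ValueError("brk must be in 0-7")
    return f"[oral_{oral}][laugh_{laugh}][break_{brk}]"


params_refine_text = {"prompt": build_refine_prompt(oral=2, laugh=0, brk=6)}
print(params_refine_text)
```

The resulting dictionary can be passed as params_refine_text to chat.infer in the same way as the hand-written strings above.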

Skip the refine text stage

During synthesis, the text first goes through a refine text stage that rewrites it and rearranges the control symbols. For example, the example text above ends up as:

你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , 听 说 今 天 是 个 好 日 子 , 难 道 [uv_break] 嗯 [uv_break] 不 是 吗 [laugh] ? [uv_break]

As you can see, the control symbols no longer match the ones I marked, so the synthesized audio may contain unexpected pauses, noise, or laughter. So how do we force it to synthesize the text exactly as written?

Set the skip_refine_text parameter to True to skip the refine text stage.

chat.infer([text],skip_refine_text=True,params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
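
To see how much the refine text stage changed the markup, the control symbols in the original and refined text can be counted and compared. This is a plain-Python sketch using the two strings from this section; added_by_refine is a hypothetical helper of mine, not a ChatTTS function.

```python
import re
from collections import Counter

# Only the two in-text control symbols named in this article.
_CONTROL_RE = re.compile(r"\[(?:laugh|uv_break)\]")


def added_by_refine(original, refined):
    """Return the control symbols that appear more often after refining."""
    before = Counter(_CONTROL_RE.findall(original))
    after = Counter(_CONTROL_RE.findall(refined))
    # Counter subtraction keeps only the positive differences.
    return dict(after - before)


original = "你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"
refined = (
    "你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , "
    "听 说 今 天 是 个 好 日 子 , 难 道 [uv_break] 嗯 [uv_break] "
    "不 是 吗 [laugh] ? [uv_break]"
)
extra = added_by_refine(original, refined)
print(extra)
```

For the example here, the refined text contains four more [uv_break] symbols than I wrote, which is where the unexpected pauses come from.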

Fixed Speaker Voice

By default, a different voice is picked at random on every call, which is inconvenient, and the documentation does not explain how to choose a voice.

The simplest way to pin the speaker is to set a random seed manually before sampling one. Different seeds produce different voices.

torch.manual_seed(2222)

Then get a random speaker

rand_spk = chat.sample_random_speaker()

Then pass it through the params_infer_code parameter

chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk})

In my tests, seeds 2222, 7869, and 6653 produced male voices, and 3333, 4099, and 5099 produced female voices. Try other seed values to find more voices.
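
The three steps above can be wrapped in one function. This is only a sketch: synthesize_with_fixed_voice and TESTED_SEEDS are names I made up, chat is assumed to be an already loaded ChatTTS chat object like the one used in the calls above, and the seed-to-voice mapping reflects my own tests, so it may differ with other model or torch versions.

```python
# Seeds from my tests; results may vary with model and torch versions.
TESTED_SEEDS = {
    "male": [2222, 7869, 6653],
    "female": [3333, 4099, 5099],
}


def synthesize_with_fixed_voice(chat, text, seed):
    """Seed torch, sample a speaker, and synthesize text with that speaker.

    `chat` is assumed to be an already loaded ChatTTS chat object.
    """
    # Imported here so the helper can be defined without torch installed.
    import torch

    torch.manual_seed(seed)                 # same seed -> same sampled speaker
    spk = chat.sample_random_speaker()
    return chat.infer(
        [text],
        use_decoder=True,
        params_infer_code={"spk_emb": spk},
    )
```

Calling synthesize_with_fixed_voice(chat, text, TESTED_SEEDS["female"][0]) should then give the same voice on every run, as long as the seed is set immediately before the speaker is sampled.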

Speech Rate Control

The speech rate can be controlled by setting prompt in the params_infer_code parameter of chat.infer.

chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk,'prompt':'[speed_5]'})

No valid range is documented for the speed value. The default in the source code is speed_5, but in my tests speed_0 and speed_7 showed no significant difference.
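
If you want to experiment with several speed values, the params_infer_code dictionary can be built by a helper. This is a sketch with a hypothetical function name, infer_code_params; because no range is documented, it only checks that the speed is a non-negative integer and does not enforce an upper bound.

```python
def infer_code_params(spk_emb, speed=5):
    """Build the params_infer_code dict with a speaker and a speed prompt."""
    if not isinstance(speed, int) or speed < 0:
        raise ValueError("speed must be a non-negative integer")
    return {"spk_emb": spk_emb, "prompt": f"[speed_{speed}]"}


# Placeholder speaker value for illustration; in practice pass rand_spk.
for speed in (0, 5, 7):
    print(infer_code_params("rand_spk_placeholder", speed))
```

Each dictionary can be passed as params_infer_code to chat.infer, which makes it easy to synthesize the same text at several speeds and compare the results by ear.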

WebUI Interface and Integration Package

Open source and download address: https://github.com/jianchang512/chatTTS-ui

After decompressing the integration package, double-click app.exe to start it.

To deploy from source, follow the instructions in the repository.

UI Interface Preview