ChatTTS has become very popular, but its documentation is vague, particularly on how to control tone, intonation, and speaker voice. After repeated testing and troubleshooting, I have worked out some of it and am recording my findings here.
Source code for the UI: https://github.com/jianchang512/chattts-ui
Control Symbols Available in Text
Control symbols can be placed anywhere in the text to be synthesized. Two types are currently supported: laughter and pauses.
[laugh] represents laughter
[uv_break] represents a pause
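Example text (Chinese; roughly "Hello [uv_break] friends, I hear today is a good day, isn't [uv_break] it [laugh]?"):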
text="你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"
In actual synthesis, [laugh] is replaced by laughter, and a pause is inserted at each [uv_break].
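Before sending text to the model, it can help to check which control symbols it actually contains. The following is a minimal sketch in plain Python, not part of ChatTTS; the helper names are hypothetical, and it recognises only the two symbols described above.

```python
import re

# Only the two in-text symbols described in this article are recognised.
_SYMBOL_RE = re.compile(r"\[(?:laugh|uv_break)\]")


def find_control_symbols(text):
    """Return (symbol, character offset) pairs in order of appearance."""
    return [(m.group(0), m.start()) for m in _SYMBOL_RE.finditer(text)]


def strip_control_symbols(text):
    """Return the text with the control symbols removed, e.g. for display."""
    return _SYMBOL_RE.sub("", text)


text = "你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"
symbols = [symbol for symbol, _ in find_control_symbols(text)]
plain = strip_control_symbols(text)
```

Here symbols lists the markers in the order they occur, and plain is the same sentence without any markers.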
The intensity of laughter and pauses can be controlled by passing a prompt in the params_refine_text parameter.
laugh_(0-2)
Optional values: laugh_0, laugh_1, laugh_2. A higher number presumably means stronger laughter, but I have not been able to confirm this.
break_(0-7)
Optional values: break_0, break_1, break_2, break_3, break_4, break_5, break_6, break_7. A higher number presumably means a more noticeable pause, but I have not been able to confirm this either.
Code
chat.infer([text], params_refine_text={"prompt": '[oral_2][laugh_0][break_6]'})
chat.infer([text], params_refine_text={"prompt": '[oral_2][laugh_2][break_4]'})
However, actual testing shows that the difference between [break_0] and [break_7] is not significant, and the same is true for [laugh_0] to [laugh_2].
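Writing the prompt string by hand makes it easy to slip outside the listed ranges. Below is a small sketch of a builder that validates the values first. The function name is hypothetical and not part of ChatTTS. The [oral_N] token appears in the example prompts without explanation, so it is passed through unchecked.

```python
def build_refine_prompt(oral=None, laugh=None, brk=None):
    """Build a prompt string such as '[oral_2][laugh_0][break_6]'.

    laugh must be in 0-2 and brk in 0-7, the ranges listed in this article.
    oral is not range-checked because the article gives no range for it.
    """
    if laugh is not None and not 0 <= laugh <= 2:
        raise ValueError("laugh must be between 0 and 2")
    if brk is not None and not 0 <= brk <= 7:
        raise ValueError("break must be between 0 and 7")
    parts = []
    if oral is not None:
        parts.append(f"[oral_{oral}]")
    if laugh is not None:
        parts.append(f"[laugh_{laugh}]")
    if brk is not None:
        parts.append(f"[break_{brk}]")
    return "".join(parts)


params_refine_text = {"prompt": build_refine_prompt(oral=2, laugh=0, brk=6)}

# Assumed usage, where `chat` is an already loaded ChatTTS chat object:
# wavs = chat.infer([text], params_refine_text=params_refine_text)
```

The resulting dictionary has the same shape as the one passed in the calls above.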
Skipping the Refine Text Stage
During synthesis, the input first goes through a "refine text" stage that rearranges the control symbols and adds new ones. For example, the example text above ends up as:
你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , 听 说 今 天 是 个 好 日 子 , 难 道 [uv_break] 嗯 [uv_break] 不 是 吗 [laugh] ? [uv_break]
As you can see, the control symbols no longer match the ones I marked, so the synthesized audio may contain unexpected pauses, noise, or laughter. How do we force synthesis to follow the text exactly as written?
Set the skip_refine_text parameter to True to skip the refine text stage.
chat.infer([text],skip_refine_text=True,params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
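To judge how much the refine text stage changes the markers, one option is to count the control symbols before and after. The sketch below uses the original and refined strings quoted above as literals. The helper name is hypothetical, and nothing here calls ChatTTS.

```python
import re
from collections import Counter

_SYMBOL_RE = re.compile(r"\[(?:laugh|uv_break)\]")


def count_symbols(text):
    """Count each control symbol that occurs in the text."""
    return Counter(_SYMBOL_RE.findall(text))


original = "你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"
refined = (
    "你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , "
    "听 说 今 天 是 个 好 日 子 , 难 道 [uv_break] 嗯 [uv_break] "
    "不 是 吗 [laugh] ? [uv_break]"
)

before = count_symbols(original)
after = count_symbols(refined)
# Counter subtraction keeps only the symbols whose count increased.
added = after - before
```

If added is not empty, the refine stage inserted markers that were not in the input, which is the situation where skip_refine_text=True is worth trying.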
Fixed Speaker Voice
By default, a different voice is picked at random on every call, which is inconvenient, and the documentation does not explain how to select a voice.
To pin the speaker, first set the random seed manually. Different seeds produce different voices.
torch.manual_seed(2222)
Then get a random speaker
rand_spk = chat.sample_random_speaker()
Then pass it in through the params_infer_code parameter
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk})
In my tests, seeds 2222, 7869, and 6653 produce male voices, and seeds 3333, 4099, and 5099 produce female voices. More voices can be found by trying other seed numbers.
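The three steps above can be wrapped into one helper, with the tested seeds kept in a small lookup table. This is only a sketch: the names TESTED_SEEDS, seeds_for, and sample_fixed_speaker are hypothetical, the voice labels are just the listening impressions reported above, and chat is assumed to be an already loaded ChatTTS chat object.

```python
# Seeds reported in this article; the labels are listening impressions only.
TESTED_SEEDS = {
    2222: "male",
    7869: "male",
    6653: "male",
    3333: "female",
    4099: "female",
    5099: "female",
}


def seeds_for(label):
    """Return the tested seeds carrying the given label, in ascending order."""
    return sorted(seed for seed, value in TESTED_SEEDS.items() if value == label)


def sample_fixed_speaker(chat, seed):
    """Seed torch, then sample a speaker, following the steps above."""
    import torch  # imported lazily so the seed table works without torch

    torch.manual_seed(seed)
    return chat.sample_random_speaker()


# Assumed usage, where `chat` is an already loaded ChatTTS chat object:
# spk = sample_fixed_speaker(chat, seeds_for("female")[0])
# wavs = chat.infer([text], use_decoder=True, params_infer_code={"spk_emb": spk})
```

Sampling once and reusing the returned speaker for every call keeps the voice the same across a whole session.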
Speech Rate Control
The speech rate can be controlled by setting prompt in the params_infer_code parameter of chat.infer.
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk,'prompt':'[speed_5]'})
No valid range is documented for the speed value. The default in the source code is speed_5, but in my tests speed_0 and speed_7 showed no significant difference.
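Since the speaker and the speed prompt both travel in params_infer_code, a small builder keeps the two together. This is a sketch with a hypothetical function name. Because no valid range is documented, it rejects only negative or non-integer speeds and enforces no upper bound.

```python
def build_infer_params(spk_emb=None, speed=5):
    """Build a params_infer_code dictionary with a speed prompt.

    speed defaults to 5, the default found in the ChatTTS source code.
    The upper bound is undocumented, so none is enforced here.
    """
    if not isinstance(speed, int) or speed < 0:
        raise ValueError("speed must be a non-negative integer")
    params = {"prompt": f"[speed_{speed}]"}
    if spk_emb is not None:
        params["spk_emb"] = spk_emb
    return params


default_params = build_infer_params()
slow_params = build_infer_params(spk_emb="placeholder-speaker", speed=2)

# Assumed usage, where `chat` is a loaded ChatTTS chat object and `rand_spk`
# comes from chat.sample_random_speaker():
# wavs = chat.infer([text], use_decoder=True,
#                   params_infer_code=build_infer_params(rand_spk, speed=5))
```

The string "placeholder-speaker" only stands in for a real speaker embedding so the example runs without the model.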
WebUI Interface and Integration Package
Source code and downloads: https://github.com/jianchang512/chatTTS-ui
After unzipping the integration package, double-click app.exe.
To deploy from source, follow the instructions in the repository.