F5-TTS-api
This project's source code is available at https://github.com/jianchang512/f5-tts-api
This is the API and WebUI for the F5-TTS project.
F5-TTS is an advanced text-to-speech system that uses deep learning technology to generate realistic, high-quality human voices. With just a 10-second audio sample, you can clone your voice. F5-TTS accurately reproduces speech and imbues it with rich emotional nuances.
Original voice: Daughter Kingdom King
Cloned audio:
Windows Integration Package (Includes F5-TTS model and runtime environment)
Download from 123 Cloud Disk: https://www.123684.com/s/03Sxjv-kKjB3
Huggingface download address: https://huggingface.co/spaces/mortimerme/s4/resolve/main/f5-tts-api-v0.3.7z?download=true
Patch Download (2024-11-27)
After downloading the patch, unzip it to the folder containing api.py to complete the upgrade.
Patch download address: https://github.com/jianchang512/f5-tts-api/releases/download/v0.1/2024-1127-buding.7z
Supported System: Windows 10/11 (Extract after download to use)
How to Use:
Start the API service: Double-click the run-api.bat file. The API address is http://127.0.0.1:5010/api.
The API service must be started to use it in translation software.
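To confirm the service is up before pointing other software at it, a quick TCP check can help. The following is a minimal sketch, assuming the service listens on the default address 127.0.0.1:5010 given above; the function name api_is_listening is made up for illustration and is not part of this project.

```python
import socket


def api_is_listening(host="127.0.0.1", port=5010, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if api_is_listening():
        print("API service is reachable at http://127.0.0.1:5010/api")
    else:
        print("Nothing is listening on port 5010; start run-api.bat first")
```

This only shows that a process is accepting connections on the port. It does not prove that the model has finished loading.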
The integration package defaults to CUDA version 11.8. If you have an NVIDIA graphics card and have configured the CUDA/cuDNN environment, the system will automatically use GPU acceleration. If you want to use a higher version of CUDA, such as 12.4, please follow these steps:
Go to the folder containing api.py, type cmd in the folder address bar, and press Enter. Then, in the terminal that opens, execute the following commands:
.\runtime\python -m pip uninstall -y torch torchaudio
.\runtime\python -m pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
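After reinstalling, it may be worth checking which CUDA build PyTorch reports and whether it can see the GPU. The sketch below uses only standard PyTorch attributes. The file name check_cuda.py and the function name cuda_report are hypothetical; save the file next to api.py and run it with .\runtime\python check_cuda.py.

```python
def cuda_report():
    """Describe the installed PyTorch build and whether it sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this interpreter"
    if torch.cuda.is_available():
        return (f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
                f"GPU: {torch.cuda.get_device_name(0)}")
    # torch.version.cuda is None for CPU-only builds.
    return (f"torch {torch.__version__}, CUDA build: {torch.version.cuda}, "
            "no usable GPU found")


if __name__ == "__main__":
    print(cuda_report())
```

If the report shows a CUDA build but no usable GPU, the likely causes are an outdated NVIDIA driver or a mismatch between the driver and the installed CUDA version.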
The advantage of F5-TTS lies in its efficiency and high-quality voice output. Compared to similar technologies that require longer audio samples, F5-TTS only needs a short audio clip to generate high-fidelity speech and can effectively express emotions, enhancing the listening experience—something many existing technologies struggle to achieve.
Currently, F5-TTS supports English and Chinese.
In summary, F5-TTS is a powerful text-to-speech tool that not only produces high-quality speech but also generates expressive voice. With its convenient voice cloning function, you can easily convert text into realistic, emotional audio. The downside is that the generation speed is a bit slow.
Usage Tips: Proxy/VPN
The model needs to be downloaded from the huggingface.co website. Since this website is inaccessible in China, please set up a system proxy or global proxy in advance, otherwise the model download will fail.
The integration package includes most of the necessary models, but it may check for updates or download other small dependent models, so if the terminal shows an HTTPSConnect error, you still need to set up a system proxy.
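If a system-wide proxy is not convenient, the proxy can also be passed to a single process through the standard HTTP_PROXY and HTTPS_PROXY environment variables, which the requests library honors by default. The following is a sketch only: the address http://127.0.0.1:7890 is a placeholder for your own proxy, with_proxy is a hypothetical helper, and it is an assumption that the service's downloads go through a library that reads these variables.

```python
import os


def with_proxy(proxy_url, base_env=None):
    """Return a copy of the environment with HTTP(S) proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    for name in ("HTTP_PROXY", "HTTPS_PROXY"):
        env[name] = proxy_url
    return env


if __name__ == "__main__":
    # Placeholder address: replace with your own proxy's host and port.
    env = with_proxy("http://127.0.0.1:7890")
    print({name: env[name] for name in ("HTTP_PROXY", "HTTPS_PROXY")})
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the service from Python, so the proxy applies to that process only.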
Using in Video Translation Software
Start the API service. The API service must be started to use it in translation software.
Open the video translation software, find the TTS settings, select F5-TTS, and enter the API address (defaults to http://127.0.0.1:5010).
Enter the reference audio and audio text.
It is recommended to select the f5-tts model for better generation quality.
Quick Test
Skip this step if you don't need to test.
- After downloading and unzipping the integration package, copy the api.py file, rename the copy to test.py, delete all of its content, and paste the following content into test.py.
- Find a 10-second audio file of the voice you want to clone, in WAV format, with clear pronunciation and no noise. Rename it to 1.wav and place it in the same directory as test.py. Fill in the text spoken in 1.wav after "ref_text" in the code below, without line breaks.
- Fill in the text you want to synthesize after "gen_text" in the code below.
- Double-click run-api.bat to start the API service. After it starts successfully, type cmd in the address bar of the folder containing test.py and press Enter. Then enter the command .\runtime\python test.py and wait for it to finish. A ceshi.wav file will be generated in the folder; this is the cloned voice.
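import requests

# Send the reference audio and both texts to the local F5-TTS API service.
with open('./1.wav', 'rb') as audio:
    res = requests.post(
        'http://127.0.0.1:5010/api',
        data={
            "ref_text": 'Enter the text corresponding to 1.wav here',
            "gen_text": '''Enter the text to be generated here.''',
            "model": 'f5-tts',
        },
        files={"audio": audio},
    )

# On failure, print the server's error message and stop.
if res.status_code != 200:
    print(res.text)
    exit()

# Save the returned audio; this is the cloned voice.
with open("ceshi.wav", 'wb') as f:
    f.write(res.content)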
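As a sanity check on the quick test, the lengths of the reference clip and the generated file can be read with Python's standard wave module. This is a sketch under the assumption that both files are plain PCM WAV; the wave module rejects other encodings, in which case the script reports that it could not read the file. The helper name wav_duration_seconds is made up for illustration.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


if __name__ == "__main__":
    # 1.wav is the reference clip, ceshi.wav is the output of test.py.
    for name in ("1.wav", "ceshi.wav"):
        try:
            print(f"{name}: {wav_duration_seconds(name):.1f} s")
        except (FileNotFoundError, wave.Error, EOFError) as err:
            print(f"{name}: could not read ({err})")
```

A reference clip far from the recommended 10 seconds, or an output of zero length, points to a problem before any listening is needed.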