
F5-TTS-api

This project's source code is available at https://github.com/jianchang512/f5-tts-api

This is the API and WebUI for the F5-TTS project.

F5-TTS is an advanced text-to-speech system that uses deep learning technology to generate realistic, high-quality human voices. With just a 10-second audio sample, you can clone your voice. F5-TTS accurately reproduces speech and imbues it with rich emotional nuances.

Original voice: Daughter Kingdom King

Cloned audio:

Windows Integration Package (Includes F5-TTS model and runtime environment)

Download from 123 Cloud Disk: https://www.123684.com/s/03Sxjv-kKjB3

Huggingface download address: https://huggingface.co/spaces/mortimerme/s4/resolve/main/f5-tts-api-v0.3.7z?download=true

Patch Download (2024-11-27)

After downloading the patch, unzip it to the folder containing api.py to complete the upgrade.

Patch download address: https://github.com/jianchang512/f5-tts-api/releases/download/v0.1/2024-1127-buding.7z

Supported systems: Windows 10/11. Extract the archive after downloading and it is ready to use.

How to Use:

Start the API service: Double-click the run-api.bat file. The API address is http://127.0.0.1:5010/api.

The API service must be running before translation software can use it.
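To confirm that the service is up before pointing other software at it, a quick check is to see whether anything accepts connections on port 5010. The following is a minimal sketch using only the Python standard library; the helper name api_is_listening is illustrative and not part of this project, and it only tests that the port is open, not that the F5-TTS endpoint itself is healthy.

```python
import socket


def api_is_listening(host="127.0.0.1", port=5010, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if api_is_listening():
        print("Something is listening on 127.0.0.1:5010")
    else:
        print("Nothing is listening on 127.0.0.1:5010; start run-api.bat first")
```

Save it under any name and run it with .\runtime\python from the folder containing api.py.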

The integration package defaults to CUDA version 11.8. If you have an NVIDIA graphics card and have configured the CUDA/cuDNN environment, the system will automatically use GPU acceleration. If you want to use a higher version of CUDA, such as 12.4, please follow these steps:

Go to the folder containing api.py, type cmd in the folder address bar and press Enter. Then, in the terminal that pops up, execute the following commands:

.\runtime\python -m pip uninstall -y torch torchaudio

.\runtime\python -m pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
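After reinstalling, it can be worth checking which build of torch the bundled interpreter actually picked up and whether it sees a GPU. This is a small sketch, not part of the project; the function name torch_cuda_summary is illustrative. It relies only on torch.__version__, torch.version.cuda, and torch.cuda.is_available().

```python
import importlib.util


def torch_cuda_summary():
    """Describe the installed torch build and whether a CUDA GPU is usable."""
    # Avoid an ImportError if this interpreter has no torch at all.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this interpreter"
    import torch

    # torch.version.cuda is None for CPU-only builds.
    return (
        f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
        f"GPU available: {torch.cuda.is_available()}"
    )


if __name__ == "__main__":
    print(torch_cuda_summary())
```

Run it with .\runtime\python so that it inspects the same environment the API service uses.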

The advantage of F5-TTS lies in its efficiency and high-quality voice output. Compared to similar technologies that require longer audio samples, F5-TTS only needs a short audio clip to generate high-fidelity speech and can effectively express emotions, enhancing the listening experience—something many existing technologies struggle to achieve.

Currently, F5-TTS supports English and Chinese.

In summary, F5-TTS is a powerful text-to-speech tool that not only produces high-quality speech but also generates expressive voice. With its convenient voice cloning function, you can easily convert text into realistic, emotional audio. The downside is that the generation speed is a bit slow.

Usage Tips: Proxy/VPN

The model needs to be downloaded from the huggingface.co website. Since this website is inaccessible in China, please set up a system proxy or global proxy in advance, otherwise the model download will fail.

The integration package includes most of the necessary models, but it may still check for updates or download small dependent models. If the terminal shows an HTTPSConnect error, you need to set up a system proxy.
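If a system-wide proxy is not convenient, one common alternative is to set the standard proxy environment variables for the process that downloads the models. The sketch below assumes a local proxy at http://127.0.0.1:7890, which is a placeholder to be replaced with your own address; the helper name with_proxy is illustrative. Libraries such as requests read these variables by default, but whether every download path in this project honors them is an assumption.

```python
import os


def with_proxy(env, proxy_url):
    """Return a copy of env with the HTTP(S) proxy variables set to proxy_url."""
    updated = dict(env)
    # Both spellings are set because tools differ in which one they read.
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        updated[name] = proxy_url
    return updated


if __name__ == "__main__":
    # Placeholder address: substitute the host and port of your own proxy.
    os.environ.update(with_proxy({}, "http://127.0.0.1:7890"))
    print(os.environ["HTTPS_PROXY"])
```

The variables must be in place before the download starts, so apply them at the top of the script or in the terminal that launches the service.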

Using in Video Translation Software

  1. Start the API service. It must be running before the translation software can use it.

  2. Open the video translation software, find the TTS settings, select F5-TTS, and enter the API address (defaults to http://127.0.0.1:5010).

  3. Enter the reference audio and the text spoken in it.

  4. It is recommended to select the f5-tts model for better generation quality.

Quick Test

Skip this step if you don't need to test.

  1. After downloading and unzipping the integration package, create a file named test.py in the same folder as api.py (for example, copy api.py, rename the copy to test.py, and delete all of its content), then paste the code below into test.py.
  2. Find a 10-second WAV audio file of the voice you want to clone, with clear pronunciation and no noise. Rename it to 1.wav and place it in the same directory as test.py. Enter the text spoken in 1.wav as the value of "ref_text" in the code below, on a single line with no line breaks.
  3. Fill in the text you want to synthesize after "gen_text" in the code below.
  4. Double-click run-api.bat to start the API service. After it starts successfully, type cmd in the address bar of the folder containing test.py and press Enter. Then enter the command .\runtime\python test.py and wait for it to finish. A ceshi.wav file will be generated in the folder; this is the cloned voice.


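import sys

import requests

# The API service (run-api.bat) must already be running at this address.
API_URL = 'http://127.0.0.1:5010/api'

# Send the reference audio and both texts; the file is closed after the upload.
with open('./1.wav', 'rb') as audio:
    res = requests.post(API_URL, data={
        "ref_text": 'Enter the text corresponding to 1.wav here',
        "gen_text": '''Enter the text to be generated here.''',
        "model": 'f5-tts'
    }, files={"audio": audio})

# On any non-200 status, show what the server returned and stop.
if res.status_code != 200:
    print(res.text)
    sys.exit(1)

# Save the response body as the cloned-voice audio file.
with open("ceshi.wav", 'wb') as f:
    f.write(res.content)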
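Step 2 of the Quick Test asks for a roughly 10-second WAV file. As a sketch, the standard-library wave module can report the clip length before you send it. The helper name wav_duration_seconds is illustrative and not part of this project, and wave only reads uncompressed PCM WAV files.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration is the frame count divided by frames per second.
        return wav.getnframes() / wav.getframerate()


if __name__ == "__main__":
    print(f"1.wav is {wav_duration_seconds('1.wav'):.1f} seconds long")
```

Run it from the folder containing 1.wav; a file that is much shorter or longer than 10 seconds may be worth trimming or replacing.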
