
F5-TTS-api

The source code for this project is available at https://github.com/jianchang512/f5-tts-api

This is the API and web UI for the F5-TTS project.

F5-TTS is an advanced text-to-speech system that uses deep learning technology to generate realistic, high-quality human voices. With just a short 10-second audio sample, you can clone your voice. F5-TTS can accurately reproduce your voice and infuse it with rich emotional expression.

(Audio samples on the original page: the original voice, "Queen of the Daughter Country", and the cloned audio.)

Windows Integrated Package (Includes F5-TTS Model and Runtime Environment)

Download from 123Pan: https://www.123684.com/s/03Sxjv-okTJ3

Download from Hugging Face: https://huggingface.co/spaces/mortimerme/s4/resolve/main/f5-tts-api-v0.3.7z?download=true

Supported Systems: Windows 10/11 (Download and extract to use)

How to use:

Start the API Service: Double-click the run-api.bat file. The API address will be http://127.0.0.1:5010/api.

You must start the API service to use it in the translation software.
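Before pointing other tools at the service, you can confirm that something is actually listening on the API port. This is a minimal sketch using only the Python standard library; the host and port match the defaults above, and `api_is_up` is a hypothetical helper name, not part of the project.

```python
import socket


def api_is_up(host="127.0.0.1", port=5010, timeout=2.0):
    """Return True if a TCP listener is reachable on the given host/port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: the API service is not running.
        return False


if __name__ == "__main__":
    print("API reachable:", api_is_up())
```

If this prints `False`, re-run run-api.bat and check the terminal for errors before continuing.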

The integrated package uses CUDA 11.8 by default. If you have an NVIDIA graphics card and have configured the CUDA/cuDNN environment, the system will automatically use GPU acceleration. If you want to use a higher version of CUDA, such as 12.4, follow these steps:

Navigate to the folder containing api.py, enter cmd in the folder address bar, and press Enter. Then, execute the following commands in the terminal that appears:

.\runtime\python -m pip uninstall -y torch torchaudio

.\runtime\python -m pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
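After reinstalling, you can check which torch build ended up active and whether the GPU is visible. A minimal sketch to run with the bundled interpreter (e.g. `.\runtime\python check_torch.py`, where check_torch.py is a hypothetical filename you create yourself):

```python
import importlib.util


def torch_build_info():
    """Return a short description of the installed torch build, if any."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    # A CUDA 12.4 wheel reports a version string like "2.x.x+cu124";
    # cuda.is_available() is True when GPU acceleration will be used.
    return f"torch {torch.__version__}, cuda available: {torch.cuda.is_available()}"


print(torch_build_info())
```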

The advantage of F5-TTS lies in its efficiency and high-quality voice output. Compared to similar technologies that require longer audio samples, F5-TTS can generate high-fidelity speech with very short audio, and it can express emotions well, improving the listening experience. This is something that many existing technologies struggle to achieve.

Currently, F5-TTS supports English and Chinese.

Usage Tips: Proxy/VPN

The model needs to be downloaded from the huggingface.co website. Since this website is not accessible in some regions, set up a system or global proxy in advance; otherwise the model download will fail.

The integrated package already includes most of the required models, but it may still check for updates or download small dependency models. Therefore, if an HTTPSConnection error appears in the terminal, you still need to set up a system proxy.
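If you launch the service from your own script, you can route its downloads through a proxy by setting the standard environment variables before anything connects to huggingface.co. A sketch; the address http://127.0.0.1:10809 is a placeholder for your own proxy port:

```python
import os


def set_proxy(address):
    """Route HTTP/HTTPS traffic (including Hugging Face downloads) via a proxy."""
    os.environ["HTTP_PROXY"] = address
    os.environ["HTTPS_PROXY"] = address


# Placeholder address -- replace with your actual proxy host and port.
set_proxy("http://127.0.0.1:10809")
```

These variables only affect the current process and its children, so set them before starting api.py.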

Using in Video Translation Software

  1. Start the API service; the translation software cannot use F5-TTS until it is running.

  2. Open the video translation software, find the TTS settings, select F5-TTS, and enter the API address (default is http://127.0.0.1:5010).

  3. Enter the reference audio and the corresponding text.

  4. It's recommended to select "f5-tts" for better generation quality.

Using api.py within a Third-Party Integrated Package

  1. Copy the api.py and configs folder to the root directory of the third-party integrated package.
  2. Find the path to the Python executable bundled with the third-party package, for example in the py311 folder. In the package's root directory, enter cmd in the folder address bar and press Enter, then run .\py311\python api.py. If an error such as "module flask not found" appears, first run .\py311\python -m pip install waitress flask.

Using api.py After Source Code Deployment of the Official F5-TTS Project

  1. Copy the api.py and configs folder to the project folder.
  2. Install the required modules: pip install flask waitress
  3. Execute python api.py

API Usage Example

import requests

res = requests.post(
    'http://127.0.0.1:5010/api',
    data={
        "ref_text": 'Enter the text corresponding to 1.wav here',
        "gen_text": '''Enter the text to be generated here.''',
        "model": 'f5-tts',
    },
    files={"audio": open('./1.wav', 'rb')},
)

if res.status_code != 200:
    print(res.text)
    exit()

with open("ceshi.wav", 'wb') as f:
    f.write(res.content)

OpenAI TTS Interface Compatibility

The voice parameter must use three # symbols to separate the reference audio and the text corresponding to the reference audio. For example:

1.wav###你说四大皆空,却为何紧闭双眼,若你睁开眼睛看看我,我不相信你,两眼空空。

This indicates that the reference audio is 1.wav (located in the same directory as api.py), and the text content of 1.wav is "你说四大皆空,却为何紧闭双眼,若你睁开眼睛看看我,我不相信你,两眼空空。".

The returned data is fixed as WAV audio data.
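Assembling the voice value by hand is easy to get wrong, so a small helper can build it from the audio path and its transcript. A sketch; `make_voice` is a hypothetical name, not part of the API:

```python
def make_voice(ref_audio: str, ref_text: str) -> str:
    """Join the reference audio path and its transcript with the ### separator
    expected by the OpenAI-compatible `voice` parameter."""
    return f"{ref_audio}###{ref_text}"
```

For example, `make_voice("1.wav", "hello")` returns `"1.wav###hello"`, which can be passed directly as the `voice` argument below.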

from openai import OpenAI

client = OpenAI(api_key='12314', base_url='http://127.0.0.1:5010/v1')

with client.audio.speech.with_streaming_response.create(
    model='f5-tts',
    voice='1.wav###你说四大皆空,却为何紧闭双眼,若你睁开眼睛看看我,我不相信你,两眼空空。',
    input='你好啊,亲爱的朋友们',
    speed=1.0,
) as response:
    with open('./test.wav', 'wb') as f:
        for chunk in response.iter_bytes():
            f.write(chunk)