Skip to content

SenseVoice is an open-source speech recognition foundation model released by Alibaba. It supports speech recognition in Chinese, Japanese, Korean, and English. Compared to previous models, it features faster recognition speed and higher accuracy.

However, the official release version does not come with timestamp output, which is inconvenient for creating subtitles. Currently, I use other VAD models for pre-segmentation and then use SenseVoice for recognition. I have created this API project and integrated it into video translation software for convenient use.

SenseVoice Official Repository

This API Project: https://github.com/jianchang512/sense-api

Purpose of This Project

  1. Replace the official api.py file to implement SRT subtitle output with timestamps.
  2. Connect to video translation and dubbing software for use.
  3. Includes a Windows integration package. You can start the API by double-clicking run-api.bat or launch the browser interface by double-clicking run-webui.bat.

This api.py ignores emotion recognition processing and only supports recognition of Chinese, Japanese, Korean, and English speech.

First, Deploy the SenseVoice Project

  1. Deploy using the official source code method, which supports deployment on Windows/Linux/MacOSX. For specific tutorials, refer to the SenseVoice project homepage: https://github.com/FunAudioLLM/SenseVoice. After deployment, download the api.py file from this project and overwrite the api.py file included in the official package (If you want to use it in video translation software, you must overwrite it; otherwise, you will not be able to obtain subtitles with timestamps).

  2. Deploy using the Windows integration package, which only supports deployment on Windows 10/11. Download the compressed package from the right side of this page: https://github.com/jianchang512/sense-api/releases. After extracting the package, double-click run-api.bat to use the API, or double-click run-webui.bat to open the web interface.

Using the API

The default API address is http://127.0.0.1:5000/asr

You can open the api.py file to modify it:

HOST='127.0.0.1'
PORT=5000
  1. If you deployed using the official source code, remember to overwrite the api.py file and then execute python api.py.
  2. If you are using the Windows integration package, double-click run-api.bat.
  3. Wait for the terminal to display the text http://127.0.0.1:5000, indicating that the startup was successful and you can use it.

Note: The first time you use it, you will download the model from modelscope, which will take a long time.

Using it in Video Translation and Dubbing Tools

Fill in the API address in the menu -- Speech Recognition Settings - SenseVoice Speech Recognition window's API address field.

Calling the API in Source Code

  • API address: Assuming the default API address is http://127.0.0.1:5000
  • Call method: POST
  • Request parameters:
    • lang: String type, can be one of zh, ja, ko, or en.
    • file: The audio binary data to be recognized, in WAV format.
  • Response returned:
    • Successful recognition returns: {code:0,msg:ok,data:"Complete SRT subtitle format string"}
    • Failed recognition returns: {code:1,msg:"Reason for error"}
    • Other internal errors return: {detail:"Error information"}

Example: To recognize the 10.wav audio file, where the spoken language is Chinese.

python
import requests

res = requests.post(
    f"http://127.0.0.1:5000/asr",
    files={"file": open("c:/users/c1/videos/10s.wav", 'rb')},
    data={"lang": "zh"},
    timeout=7200,
)
print(res.json())

Using WebUI in a Browser

  1. If you deployed the official package using the source code, execute python webui.py. When the terminal displays http://127.0.0.1:7860, enter that address in your browser to use it.
  2. If you are using the Windows integration package, double-click run-webui.bat. After the startup is successful, the browser will open automatically.