Speech-to-Text Tool

Open-source repository: https://github.com/jianchang512/stt

This is a locally running, offline speech-to-text tool based on the openai-whisper open-source model. It recognizes human speech in video or audio files and converts it to text, with output available as JSON, as SRT subtitles with timestamps, or as plain text. It can be self-hosted as a replacement for the OpenAI speech-recognition API or Baidu speech recognition, with accuracy roughly on par with the official OpenAI API.
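
For a sense of what happens under the hood, here is a minimal sketch using the openai-whisper Python package directly. It is not this project's actual code, and the file name and language value are placeholders:

    import whisper

    # Load the built-in base model and transcribe a media file.
    model = whisper.load_model("base")
    result = model.transcribe("example.mp4", language="zh")  # placeholder file/language

    print(result["text"])           # plain-text output
    for seg in result["segments"]:  # timestamped segments behind the SRT/JSON output
        print(seg["start"], seg["end"], seg["text"])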

After deployment or download, double-click start.exe to automatically open the local webpage in your default browser.

Drag in, or click to select, the audio or video file to be recognized, then choose the spoken language, the text output format, and the model to use (the base model is built in). Click "Start Recognition"; when it finishes, the results are shown in the selected format on the current page.

The entire process requires no internet connection and runs completely locally. It can be deployed on an intranet.

The openai-whisper project provides base/small/medium/large/large-v3 models. The base model is built in. From base to large-v3, recognition accuracy improves, but so do the computing resources required. You can download additional models as needed and place them in the models directory.
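
As a sketch of how a downloaded model is picked up (assuming the openai-whisper package and this README's directory layout; "medium" is only an example name):

    import whisper

    # download_root points whisper at the project's models directory, so an
    # already-downloaded medium.pt is loaded instead of being fetched again.
    model = whisper.load_model("medium", download_root="./models")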

Download All Models

Pre-compiled Windows Version Usage

  1. Click here to open the Releases page and download the pre-compiled files.

  2. After downloading, extract it to a location, such as E:/stt

  3. Double-click start.exe, wait for the browser window to open automatically.

  4. Click the upload area on the page and pick the audio or video file you want to recognize in the dialog, or drag the file directly onto the upload area. Then select the spoken language, the text output format, and the model to use, and click "Start Recognition." After a short wait, the results are displayed in the selected format in the text box at the bottom.

  5. If your machine has an NVIDIA GPU and CUDA is configured correctly, CUDA acceleration will be used automatically.
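
openai-whisper typically selects the device with the standard PyTorch check; a minimal sketch you can run (in any Python environment with torch installed) to confirm CUDA is visible:

    import torch

    # Prints True when an NVIDIA GPU with a working CUDA setup is detected;
    # in that case the model can run on "cuda" instead of "cpu".
    print(torch.cuda.is_available())
    device = "cuda" if torch.cuda.is_available() else "cpu"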

Source Code Deployment (Linux/Mac/Windows)

  1. Requires Python 3.9 to 3.11

  2. Create an empty directory, such as E:/stt. Open a cmd window in this directory by typing cmd in the address bar and pressing Enter.

    Use git to pull the source code to the current directory: git clone [email protected]:jianchang512/stt.git .

  3. Create a virtual environment: python -m venv venv

  4. Activate the environment: on Windows, run %cd%/venv/scripts/activate; on Linux and Mac, run source ./venv/bin/activate

  5. Install dependencies: pip install -r requirements.txt. If you encounter version conflict errors, please run pip install -r requirements.txt --no-deps

  6. On Windows, extract ffmpeg.7z and place ffmpeg.exe and ffprobe.exe in the project root directory. On Linux and Mac, download the corresponding version of ffmpeg from the ffmpeg website, extract the ffmpeg and ffprobe binaries, and place them in the project root directory (a quick verification sketch follows this list).

  7. Download the models you need from the model archive. After downloading, place the xx.pt file in the models folder in the project root directory.

  8. Run python start.py and wait for the browser to open the local page automatically.
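
As an optional sanity check for step 6 (my own suggestion, not a project script), you can confirm the ffmpeg binaries in the project root are runnable before starting the app:

    import subprocess

    # Each call prints the tool's version banner and raises CalledProcessError
    # if the binary is missing or not executable. On Windows, "./ffmpeg"
    # resolves to ffmpeg.exe automatically.
    for tool in ("ffmpeg", "ffprobe"):
        subprocess.run([f"./{tool}", "-version"], check=True)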