
Speech-to-Text Tool

Speech-to-Text Tool open-source repository

This is an offline speech-to-text tool that runs entirely locally, built on the openai-whisper open-source model. It recognizes speech in video and audio files and converts it to text, which can be output as JSON, as SRT subtitles with timestamps, or as plain text. It can be self-hosted as a replacement for OpenAI's speech recognition API, Baidu speech recognition, and similar services, with accuracy roughly on par with the official OpenAI API.
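A minimal sketch of how this kind of output can be produced with the openai-whisper Python package (file names and the SRT helper below are illustrative, not the tool's actual code):

    import whisper

    def to_srt_time(seconds):
        # Format seconds as an SRT timestamp: HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    model = whisper.load_model("base")                        # the built-in default model
    result = model.transcribe("example.mp4", language="zh")   # language hint is optional

    print(result["text"])                                     # plain text output
    for i, seg in enumerate(result["segments"], start=1):     # SRT output with timestamps
        print(i)
        print(f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}")
        print(seg["text"].strip())
        print()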

After downloading or deploying, double-click start.exe; it automatically opens the local web page in your default browser.

Drag and drop, or click to select, the audio/video files to recognize, then choose the spoken language, the output text format, and the model to use (the base model is built in). Click Start Recognition; when recognition finishes, the result is shown on the page in the selected format.

The whole process requires no internet connection, runs entirely locally, and can be deployed on an internal network.

The openai-whisper open-source models come in base/small/medium/large/large-v3 sizes. The base model is built in; from base up to large-v3, recognition quality improves, but more computing resources are required. You can download additional models as needed and place them in the models directory.
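For source deployments, a minimal sketch of loading one of these models from the project's models directory, assuming the corresponding .pt file has been placed there (the model name is just an example):

    import whisper

    # Point download_root at the local models directory; if the .pt file is
    # already present there, nothing needs to be downloaded.
    model = whisper.load_model("medium", download_root="models")
    result = model.transcribe("speech.wav")
    print(result["text"])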

All Model Download Addresses

Pre-compiled Win Version Usage / Linux and Mac Source Code Deployment

  1. Click here to open the Releases page to download the pre-compiled files.

  2. After downloading, extract it to a directory, such as E:/stt.

  3. Double-click start.exe and wait for the browser window to open automatically.

  4. Click the upload area on the page and choose the audio or video file you want to recognize in the file dialog, or drag and drop the file onto the upload area. Then select the spoken language, the text output format, and the model to use, and click "Start Recognition Immediately". After a while, the recognition result appears in the text box at the bottom in the selected format.

  5. If the machine has an NVIDIA GPU and the CUDA environment is configured correctly, CUDA acceleration is used automatically (a sketch of this device selection follows below).
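A minimal sketch of this kind of automatic device selection using PyTorch, which openai-whisper is built on (illustrative only, not the tool's actual startup code):

    import torch
    import whisper

    # Use the GPU when an NVIDIA card and a working CUDA runtime are detected,
    # otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model("base", device=device)
    print(f"Whisper will run on: {device}")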

Source Code Deployment (Linux/Mac/Windows)

  1. Requires Python 3.9 to 3.11.

  2. Create an empty directory, such as E:/stt, and open a cmd window in it: type cmd in the folder's address bar and press Enter.

    Use git to pull the source code into the current directory: git clone git@github.com:jianchang512/stt.git .

  3. Create a virtual environment: python -m venv venv

  4. Activate the environment. On Windows the command is %cd%/venv/scripts/activate; on Linux and Mac it is source ./venv/bin/activate

  5. Install dependencies: pip install -r requirements.txt. If a version conflict error occurs, run pip install -r requirements.txt --no-deps instead.

  6. On Windows, unzip ffmpeg.7z and place ffmpeg.exe and ffprobe.exe in the project root directory. On Linux and Mac, download the appropriate build from the official ffmpeg website, then extract the ffmpeg and ffprobe binaries and put them in the project root directory (a sketch of how ffmpeg prepares audio for recognition follows after this list).

  7. Download the model archives as needed, then place the xx.pt file from each archive into the models folder in the project root directory.

  8. Execute python start.py, and wait for the local browser window to open automatically.
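As background on step 6: openai-whisper decodes input files through ffmpeg, and a typical preprocessing step converts video or audio into 16 kHz mono WAV before recognition. A minimal sketch of such a conversion, assuming ffmpeg is on the PATH or in the project root (file names are illustrative):

    import subprocess

    def extract_audio(input_path: str, output_path: str = "audio.wav") -> str:
        # Convert any audio/video file to 16 kHz mono WAV, the format Whisper expects.
        subprocess.run(
            [
                "ffmpeg", "-y",      # overwrite the output file if it exists
                "-i", input_path,    # input video or audio file
                "-vn",               # drop the video stream
                "-ac", "1",          # one audio channel (mono)
                "-ar", "16000",      # 16 kHz sample rate
                output_path,
            ],
            check=True,
        )
        return output_path

    extract_audio("example.mp4")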