> This article introduces a free, web-based real-time speech recognition tool. It transcribes live microphone input as well as audio and video files, with no usage restrictions.

https://stt.pyvideotrans.com

<video controls src="/img/webstt.mp4"></video>

Speech recognition, also known as speech-to-text, uses artificial intelligence to convert speech in audio or video into text. It is widely used for meeting transcription, voice assistants, and subtitle generation.

Currently, there are two main methods of speech recognition:

1. Offline Model-Based Speech Recognition:

This method requires deploying a speech recognition model on a local computer. A popular open-source solution is OpenAI Whisper. Once one of its large models (such as large-v2) has been downloaded, it can be used offline, with no internet connection and no fees.

However, this method requires substantial computing resources (such as a powerful graphics card). Without them, recognition is slow, and switching to a smaller model to regain speed reduces accuracy.
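As a rough illustration of the offline route, the sketch below uses the Python API of the open-source `openai-whisper` package. It assumes the package and ffmpeg are installed locally. The file name `meeting.mp3` and the helper names `format_segments` and `transcribe_offline` are hypothetical and chosen only for this example.

```python
def format_segments(segments):
    """Render Whisper-style segments as '[start -> end] text' lines."""
    lines = []
    for seg in segments:
        text = seg["text"].strip()
        lines.append(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {text}")
    return "\n".join(lines)


def transcribe_offline(audio_path, model_name="large-v2"):
    """Transcribe a local file with Whisper; no network is needed once
    the model weights have been downloaded."""
    # Imported lazily so the formatting helper works without the package.
    import whisper  # assumption: installed via `pip install openai-whisper`

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return format_segments(result["segments"])


if __name__ == "__main__":
    # Hypothetical input file; replace with a real audio or video path.
    print(transcribe_offline("meeting.mp3"))
```

On a machine without a capable GPU, passing a smaller model name such as `"small"` to `transcribe_offline` trades accuracy for speed.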

2. Online API-Based Speech Recognition:

Some companies, such as ByteDance and OpenAI, provide online speech recognition API services.

Users only need to upload audio data to the API to obtain the transcription result.

This method needs no powerful local hardware, is fast, and is highly accurate, but it is a paid service.
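The upload step can be sketched with the Python standard library alone. The example below builds (but does not send) a multipart request for OpenAI's audio transcription endpoint. The model name `whisper-1`, the key `sk-...`, and the helper name `build_transcription_request` are illustrative assumptions, and other providers use different endpoints and fields.

```python
import urllib.request
import uuid

API_URL = "https://api.openai.com/v1/audio/transcriptions"


def build_transcription_request(audio_bytes, filename, api_key, model="whisper-1"):
    """Build a multipart/form-data POST carrying the audio and model name."""
    boundary = uuid.uuid4().hex
    parts = [
        # Plain form field: which model should transcribe the audio.
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="model"\r\n\r\n'
         f'{model}\r\n').encode(),
        # File field header, followed by the raw audio bytes.
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode(),
        audio_bytes,
        f'\r\n--{boundary}--\r\n'.encode(),
    ]
    req = urllib.request.Request(API_URL, data=b"".join(parts), method="POST")
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return req


if __name__ == "__main__":
    # Hypothetical file and key; sending the request incurs API charges.
    with open("meeting.mp3", "rb") as f:
        request = build_transcription_request(f.read(), "meeting.mp3", "sk-...")
    with urllib.request.urlopen(request) as resp:
        print(resp.read().decode())
```

With the default response format, the service is expected to reply with a JSON object whose `text` field holds the transcript.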

Real-time Speech Recognition

The two methods above mainly target existing audio or video files. How, then, do you transcribe an audio stream while it is still being recorded from a microphone? For example, how do you capture speech in a meeting and convert it into text as it is spoken?

Real-time speech recognition is similar in principle to file transcription, but it is more technically challenging. It requires:

  • Real-time Data Stream Processing: Continuously receiving audio data from the microphone.
  • Data Slicing and Recognition: Dividing the continuous audio stream into smaller segments and recognizing them one by one.
  • Result Integration and Error Correction: Integrating the recognition results of each segment and performing error correction to improve the accuracy of the final transcription. This usually requires more complex algorithms to handle pauses and overlaps in speech.
  • Minimum Latency: Minimizing the delay from audio input to text output to ensure real-time performance.
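The slicing and integration steps listed above can be illustrated with a deliberately naive sketch. It assumes the stream is cut into fixed-size windows that overlap slightly, and that duplicated words at the seams are removed by matching the end of the running transcript against the start of the next piece. The function names and the word-matching heuristic are assumptions for illustration, not the method any particular tool uses.

```python
def slice_stream(samples, chunk_size, overlap):
    """Yield fixed-size windows; consecutive windows share `overlap` samples."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    for start in range(0, len(samples), step):
        yield samples[start:start + chunk_size]
        if start + chunk_size >= len(samples):
            break  # the last window already reached the end of the data


def merge_transcripts(pieces, max_overlap_words=5):
    """Join per-chunk transcripts, dropping words repeated at chunk seams."""
    merged = []
    for piece in pieces:
        words = piece.split()
        limit = min(max_overlap_words, len(merged), len(words))
        duplicated = 0
        # Look for the longest run of words that ends the transcript so far
        # and also begins the new piece.
        for n in range(limit, 0, -1):
            if merged[-n:] == words[:n]:
                duplicated = n
                break
        merged.extend(words[duplicated:])
    return " ".join(merged)


if __name__ == "__main__":
    print(list(slice_stream(list(range(10)), chunk_size=4, overlap=1)))
    print(merge_transcripts(["the meeting starts", "starts at nine", "at nine sharp"]))
```

Smaller windows lower the latency but give the recognizer less context per segment, which is one reason real-time transcription tends to be less accurate than file transcription.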

Technical Principles and Usage Introduction

The tool provides two functions:

  • Real-time Microphone Recording and Recognition: Records audio from the microphone and transcribes it in real time.
  • Audio and Video File Speech Recognition: Supports selecting local audio or video files for transcription.

Technical Principles:

  1. Lightweight Speech Recognition Model (Vosk): To run in a browser environment, the tool uses a small Vosk speech recognition model. Its accuracy is lower than that of larger models, but it keeps resource consumption low and runs smoothly in the browser.

  2. Local Audio Processing (ffmpeg.wasm): Uses ffmpeg.wasm to process audio and video files and extract speech within the user's browser, without needing to upload audio data to the server.

  3. Client-Side Model Loading: The speech recognition model is downloaded to the browser and runs in its memory. To avoid browser crashes, only small models can be used, which rules out larger, more accurate ones. Even on a powerful computer, large models are not currently supported because of server bandwidth limitations.
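The tool itself runs in the browser, but the same two steps (extract 16 kHz mono audio with ffmpeg, then feed it to Vosk chunk by chunk) can be sketched with the desktop ffmpeg CLI and Vosk's Python bindings. This is an analogy under stated assumptions, not the tool's actual code: the helper names, file names, and model path are hypothetical.

```python
import json


def ffmpeg_extract_args(src, dst, sample_rate=16000):
    """Arguments that drop the video track and write mono WAV audio."""
    return ["ffmpeg", "-i", src, "-vn", "-ac", "1",
            "-ar", str(sample_rate), "-f", "wav", dst]


def stream_transcribe(recognizer, chunks):
    """Feed audio chunks to a Vosk-style recognizer and collect the text.

    `recognizer` needs AcceptWaveform(), Result() and FinalResult(),
    as provided by vosk.KaldiRecognizer.
    """
    texts = []
    for chunk in chunks:
        # AcceptWaveform returns True once an utterance has been completed.
        if recognizer.AcceptWaveform(chunk):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                texts.append(text)
    # Flush whatever is still buffered after the last chunk.
    final = json.loads(recognizer.FinalResult()).get("text", "")
    if final:
        texts.append(final)
    return " ".join(texts)


if __name__ == "__main__":
    import subprocess
    import wave
    from vosk import Model, KaldiRecognizer  # assumption: `pip install vosk`

    subprocess.run(ffmpeg_extract_args("talk.mp4", "talk.wav"), check=True)
    with wave.open("talk.wav", "rb") as wf:
        rec = KaldiRecognizer(Model("path/to/small-model"), wf.getframerate())
        chunks = iter(lambda: wf.readframes(4000), b"")
        print(stream_transcribe(rec, chunks))
```

Because `stream_transcribe` only relies on three methods, the same loop can serve microphone input: the chunks then come from an audio capture callback instead of a WAV file.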

How to Use

  1. Model Loading: Before use, load the Chinese or English model as needed.
  2. Microphone Recognition: Click the button in the left area to start recording from the microphone. The recognition results appear in the text box in real time.
  3. File Recognition: Select a local audio or video file in the right area. The tool will use ffmpeg.wasm for local processing and speech recognition. The results are displayed in the text box.
  4. Result Download: The transcribed text can be downloaded as a TXT file.

Precautions

  1. Mutually Exclusive Functions: The real-time microphone recognition and file recognition functions cannot be used simultaneously.
  2. Local Processing: The model and audio processing are all performed locally in the user's browser.
  3. Language Support: Currently, only Chinese and English speech recognition are supported.
  4. Performance Limitations: Due to the use of a lightweight model, the recognition accuracy may be lower than that of larger models.

Frequently Asked Questions

  • Q: What should I do if the recognition accuracy is low? A: We use a lightweight model to ensure browser compatibility and speed. If you need higher accuracy, we recommend downloading pyVideoTrans and using the large-v2 model locally.
  • Q: What languages are supported? A: Currently, only Chinese and English are supported.
  • Q: Why is it slow? A: This may be due to network conditions, browser performance, or insufficient computer resources.
  • Q: How large a file can I process? A: Files are processed locally rather than uploaded, so the maximum size depends on the browser's available memory and processing capability.