VAD Parameter Adjustment in Speech Recognition

Subtitles created during the speech recognition phase of video translation are sometimes very long (tens of seconds or even minutes) and sometimes very short (less than 1 second). Both problems can be reduced by adjusting the VAD parameters.

What is VAD

GitHub: https://github.com/snakers4/silero-vad

Silero VAD is an efficient Voice Activity Detection (VAD) tool that identifies whether audio contains speech and separates the speech segments from silence or noise. It can be used together with speech recognition libraries (such as Whisper) to detect and segment speech before or after recognition, which improves the recognition results.

Faster-whisper uses Silero VAD for speech analysis and segmentation. The following five parameters control how speech and silence are detected and where segments are split. Each is explained below with setting suggestions:
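As a rough illustration of how these parameters reach faster-whisper from Python, the sketch below collects them in a dictionary and passes it through the `vad_parameters` argument of `WhisperModel.transcribe`. The helper names `build_vad_parameters` and `transcribe_with_vad` and the "small" model size are assumptions made for this example, and the default values are the ones quoted in this article, which may differ between faster-whisper versions.

```python
# Defaults as quoted in this article; verify them against your installed version.
DEFAULT_VAD_PARAMETERS = {
    "threshold": 0.5,
    "min_speech_duration_ms": 250,
    "max_speech_duration_s": float("inf"),
    "min_silence_duration_ms": 2000,
    "speech_pad_ms": 400,
}


def build_vad_parameters(**overrides):
    """Return a copy of the defaults with the given overrides applied."""
    unknown = set(overrides) - set(DEFAULT_VAD_PARAMETERS)
    if unknown:
        raise ValueError(f"unknown VAD parameters: {sorted(unknown)}")
    return {**DEFAULT_VAD_PARAMETERS, **overrides}


def transcribe_with_vad(audio_path, model_size="small", **overrides):
    """Transcribe audio_path with VAD filtering enabled (hypothetical helper)."""
    # Imported lazily so build_vad_parameters works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(
        audio_path,
        vad_filter=True,  # run Silero VAD before recognition
        vad_parameters=build_vad_parameters(**overrides),
    )
    return [(segment.start, segment.end, segment.text) for segment in segments]
```

For example, `transcribe_with_vad("audio.wav", min_silence_duration_ms=500, max_speech_duration_s=10)` would ask for shorter segments than the defaults produce.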

threshold (Threshold)

Meaning: Represents the probability threshold of speech. Silero VAD outputs the speech probability of each audio segment. Probabilities above this value are considered speech (SPEECH), while probabilities below this value are considered silence or background noise.

Setting Suggestions: The default value is 0.5, which works in most cases. You can adjust it for your audio to separate speech from noise more accurately. If too much noise is being classified as speech, try increasing it to 0.6 or 0.7; if too many speech segments are being lost, decrease it to 0.3 or 0.4.
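The effect of the threshold can be sketched with a few made-up probabilities. This is a simplified illustration only: `classify_frames` is a hypothetical helper, the probability values are invented, and the actual Silero VAD logic is more involved than a single comparison.

```python
def classify_frames(probabilities, threshold=0.5):
    """Mark each audio chunk as speech (True) or non-speech (False)."""
    return [p >= threshold for p in probabilities]


# Invented speech probabilities for six consecutive audio chunks.
probabilities = [0.1, 0.35, 0.55, 0.65, 0.9, 0.45]

for threshold in (0.3, 0.5, 0.7):
    flags = classify_frames(probabilities, threshold)
    print(threshold, sum(flags), "of", len(flags), "chunks counted as speech")
```

Raising the threshold keeps fewer chunks, which suppresses noise but risks dropping quiet speech; lowering it does the opposite.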

min_speech_duration_ms (Minimum Speech Duration, unit: milliseconds)

Meaning: If the length of the detected speech segment is less than this value, the speech segment will be discarded. The purpose is to remove some short non-speech sounds or noise.

Setting Suggestions: The default value is 250 milliseconds, which suits most scenarios. If short noises are being kept as speech, increase this value, for example to 500 milliseconds. If short utterances are being discarded as noise, decrease it.
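A minimal sketch of this filtering step, assuming segments are given as (start, end) pairs in milliseconds. The helper `drop_short_segments` and the sample segments are hypothetical and only illustrate the idea; the real implementation works on audio samples.

```python
def drop_short_segments(segments, min_speech_duration_ms=250):
    """Keep only segments that last at least min_speech_duration_ms."""
    return [
        (start, end)
        for start, end in segments
        if end - start >= min_speech_duration_ms
    ]


# Invented segments lasting 120 ms, 400 ms and 2500 ms.
segments = [(0, 120), (500, 900), (1500, 4000)]

print(drop_short_segments(segments))       # the 120 ms segment is dropped
print(drop_short_segments(segments, 500))  # the 400 ms segment is dropped too
```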

max_speech_duration_s (Maximum Speech Duration, unit: seconds)

Meaning: The maximum length of a single speech segment. If a speech segment exceeds this duration, VAD tries to split it at a silent point longer than 100 milliseconds. If no such silent point is found, it forces a split just before this duration to avoid excessively long continuous segments.

Setting Suggestions: The default is infinity (unlimited). If long speech segments are acceptable, keep the default. If you want to limit segment length, for example for dialogue or to avoid very long subtitles, set it to a value that fits your needs, such as 10 seconds or 30 seconds.
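The splitting rule described above can be sketched as follows. This is a simplified model, not the library's code: `split_long_segment` is a hypothetical helper, times are in seconds, and `pauses_s` is assumed to already contain only the silent points that are long enough to qualify.

```python
def split_long_segment(start_s, end_s, pauses_s, max_speech_duration_s):
    """Split [start_s, end_s] so that no piece exceeds max_speech_duration_s.

    pauses_s: sorted timestamps of qualifying silent points inside the segment.
    """
    pieces = []
    cursor = start_s
    while end_s - cursor > max_speech_duration_s:
        limit = cursor + max_speech_duration_s
        # Prefer the last silent point that still fits within the limit.
        candidates = [p for p in pauses_s if cursor < p <= limit]
        # Without a silent point, force the cut at the limit.
        cut = candidates[-1] if candidates else limit
        pieces.append((cursor, cut))
        cursor = cut
    pieces.append((cursor, end_s))
    return pieces


# A 25-second segment with pauses at 8 s and 17 s, limited to 10 s per piece.
print(split_long_segment(0.0, 25.0, [8.0, 17.0], 10.0))
# The same segment without any usable pause is cut at fixed intervals.
print(split_long_segment(0.0, 25.0, [], 10.0))
```

With the default of infinity the loop never runs and the segment is returned unchanged.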

min_silence_duration_ms (Minimum Silence Duration, unit: milliseconds)

Meaning: How long silence must last after speech before the segment is ended. The speech segment is split only when the silence lasts longer than this value; shorter pauses stay inside the same segment.

Setting Suggestions: The default value is 2000 milliseconds (2 seconds). If you want speech to be split at shorter pauses, producing more and shorter segments, decrease this value, for example to 500 milliseconds. If you want fewer and longer segments, increase it.
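One way to picture this parameter is as a rule for merging neighboring bursts of speech: a pause shorter than the minimum silence does not end the segment. The sketch below assumes bursts are (start, end) pairs in milliseconds; `merge_on_short_silence` and the sample data are hypothetical.

```python
def merge_on_short_silence(bursts, min_silence_duration_ms=2000):
    """Merge consecutive bursts separated by less than min_silence_duration_ms."""
    merged = []
    for start, end in bursts:
        if merged and start - merged[-1][1] < min_silence_duration_ms:
            # The pause is too short to count as a split point: extend the segment.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Invented bursts separated by pauses of 600 ms and 2500 ms.
bursts = [(0, 3000), (3600, 7000), (9500, 12000)]

print(merge_on_short_silence(bursts))       # default: only the 2500 ms pause splits
print(merge_on_short_silence(bursts, 500))  # both pauses split
```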

speech_pad_ms (Speech Padding Time, unit: milliseconds)

Meaning: The padding time added before and after the detected speech segment to avoid the speech segment being cut too tightly, which may cut off some edge speech.

Setting Suggestions: The default value is 400 milliseconds. If you find that the cut speech segments are missing parts, you can increase this value, such as 500 milliseconds or 800 milliseconds. Conversely, if the speech segments are too long or contain too much invalid content, you can reduce this value.
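Padding can be sketched as widening each segment on both sides while keeping it inside the audio and away from its neighbor. In this simplified model, `pad_segments` is a hypothetical helper, times are in milliseconds, and overlapping neighbors are made to meet at the midpoint of the original gap, which is an assumption of the example rather than a statement about the library.

```python
def pad_segments(segments, speech_pad_ms=400, audio_duration_ms=None):
    """Widen each (start, end) segment by speech_pad_ms on both sides."""
    padded = []
    for index, (start, end) in enumerate(segments):
        new_start = max(0, start - speech_pad_ms)
        new_end = end + speech_pad_ms
        if audio_duration_ms is not None:
            new_end = min(audio_duration_ms, new_end)
        if padded and new_start < padded[-1][1]:
            # Padding would overlap the previous segment: meet in the middle
            # of the original gap instead.
            midpoint = (segments[index - 1][1] + start) // 2
            padded[-1] = (padded[-1][0], midpoint)
            new_start = midpoint
        padded.append((new_start, new_end))
    return padded


# Two invented segments in a 3200 ms clip, separated by a 500 ms gap.
segments = [(100, 1000), (1500, 3000)]

print(pad_segments(segments, 400, audio_duration_ms=3200))
print(pad_segments(segments, 100, audio_duration_ms=3200))
```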

The specific settings of these parameters need to be fine-tuned according to the speech dataset and application scenario you are using. Reasonable configuration can significantly improve the performance of VAD.

The above parameters can be changed in Menu--Tools/Options--Advanced Options--faster/openai. Alternatively, in the main interface, select faster-whisper local as the speech recognition option and click the "Speech Recognition" text on the left; the text boxes for editing these parameters are then displayed below.

Summary:

threshold: Can be adjusted to suit the audio; the default value of 0.5 works in most cases.

min_speech_duration_ms and min_silence_duration_ms: Determine the length of the speech segment and the sensitivity of the silence segmentation, fine-tune according to the application scenario.

max_speech_duration_s: Prevents unreasonable growth of long speech segments, usually needs to be set according to the specific application.

speech_pad_ms: Adds a buffer to the speech segment to avoid excessive cutting of the segment. The specific value depends on your audio data and the requirements for speech segmentation.

The cleaner the audio and the less background noise it contains, the better the recognition results will be. Even carefully tuned parameters cannot match the effect of a clean recording.
