Adjusting VAD Parameters in Speech Recognition
During the speech recognition stage of video translation, the generated subtitles are sometimes too long (tens of seconds or even minutes) or too short (under a second). Adjusting the VAD (Voice Activity Detection) parameters can mitigate these problems so that subtitles match the actual speech more closely.
What is VAD
VAD (voice activity detection) identifies the speech portions of an audio stream and separates them from silence or noise. Used together with a speech recognition tool such as Whisper, it detects and splits speech segments before and after recognition, which improves recognition quality.
Starting from version 3.92, the default VAD model is ten-vad. You can manually switch to silero via Menu > Tools > Advanced Options.
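To make the idea concrete, the following is a minimal Python sketch of how a VAD could turn per-frame speech probabilities into speech segments. It is not the ten-vad or silero API. The function segment_speech and its parameters (frame_ms, threshold, min_silence_ms, max_speech_s) are hypothetical names chosen for illustration, loosely mirroring the Voice Threshold, Silence Segmentation Duration, and Maximum Speech Duration settings described in the next section.

```python
from typing import List, Tuple


def segment_speech(
    probs: List[float],
    frame_ms: int = 30,
    threshold: float = 0.5,
    min_silence_ms: int = 300,
    max_speech_s: float = 10.0,
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) speech segments from per-frame probabilities.

    Illustrative only: real VAD models use more elaborate smoothing and padding.
    """
    segments: List[Tuple[int, int]] = []
    start = None          # start time of the currently open segment, if any
    last_voiced_end = 0   # end time of the most recent frame classified as speech

    for i, p in enumerate(probs):
        t0 = i * frame_ms
        t1 = t0 + frame_ms
        if p >= threshold:
            # Frame counts as speech: open a segment if none is open.
            if start is None:
                start = t0
            last_voiced_end = t1
            # Forced split once the segment reaches the maximum duration.
            if t1 - start >= max_speech_s * 1000:
                segments.append((start, t1))
                start = None
        elif start is not None and t1 - last_voiced_end >= min_silence_ms:
            # Enough continuous silence after speech: close the segment.
            segments.append((start, last_voiced_end))
            start = None

    # Close a segment that is still open at the end of the audio.
    if start is not None:
        segments.append((start, last_voiced_end))
    return segments


if __name__ == "__main__":
    # 300 ms of speech, 600 ms of silence, then 150 ms of quieter speech.
    demo = [0.9] * 10 + [0.1] * 20 + [0.8] * 5
    print(segment_speech(demo))
    print(segment_speech(demo, threshold=0.85))
```

In this sketch, raising threshold to 0.85 drops the quieter second burst entirely, which is the trade-off a stricter threshold implies: less noise accepted as speech, but a higher risk of missing soft speech.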
Parameter Details and Adjustment Suggestions

Voice Threshold: The minimum probability for an audio segment to be considered speech. VAD calculates a speech probability for each audio segment. Segments above this threshold are treated as speech, and those below it as silence or noise. Smaller values make detection more sensitive but may misclassify noise as speech.
Maximum Speech Duration (seconds): Limits the length of a single speech segment. A segment that exceeds this duration is forcibly split. Enter a number in seconds.
Minimum Speech Duration (milliseconds): The minimum duration of a speech segment. If a subtitle is shorter than this value, the program attempts to merge it with an adjacent subtitle. Enter a number in milliseconds.
Merge Short Subtitles with Adjacent: Short subtitles are merged only when this option is checked.
Silence Segmentation Duration (milliseconds): The amount of silence required after speech ends before a split is made. Splits occur only at silent stretches longer than this value. Enter a number in milliseconds.
Select VAD: Choose which VAD model to use.
No Speech Threshold: Reducing this value can decrease hallucinations but may cause some text to be missed.
Sampling Temperature: The sampling temperature used during decoding. Lower values make the output more deterministic.
Hotwords: Tells the model which words are likely to appear. Separate multiple words with English (half-width) commas.
Repetition Penalty: Increasing this value helps reduce repetition.
Text Compression Rate: Reducing this value helps reduce repetition.
Pre-segment Audio for Whisper?: Whether to cut the audio into sentence segments before sending it to the Whisper model for recognition. If you use voice-cloning roles, select this option and set the minimum speech duration to 3000 (milliseconds) and the maximum speech duration to 10 (seconds) to improve voice cloning reliability.
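The interaction between Minimum Speech Duration, Merge Short Subtitles with Adjacent, and Maximum Speech Duration can be sketched in Python as follows. This is an assumption about how such a merge might work, not the program's actual implementation: the function merge_short_subtitles, its parameter names, and the rule "merge with the previous subtitle when either one is too short and the result stays within the maximum duration" are all hypothetical.

```python
from typing import List, Tuple

# A subtitle is modeled as (start_ms, end_ms, text); this layout is an assumption.
Subtitle = Tuple[int, int, str]


def merge_short_subtitles(
    subs: List[Subtitle],
    min_speech_ms: int = 1000,
    max_speech_s: float = 10.0,
) -> List[Subtitle]:
    """Merge subtitles shorter than min_speech_ms into a neighbor.

    Hypothetical rule: a subtitle joins the previous one when either of the
    two is too short and the merged span does not exceed max_speech_s.
    """
    max_ms = int(max_speech_s * 1000)
    merged: List[Subtitle] = []
    for start, end, text in subs:
        if merged:
            p_start, p_end, p_text = merged[-1]
            either_short = (
                (end - start) < min_speech_ms or (p_end - p_start) < min_speech_ms
            )
            if either_short and end - p_start <= max_ms:
                # Extend the previous subtitle to cover the current one.
                merged[-1] = (p_start, end, f"{p_text} {text}")
                continue
        merged.append((start, end, text))
    return merged


if __name__ == "__main__":
    subs = [
        (0, 400, "Hi."),
        (500, 3000, "How are you today?"),
        (3200, 3500, "Good."),
        (9000, 12000, "Let's begin."),
    ]
    for sub in merge_short_subtitles(subs):
        print(sub)
```

With the voice-cloning values suggested above (minimum 3000 ms, maximum 10 s), a rule of this kind would tend to produce segments between roughly 3 and 10 seconds long, provided neighboring subtitles are available to merge with.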
