
Difference between Holistic Recognition and Even Segmentation

Holistic Recognition:

This mode gives the best speech recognition results but also consumes the most system resources. If the video is large and the large-v3 model is used, it may crash.

During recognition, the entire audio file is passed to the model. The model uses VAD internally to segment the audio and split it into sentences. By default, a silence of 200ms triggers a split, and the maximum sentence length is 3s. Both values can be configured in Menu -- Tools/Options -- Advanced Options -- VAD area.
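
As a rough illustration, the following is a minimal sketch of holistic recognition with VAD-based splitting, assuming a faster-whisper backend; the backend choice and the "audio.wav" path are assumptions, not taken from this page. The VAD parameters mirror the 200ms / 3s defaults described above.

```python
# Minimal sketch: pass the whole file to the model and let its VAD split it.
# faster-whisper as the backend and "audio.wav" are assumptions for this example.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="auto", compute_type="default")

segments, info = model.transcribe(
    "audio.wav",                 # the entire audio file is passed at once
    vad_filter=True,             # let VAD drop silence and split speech
    vad_parameters=dict(
        min_silence_duration_ms=200,  # 200ms of silence triggers a split
        max_speech_duration_s=3,      # cap each sentence at roughly 3s
    ),
)

for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```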

Even Segmentation:

As the name suggests, this mode cuts the audio file into equal-length segments and sends each segment to the model. OpenAI models always use even segmentation: when an OpenAI model is selected, "Even Segmentation" is applied regardless of whether you chose "Holistic Recognition" or "Pre-segmentation". A hypothetical sketch of this selection rule follows.
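
This is only an illustration of the rule described above; the function and parameter names are invented for this example and do not come from the tool's source code.

```python
# Hypothetical sketch of the mode-selection rule: OpenAI models always
# fall back to even segmentation, whatever mode the user picked.
def effective_split_mode(is_openai_model: bool, chosen_mode: str) -> str:
    if is_openai_model:
        return "even_segmentation"
    return chosen_mode  # "holistic_recognition" or "pre_segmentation"
```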

In even segmentation, each segment is 10s long and the silence split interval is 500ms. Both values can be configured in Menu -- Tools/Options -- Advanced Options -- VAD area.
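
The sketch below shows one way such segmentation could work, cutting the audio into roughly 10s pieces and preferring a cut point inside a detected silence of at least 500ms. The use of pydub, the silence threshold, and the cut-snapping behavior are assumptions; the tool's own implementation may differ.

```python
# Minimal sketch of even segmentation: ~10s chunks, cuts snapped to silence.
# pydub, "audio.wav", and silence_thresh=-40 are assumptions for this example.
from pydub import AudioSegment
from pydub.silence import detect_silence

CHUNK_MS = 10_000        # target segment length: 10s
MIN_SILENCE_MS = 500     # silence split interval: 500ms

audio = AudioSegment.from_file("audio.wav")
silences = detect_silence(audio, min_silence_len=MIN_SILENCE_MS, silence_thresh=-40)

chunks, start = [], 0
while start < len(audio):
    target = start + CHUNK_MS
    # snap the cut to the first silence that begins at or after the 10s mark, if any
    cut = min(next((s for s, e in silences if s >= target), target), len(audio))
    chunks.append(audio[start:cut])
    start = cut

for i, chunk in enumerate(chunks):
    chunk.export(f"segment_{i:03d}.wav", format="wav")  # each piece is sent to the model
```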

Note: with the 10s setting, each subtitle is roughly 10s long, but each dubbing segment is not necessarily 10s: its length is determined by the synthesized pronunciation, and trailing silence is removed.
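
To make the note concrete, here is a minimal sketch of trimming trailing silence from a dubbed clip, which is why a dubbing segment can end up shorter than its 10s subtitle. The use of pydub, the threshold value, and the file name are assumptions.

```python
# Minimal sketch: trim trailing silence from a dubbed clip (assumed example).
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def trim_trailing_silence(clip: AudioSegment, silence_thresh: float = -40.0) -> AudioSegment:
    # detect_leading_silence only looks at the start, so reverse the clip first
    trailing_ms = detect_leading_silence(clip.reverse(), silence_threshold=silence_thresh)
    return clip[: len(clip) - trailing_ms]

dub = AudioSegment.from_file("dub_segment.wav")
print(len(dub), len(trim_trailing_silence(dub)))  # durations in ms, before and after trimming
```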