Why are the generated subtitles inconsistent in length and messy, and how can they be optimized?
During video translation, the subtitles generated automatically in the speech recognition stage are often unsatisfactory: either a subtitle is so long it nearly fills the screen, or it shows only two or three characters and looks fragmented. Why does this happen?
Speech Recognition Segmentation Standards
When speech is converted into subtitle text, the audio is usually segmented at silent intervals. The minimum silence duration is generally set between 200 and 500 milliseconds. Suppose it is set to 250 milliseconds: whenever the program detects a silence lasting at least 250 milliseconds, it treats that point as the end of a sentence and generates one subtitle covering the span from the previous cut point to the current one.
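The mechanism can be sketched in a few lines. This is an illustrative model only, not the program's actual code: audio is treated as a sequence of 10 ms frames with an energy value each, and a run of low-energy frames at least as long as the minimum silence ends the current subtitle.

```python
# Illustrative sketch (not the actual program code): cut a stream of
# 10 ms audio frames into subtitle segments at silent intervals.
FRAME_MS = 10  # each energy value covers 10 ms of audio

def segment_by_silence(frame_energies, min_silence_ms=250, silence_thresh=0.01):
    """Return (start_frame, end_frame) pairs for each spoken segment.

    A run of at least `min_silence_ms` of low-energy frames ends the
    current segment, mimicking how recognizers cut subtitles.
    """
    min_silent_frames = min_silence_ms // FRAME_MS
    segments, seg_start, silent_run = [], None, 0
    for i, e in enumerate(frame_energies):
        if e < silence_thresh:           # silent frame
            silent_run += 1
            if seg_start is not None and silent_run >= min_silent_frames:
                segments.append((seg_start, i - silent_run + 1))
                seg_start = None
        else:                            # voiced frame
            silent_run = 0
            if seg_start is None:
                seg_start = i
    if seg_start is not None:            # flush trailing speech
        segments.append((seg_start, len(frame_energies)))
    return segments

# 500 ms of speech, a 300 ms pause, then 400 ms of speech -> two subtitles
energies = [0.5] * 50 + [0.0] * 30 + [0.5] * 40
print(segment_by_silence(energies, min_silence_ms=250))
# -> [(0, 50), (80, 120)]
```

Because the 300 ms pause exceeds the 250 ms minimum, the stream is cut there; had the pause been shorter, the whole stream would have become one subtitle.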
Factors Affecting Subtitle Effects
- Speaking Speed
If the speaker talks quickly with almost no pauses, or only pauses shorter than 250 milliseconds, the resulting subtitles will be very long, possibly lasting tens of seconds, and will fill the screen when embedded in the video.
- Irregular Pauses
Conversely, if the speech contains unexpected pauses, for example several pauses in the middle of one continuous sentence, the resulting subtitles will be fragmented, sometimes showing only a few words each.
- Background Noise
Background noise or music can also interfere with the judgment of silent intervals, leading to inaccurate recognition.
- Clarity of Pronunciation: This is obvious; if the pronunciation is unclear, even humans find it hard to understand, let alone a recognition model.
How to Address These Issues?
- Reduce Background Noise:
If there is significant background noise, you can separate the human voice from the background sound before recognition, removing interference to improve recognition accuracy.
- Use Larger Speech Recognition Models:
If computer performance allows, use larger models for recognition, such as large-v2 or large-v3-turbo.
- Adjust Silent Segment Duration:
The software defaults to a minimum silent segment of 200 milliseconds. You can adjust this value to suit the specific audio or video: for fast speech, lower it to around 100 milliseconds; for speech with frequent pauses, raise it to 300 or 500 milliseconds. To change it, open Tools/Options in the menu, select Advanced Options, and modify the minimum silent segment value in the faster/openai speech recognition adjustment section.
- Set Maximum Subtitle Duration:
You can set a maximum duration for subtitles; subtitles exceeding this duration will be forcibly segmented. This setting is also in the Advanced Options.
As shown in the figure, subtitles exceeding 10 seconds will be re-segmented.
- Set Maximum Characters Per Subtitle Line:
You can set the maximum number of characters per subtitle line; subtitles exceeding the character limit will automatically wrap or be segmented.
- Enable Re-segmentation Function: With this option enabled, combined with the maximum duration and maximum character settings above, the program will automatically re-segment overlong subtitles.
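The duration and character limits above can be sketched as follows. This is a minimal illustration, not the program's actual code, and the function names are assumptions: a subtitle longer than `max_sec` is cut into equal time slices, and its text is wrapped so no line exceeds `max_chars` characters.

```python
import math
import textwrap

# Illustrative sketch of the two limits (not the program's real code).
def split_duration(start, end, max_sec=10.0):
    """Cut a (start, end) span into equal pieces no longer than max_sec."""
    n = max(1, math.ceil((end - start) / max_sec))
    step = (end - start) / n
    return [(round(start + i * step, 2), round(start + (i + 1) * step, 2))
            for i in range(n)]

def limit_chars(text, max_chars=40):
    """Wrap subtitle text so no line exceeds max_chars characters."""
    return textwrap.wrap(text, width=max_chars)

print(split_duration(0.0, 25.0, max_sec=10.0))
# -> [(0.0, 8.33), (8.33, 16.67), (16.67, 25.0)]
for line in limit_chars("This subtitle text would otherwise stretch across the whole screen", 40):
    print(line)
```

In the real program the two limits work together with the silence data, so cuts fall on pauses and word boundaries rather than at arbitrary equal slices.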
With the silence duration, maximum duration, maximum character count, and re-segmentation options configured, the program first generates subtitles based on silent intervals. When it encounters a subtitle that is too long or has too many characters, it re-segments that subtitle using the nltk natural language processing library, weighing silent interval duration, punctuation marks, and subtitle character count together before choosing the split points.
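The re-segmentation step can be sketched roughly as follows. The actual program uses nltk and also weighs silence durations; this simplified sketch substitutes a regex punctuation splitter for the tokenizer and considers only punctuation and character count.

```python
import re

# Simplified sketch of re-segmentation: split an overlong subtitle at
# punctuation, then pack the pieces back together without exceeding a
# character limit. (The real program uses the nltk library and silence
# data; this regex splitter is a stand-in for illustration only.)
def resegment(text, max_chars=40):
    pieces = re.split(r'(?<=[.!?,])\s+', text.strip())
    lines, current = [], ""
    for p in pieces:
        if current and len(current) + 1 + len(p) > max_chars:
            lines.append(current)
            current = p
        else:
            current = f"{current} {p}".strip() if current else p
    if current:
        lines.append(current)
    return lines

long_sub = "He paused, looked around the room. Nobody moved. Then he spoke again."
for line in resegment(long_sub, max_chars=40):
    print(line)
# He paused, looked around the room.
# Nobody moved. Then he spoke again.
```

Splitting at punctuation first keeps each resulting subtitle a readable phrase instead of cutting mid-word at a fixed character position.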