Three-Step Reflection Method for SRT Subtitle Translation

This tool is packaged as a Windows executable. Download and unzip the archive, then double-click app.exe to run it. The rest of this article explains how it works and how to use it.

Download address: https://github.com/jianchang512/ai2srt/releases/download/v0.2/windows-ai2srt-0.2.7z

Andrew Ng's "Reflective Three-Step Translation Method" is very effective. It improves translation quality by allowing the model to self-examine the translation results and suggest improvements. However, applying this method directly to SRT format subtitle translation presents some challenges.

Special Requirements of SRT Subtitle Format

The SRT format has strict formatting requirements:

  • First line: Subtitle sequence number (the "line number")
  • Second line: Two timestamps connected by -->, in the format hours:minutes:seconds,milliseconds
  • Third line and beyond: Subtitle text content

A blank line separates consecutive subtitles.

Example:

1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-star system,

2
00:00:04,720 --> 00:00:06,780
We are still multiple universes away from third-type contact.

3
00:00:07,260 --> 00:00:09,880
Weibo has been carrying out filming missions for years now,

4
00:00:10,140 --> 00:00:12,920
Many previously difficult-to-capture photos have been transmitted recently.
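The structure described above can be checked mechanically. The following is a minimal parsing sketch, assuming blocks are separated by blank lines and timestamps use the hours:minutes:seconds,milliseconds form; the names `Cue` and `parse_srt` are made up for illustration and are not taken from the tool.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int   # subtitle sequence number
    start: str   # start timestamp, e.g. "00:00:01,950"
    end: str     # end timestamp
    text: str    # one or more lines of subtitle text


# One timing line: two timestamps joined by " --> ".
TIMING = re.compile(r"^(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})$")


def parse_srt(content: str) -> List[Cue]:
    """Split SRT text into cues; raise ValueError on a malformed block."""
    cues = []
    # Normalize line endings, then split on one or more blank lines.
    blocks = re.split(r"\n\s*\n", content.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        if len(lines) < 3 or not lines[0].strip().isdigit():
            raise ValueError(f"malformed block: {block!r}")
        match = TIMING.match(lines[1].strip())
        if match is None:
            raise ValueError(f"bad timestamp line: {lines[1]!r}")
        cues.append(
            Cue(int(lines[0]), match.group(1), match.group(2), "\n".join(lines[2:]))
        )
    return cues
```

Parsing strictly and failing loudly is a deliberate choice here: a translated file that does not parse is easier to retry than one that silently loses a subtitle.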

Common Problems in SRT Translation

When using AI to translate SRT subtitles, the following problems may occur:

  • Format errors:
    • Missing line numbers or duplicate timestamps
    • Converting the ASCII punctuation in timestamps (such as : and ,) into full-width Chinese punctuation
    • Merging the text of two adjacent subtitles into one line, especially when the previous and following sentences form a complete sentence grammatically.
  • Translation quality issues:
    • Even with strict constraints in the prompt, translation errors still occur frequently.

Examples of Common Errors:

  • Subtitle text merging resulting in blank lines

  • Format confusion

  • Line numbers translated

  • Inconsistent number of original and translated subtitles

As mentioned above, when the previous and following subtitles are grammatically part of one sentence, they may be translated into a single subtitle, resulting in a missing subtitle.

Format errors block any downstream step that depends on the SRT file. The kinds of errors and how often they occur vary by model. Generally, the more capable the model, the more likely it is to return valid, well-formed output, while small locally deployed models are almost unusable for this task.
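A check along the following lines can catch the format errors listed above before the translated file is passed on. This is a sketch under the assumption that a correct translation keeps every subtitle number and timing line identical to the source; the function names are hypothetical and not part of the tool.

```python
import re
from typing import List, Tuple


def cue_parts(srt_text: str) -> List[Tuple[str, str, str]]:
    """Return (number, timing line, text) for each blank-line-separated block."""
    parts = []
    for block in re.split(r"\n\s*\n", srt_text.replace("\r\n", "\n").strip()):
        lines = [line.strip() for line in block.split("\n")]
        number = lines[0]
        timing = lines[1] if len(lines) > 1 else ""
        text = " ".join(lines[2:]).strip()
        parts.append((number, timing, text))
    return parts


def find_format_errors(original: str, translated: str) -> List[str]:
    """Compare a translated SRT against its source and describe any mismatch."""
    errors = []
    src, dst = cue_parts(original), cue_parts(translated)
    if len(src) != len(dst):
        errors.append(
            f"cue count differs: {len(src)} original vs {len(dst)} translated"
        )
    # zip stops at the shorter list, so merged cues surface via the count check.
    for (s_num, s_time, _), (d_num, d_time, d_text) in zip(src, dst):
        if s_num != d_num:
            errors.append(f"cue {s_num}: number changed to {d_num!r}")
        if s_time != d_time:
            errors.append(f"cue {s_num}: timing line changed to {d_time!r}")
        if not d_text:
            errors.append(f"cue {s_num}: empty subtitle text")
    return errors
```

An empty list means the translated file has the same skeleton as the source; any entry is a reason to retry the request or fall back to the source text for that batch.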

Still, given the quality improvement that the three-step reflection method offers, I decided to try it. I chose gemini-1.5-flash for a small experiment, mainly because it is capable enough and free to use. Apart from the rate limit, it is almost unrestricted.

Prompt Writing Ideas

The prompts follow Andrew Ng's three-step reflection workflow:

  • Step one requires the AI to translate literally.
  • Step two requires evaluating the literal translation and providing optimization suggestions.
  • Step three requires re-translation based on the optimization suggestions.

The difference is that the prompt also requires the returned content to be valid SRT, although the model may not always comply perfectly.
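As an illustration of how the three steps and the SRT constraint might be combined into one prompt, here is a hedged sketch. The wording is invented for this example and is not the prompt the tool actually uses; the three step tags match the example output shown later in this article, while `<source_srt>` and the function name are assumptions.

```python
def build_reflection_prompt(srt_chunk: str, target_language: str) -> str:
    """Assemble one prompt that asks for all three steps in tagged sections."""
    return (
        f"Translate the SRT subtitles below into {target_language} in three steps.\n"
        "Keep every subtitle number and timestamp line exactly as in the source,\n"
        "and never merge or split subtitles.\n\n"
        "Step 1: translate literally and put the full SRT inside\n"
        "<step1_initial_translation></step1_initial_translation>.\n"
        "Step 2: review the literal translation and list concrete improvements inside\n"
        "<step2_reflection></step2_reflection>.\n"
        "Step 3: retranslate using those suggestions and put the full SRT inside\n"
        "<step3_refined_translation></step3_refined_translation>.\n\n"
        # <source_srt> is a hypothetical wrapper tag chosen for this sketch.
        "<source_srt>\n"
        f"{srt_chunk.strip()}\n"
        "</source_srt>"
    )
```

Asking for tagged sections keeps the final translation easy to separate from the intermediate steps when the response comes back.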

Building a Simple API

One problem with the three-step reflection approach is that it consumes far more tokens: both the prompt and the output are longer. In addition, Gemini enforces a rate limit, and exceeding it returns a 429 error, so a pause is needed after each request.
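One way to cope with the 429 case is to wait and retry. The sketch below is deliberately independent of any particular client library: `RateLimitError` is a hypothetical stand-in for whatever exception the client raises on HTTP 429, and `send` is any function that performs one request. The growing wait time is an assumption of this example, not something the tool is known to do.

```python
import time
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises on HTTP 429."""


def call_with_retry(
    send: Callable[[], str],
    max_attempts: int = 3,
    pause_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send(); on a rate-limit error, wait and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts, let the caller see the error
            # Wait longer after each failure: 10 s, then 20 s, and so on.
            sleep(pause_seconds * attempt)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.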

The backend API is built with Flask, and the front end is a simple single-page application built with Bootstrap 5.

Note that Gemini is not directly accessible from mainland China, so a VPN or proxy is required there.

The interface has two settings:

  • Subtitles per translation: The number of subtitles sent in a single translation request. If it is too large, the request may exceed the token limit and fail; if it is too small, the model has little context to work with and many more requests are needed. A value between 30 and 100 is recommended; the default is 50.
  • Pause after translation (seconds): The delay that keeps requests from arriving too frequently and triggering a 429 error. By default, the tool waits 10 seconds after each request before sending the next one.
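The two settings above amount to a batching loop, which can be sketched as follows. This assumes each subtitle is already available as one text block and that `translate_batch` is some function that sends one request; both names are hypothetical. The sketch skips the pause after the final batch, which is a small departure from pausing after every request.

```python
import time
from typing import Callable, Iterator, List


def batches(cues: List[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` subtitles."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(cues), size):
        yield cues[start:start + size]


def translate_in_batches(
    cues: List[str],
    translate_batch: Callable[[str], str],
    size: int = 50,
    pause_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Translate subtitles group by group, pausing between requests."""
    groups = list(batches(cues, size))
    results = []
    for i, group in enumerate(groups):
        # Rejoin the group with blank lines so each request is valid SRT.
        results.append(translate_batch("\n\n".join(group)))
        if i < len(groups) - 1:
            sleep(pause_seconds)  # no need to wait after the final batch
    return results
```

With the defaults, a file of 120 subtitles becomes three requests (50, 50, and 20 subtitles) with two 10-second pauses in between.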

Example of the return result:

<step1_initial_translation>
1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-star system,

2
00:00:04,720 --> 00:00:06,780
We are still multiple universes away from third-type contact.

3
00:00:07,260 --> 00:00:09,880
Weibo has been carrying out filming missions for years now,

4
00:00:10,140 --> 00:00:12,920
Many previously difficult-to-capture photos have been transmitted recently.

5
00:00:13,440 --> 00:00:17,500
In early June, astronomers published this photo in Nature,

6
00:00:18,040 --> 00:00:19,180
Outside the blue core,

7
00:00:19,360 --> 00:00:21,380
There's also this circle of orange light,

8
00:00:21,900 --> 00:00:23,740
This is a new drama-scale sweet donut,

9
00:00:24,380 --> 00:00:25,640
This is a portal.

10
00:00:26,280 --> 00:00:28,100
This is the generation ring of an alien civilization,

</step1_initial_translation>

<step2_reflection>
* **Line 1:** "Five-star system" is likely a mistranslation.  It probably refers to a five-member committee or group, not a star system. Clarify the context.
* **Line 2:** "Multiple universes" seems like an over-exaggeration.  Rephrase for clarity and accuracy.
* **Line 3:** "Weibo" should be explained as a Chinese social media platform.  "Filming missions" is unclear.  Does it mean "posting videos/images"?
* **Line 8:** "Drama-scale sweet donut" is a nonsensical literal translation. Figure out the intended meaning.
* **Line 9:** "Portal" seems out of context.  Verify the intended meaning.
* **Line 10:** "Generation ring" is likely a mistranslation.  Clarify the context.
* **Line 11:** "Organic polycyclic aromatic hydrocarbons" is overly technical for a general audience. Simplify if possible.
* **Line 12 and 14:** Use the correct formatting for the galaxy's name: SPT0418-47.
* **Line 15:** "It hasn't been shortened" is awkward. Remove or rephrase.
* **Line 28:**  The name of the organization and the resource should be translated more naturally and accurately.  Consider breaking this long line into two for better readability.
* **Line 29:**  "Cute plush dolls" may sound childish. Consider rephrasing as "animated characters" or similar.
* **Line 35:** "James Webb Space Telescope" should be used consistently throughout. Shortening to "Webb Telescope" after the first mention is acceptable.
* **Line 44:** "SPD048" is likely a typo.  It should be SPT0418-47 to be consistent.
* **Line 45-46:**  "Standard beautiful photo" is redundant.  Simplify to "beautiful photo".
* **Line 48:** "Grovitational Lenshin" is a typo. Correct to "Gravitational Lensing".
* **Line 50:** The sentence is incomplete. Finish the thought.

</step2_reflection>

<step3_refined_translation>
1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-member group's area of focus.

2
00:00:04,720 --> 00:00:06,780
We are still far from making contact with extraterrestrial life.

3
00:00:07,260 --> 00:00:09,880
The James Webb Space Telescope has been capturing images for a year now,

4
00:00:10,140 --> 00:00:12,920
and has recently transmitted many previously unseen photos.

5
00:00:13,440 --> 00:00:17,500
In early June, astronomers published this image in Nature.

6
00:00:18,040 --> 00:00:19,180
Outside the blue core,

7
00:00:19,360 --> 00:00:21,380
there's a ring of orange light.

8
00:00:21,900 --> 00:00:23,740
This is a large, ring-shaped structure.

9
00:00:24,380 --> 00:00:25,640
This is being investigated.

10
00:00:26,280 --> 00:00:28,100
This is thought to be a sign of an early galaxy.

</step3_refined_translation>

The text within the <step3_refined_translation></step3_refined_translation> tags is extracted as the translation result.
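The extraction step can be sketched with a regular expression. This is a minimal example assuming the response contains one well-formed pair of step 3 tags; the function name is made up for illustration.

```python
import re

# DOTALL lets "." match newlines, since the SRT body spans many lines.
STEP3 = re.compile(
    r"<step3_refined_translation>(.*?)</step3_refined_translation>",
    re.DOTALL,
)


def extract_refined_translation(response: str) -> str:
    """Return the SRT text inside the step 3 tags, without surrounding blanks."""
    match = STEP3.search(response)
    if match is None:
        raise ValueError("response has no <step3_refined_translation> block")
    return match.group(1).strip()
```

Raising an error when the tags are missing makes it easy to detect a truncated or malformed response and retry that batch.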

A simple package is available for download and local testing

Download and unzip the archive, then double-click app.exe; the interface opens automatically in your browser. Enter your Gemini API key and the proxy address, choose the SRT subtitle file to translate, and select the target language to see the results.

Q1: What is the difference between the reflection workflow and traditional machine translation?

A1: The reflection workflow introduces a self-assessment and optimization mechanism, simulating the thinking process of a human translator, and can produce more accurate and natural translation results.

Q2: How long does it take to use the reflection workflow?

A2: Although the reflection workflow requires several passes by the model, it usually takes only 10-20 seconds longer than the traditional approach. Given the improvement in translation quality, the extra time is worthwhile.

Q3: Does the reflection workflow guarantee a correctly formatted result?

A3: No, blank lines or a subtitle count that differs from the original may still occur. For example, if a subtitle contains only 3-5 words and grammatically continues the previous sentence, the two may be merged into a single subtitle in the translation.



The tool now also supports uploading video or audio files. With Gemini, the audio or video is transcribed into subtitles and translated in the same step, and the translated result is returned.

Gemini itself accepts both text and audio/video input, so a single request can transcribe the audio/video into subtitles and translate them.

For example, if a video with English speech is sent to Gemini and the target language is set to Chinese, the result is Chinese subtitles.

1. Subtitle Translation Only

You can paste SRT format subtitle content into the left text box, or directly click the "Upload SRT Subtitles" button to select a subtitle file from your local computer.

Then set the target language. The tool uses the "Three-Step Reflection Translation Method" to instruct Gemini to perform the translation. The result appears in the right text box, and the "Download" button in the lower right corner saves it locally as an SRT file.

2. Transcribe Audio/Video to Subtitles

Click the "Upload Audio/Video for Transcription" button on the left to select any audio or video file, then submit it once the upload finishes. Gemini processes the file and returns subtitles recognized from the speech in it. The recognition quality is quite good.

If the target language is also specified, Gemini will continue to translate the recognized result into the language you specified and then return it. That is, it simultaneously completes the two tasks of subtitle generation and subtitle translation.
