Danny Siu

The following project is about video games, but this workflow can also power many real-world applications. You can use this same method to digitize multi-page reports, consolidate photographed receipts, or reconstruct contracts from a series of images. Basically, it’s a technique for converting fragmented sets of images into a single, usable piece of text.

The Problem

Fan-run wikis are a cornerstone of any gaming community. While recently playing through an RPG, I often turned to its wiki for lore details. However, I noticed that some of the in-game entries were missing. As someone who has benefited immensely from these community-driven projects, I felt it was a good time to contribute.

The task was to transcribe a large amount of text from the game. Manually typing everything out is a slow and tedious process, so the clear path forward was to build an automated workflow.


Data Collection

For some games, the community has “datamined” the game files to extract text and audio directly. With that kind of dump, it would have been simple to parse the raw text and paste it into the wiki. I didn’t have that luxury here, so I had to collect screenshots of the missing entries and organize the data in a way that would be easy for a data pipeline to ingest.

To keep it simple, I created a Google Sheet with four columns:

  • Name: The name of the item
  • Images: A list of image names in the format {item_name}_{n_img}.jpg, where n_img indexes the unique screenshots of text for that item (see the grouping sketch after this list).
  • Text: Transcribed text from the images
  • Location: Where to find the item
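As an aside, here’s a minimal sketch of how a pipeline might group those files back by item name; the folder path is a placeholder:

```python
from collections import defaultdict
from pathlib import Path

# Group screenshots by item using the {item_name}_{n_img}.jpg convention.
groups = defaultdict(list)
for path in sorted(Path("screenshots").glob("*.jpg")):
    item_name, _, _ = path.stem.rpartition("_")  # strip the trailing _{n_img}
    groups[item_name].append(path)

for item, images in groups.items():
    print(item, [p.name for p in images])
```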

I then hopped into the game for a few hours and methodically collected the data.

The First Pass: A Simple OCR Script

The first step was easy enough. Optical Character Recognition (OCR) is a solved problem, and amazing libraries like paddleocr make it almost trivial. With a few lines of Python, you can pull text directly from an image.

Here’s a quick look at how simple it is to get started.
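A minimal sketch, assuming the classic paddleocr 2.x API (the screenshot path is a placeholder):

```python
from paddleocr import PaddleOCR

# Initialize the OCR engine once; model weights download on first use.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

# Run detection and recognition on a single screenshot.
result = ocr.ocr("screenshots/terminal_entry_1.jpg", cls=True)

# Each detected line comes back as [bounding_box, (text, confidence)].
for line in result[0]:
    text, confidence = line[1]
    print(f"{confidence:.2f}  {text}")
```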

I ran a quick test on a screenshot, and it worked like a charm! I thought I’d be done in an hour. I was wrong.


The Real Challenge: Reconstructing Scrolling Text

I soon hit a significant roadblock. Many of the descriptions were too long to fit on a single screen, forcing me to capture them in a series of scrolling screenshots.

Suddenly, this was no longer a simple OCR task. It had become a text-reconstruction puzzle. I had a folder of images where single messages were fragmented, and despite my best efforts, the screenshots weren’t always saved in the correct order. Sometimes there would be up to 20 screenshots associated with a single item, so manually piecing the text together would be just as time-consuming as typing it from scratch. This was clearly a job for a Large Language Model (LLM).

LLM-Powered Achievements

Instead of just extracting text, I needed a tool that could understand the context of the text and stitch it back together intelligently. The plan was to feed all the screenshots for a given terminal to an LLM and have it produce a single, correct transcription. At under $0.01 per image, this was excellent bang for my buck.
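A minimal sketch of that plan using the OpenAI Python SDK; the function name and structure here are my own illustration, not the original script:

```python
import base64
from pathlib import Path

from openai import OpenAI  # pip install openai


def transcribe_item(client: OpenAI, model: str,
                    image_paths: list[Path], prompt: str) -> str:
    """Send every screenshot for one item in a single request."""
    content = [{"type": "text", "text": prompt}]
    for path in image_paths:
        b64 = base64.b64encode(path.read_bytes()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


# Example usage (the client reads OPENAI_API_KEY from the environment):
# client = OpenAI()
# text = transcribe_item(client, "gpt-4.1", image_paths, prompt)
```

Sending all of an item’s screenshots in one request lets the model decide the correct ordering itself, rather than relying on file names.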

This process came with a couple of practical hurdles. For instance, some of my initial attempts using GPT-4o were blocked by safety features, with the model inexplicably stating that it “doesn’t do OCR” or that there was “harmful content”. A simple switch to a different model (GPT-4.1) resolved the issue.
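A hypothetical fallback along those lines, reusing the transcribe_item sketch above (the refusal check is a crude heuristic, not what my actual script did):

```python
def transcribe_with_fallback(client, image_paths, prompt,
                             models=("gpt-4o", "gpt-4.1")):
    """Try each model in turn until one returns a real transcription."""
    for model in models:
        reply = transcribe_item(client, model, image_paths, prompt)
        # Crude refusal check; real refusals vary in wording.
        if "sorry" not in reply.lower() and "can't assist" not in reply.lower():
            return reply
    raise RuntimeError("Every model refused the request")
```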

The core of the solution was a carefully engineered prompt. The strategy involved several key components:

  • Set the Scene: I started by telling the model it was looking at in-game computer terminals, giving it essential context.
  • Identify the Structure: I instructed it to use the FROM:, TO:, and SUBJECT: headers as an anchor to identify the beginning of a new message.
  • Few-Shot Prompting: A key technique was to provide the model with examples. I included a “negative example” of what not to do (incorrectly splitting a message) and a “positive example” of the desired, merged output. This is far more effective than describing the rules in words alone.
  • Handle Messy Input: Finally, I instructed the model to expect cut-off messages and to filter out irrelevant UI noise, like menu numbers at the bottom of the screen.

The final prompt brought all these elements together.
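A sketch of its shape, following the four components above (the wording below is my illustration, not the verbatim original):

```python
PROMPT = """\
You are transcribing text from screenshots of in-game computer terminals.

Each message begins with FROM:, TO:, and SUBJECT: headers. Treat these
headers as the anchor for the start of a new message; never split a
single message into two.

BAD (one message incorrectly split in two):
  FROM: ... TO: ... SUBJECT: ...
  ...the reactor is still
  ---
  offline until further notice.

GOOD (fragments merged into one message):
  FROM: ... TO: ... SUBJECT: ...
  ...the reactor is still offline until further notice.

The screenshots may arrive out of order, and messages may be cut off at
screen boundaries. Reassemble them into complete messages. Ignore UI
noise such as the menu numbers at the bottom of the screen. Return only
the transcribed text.
"""
```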

 

The result was a robust workflow that reliably converted a messy folder of screenshots into perfectly formatted text, ready for the wiki. There were a few bad merges that I had to fix by hand, but the transcriptions from the GPT-4o and GPT-4.1 models were otherwise nearly perfect, and the workflow saved me significant time over transcribing the images myself.

Conclusion

It was a fascinating exercise that went from a simple script to a more sophisticated AI solution, and a great example of how modern tools can be applied to solve even the most niche of problems. And of course, this simple concept can be extended to other real-world applications involving multi-page documents and images.
