The Problem
Fan-run wikis are a cornerstone of any gaming community. While recently playing through an RPG, I often turned to its wiki for lore details. However, I noticed that some of the in-game entries were missing. As someone who has benefited immensely from these community-driven projects, I felt it was a good time to contribute.
The task was to transcribe a large amount of text from the game. Manually typing everything out is a slow and tedious process, so the clear path forward was to build an automated workflow.
Data Collection
In some games, fans have performed “datamines” of the game files to directly extract text and audio. It would have been simple to parse that pure text and paste it into the wiki. However, I didn’t have that luxury here, so I had to collect screenshots of the missing descriptions and organize the data in a way that would be easy for a data pipeline to ingest.
To keep it simple, I created a Google Sheet with four columns:
- Name: The name of the item
- Images: A list of image names in the format {item_name}_{n_img}.jpg, where n_img is the screenshot’s index within that item’s sequence (parsed in the sketch below).
- Text: Transcribed text from the images
- Location: Where to find the item
I then hopped into the game for a few hours and methodically collected the data.
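With the sheet exported as a CSV, it doubles as a manifest for the pipeline. Here is a minimal sketch of loading it and grouping the screenshots per item, assuming the export is named manifest.csv and the Images column holds a comma-separated list; the natural-sort helper is my own addition so that {item}_10.jpg doesn’t sort before {item}_2.jpg.

import csv
import re

def natural_key(name):
    # Compare digit runs numerically so "x_2.jpg" sorts before "x_10.jpg"
    return [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", name)]

# Map each item name to its screenshots, in reading order
items = {}
with open("manifest.csv", newline="") as f:
    for row in csv.DictReader(f):
        images = [img.strip() for img in row["Images"].split(",")]
        items[row["Name"]] = sorted(images, key=natural_key)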
The First Pass: A Simple OCR Script
The first step was easy enough. Optical Character Recognition (OCR) is a solved problem, and amazing libraries like PaddleOCR make it almost trivial. With a few lines of Python, you can pull text directly from an image.
Here’s a quick look at how simple it is to get started.
import numpy as np
from paddleocr import PaddleOCR
from PIL import Image

# Initialize an English recognition pipeline (models download on first run)
ocr = PaddleOCR(lang="en")

# Run OCR on a screenshot and print each recognized text region
img = Image.open("./sample.png")
for res in ocr.predict(np.array(img)):
    res.print()
I ran a quick test on a screenshot, and it worked like a charm! I thought I’d be done in an hour. I was wrong.
The Real Challenge: Reconstructing Scrolling Text
I soon hit a significant roadblock. Many of the descriptions were too long to fit on a single screen, forcing me to capture them in a series of scrolling screenshots.
Suddenly, this was no longer a simple OCR task. It had become a text-reconstruction puzzle. I had a folder of images where single messages were fragmented, and despite my best efforts, the screenshots weren’t always saved in the correct order. Sometimes there would be up to 20 screenshots associated with a single item, so manually piecing the text together would be just as time-consuming as typing it from scratch. This was clearly a job for a Large Language Model (LLM).
LLM-Powered Reconstruction
Instead of just extracting text, I needed a tool that could understand the context of the text and stitch it back together intelligently. The plan was to feed all the screenshots for a given terminal to an LLM and have it produce a single, correct transcription. At under $0.01 per image, this was excellent bang for my buck.
This process came with a couple of practical hurdles. For instance, some of my initial attempts with GPT-4o were blocked by safety features, with the model claiming it “doesn’t do OCR” or that the images contained “harmful content”. A simple switch to a different model (GPT-4.1) resolved the issue.
The core of the solution was a carefully engineered prompt. The strategy involved several key components:
- Set the Scene: I started by telling the model it was looking at in-game computer terminals, giving it essential context.
- Identify the Structure: I instructed it to use the FROM:, TO:, and SUBJECT: headers as anchors for identifying the beginning of a new message.
- Few-Shot Prompting: A key technique was to provide the model with examples. I included a “negative example” of what not to do (incorrectly splitting a message) and a “positive example” of the desired, merged output. This is far more effective than just describing the instructions.
- Handle Messy Input: Finally, I instructed the model to expect cut-off messages and to filter out irrelevant UI noise, like menu numbers at the bottom of the screen.
Here is the final prompt that brought all these elements together:
Please help OCR the following images. These images represent a computer terminal in a video game, where each note can be opened.
At the top of each image there is a title.
Afterwards, there will be the text for that message. However, since the whole message cannot fit in one screenshot, there will be many consecutive images. Please do not include any other text, just the text from the terminal.
If you see a message that is cut off, please continue it in the next image. Please do not split it into multiple messages. For example, avoid output like the following:
—
FROM: Sender
This is an example sentence
From page 1
—
This is an example sentence
From page 1
And now on page 2
End message
—
Instead, I want the text combined like this:
—
FROM: Sender
This is an example sentence
From page 1
And now on page 2
End message
—
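With the prompt in place, the request itself is short. Below is a minimal sketch of how all the screenshots for one item could be sent in a single call using the openai Python SDK; the transcribe_item helper and the default model are my own framing of the approach, not the post’s exact code.

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "..."  # the full prompt shown above

def transcribe_item(image_paths, model="gpt-4.1"):
    # Attach the prompt plus every screenshot for the item to one request
    content = [{"type": "text", "text": PROMPT}]
    for path in image_paths:
        b64 = base64.b64encode(Path(path).read_bytes()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    response = client.chat.completions.create(
        model=model,  # GPT-4o sometimes refused, so GPT-4.1 is the default here
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

Looping transcribe_item over the manifest built earlier turns the whole folder of screenshots into wiki-ready text in one pass.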
The result was a robust workflow that reliably converted a messy folder of screenshots into cleanly formatted text, ready for the wiki. There were a few bad merges that I had to fix manually, but the transcriptions were nearly perfect with both the GPT-4o and GPT-4.1 models, and I saved significant time over transcribing the images by hand.
Conclusion
It was a fascinating exercise that went from a simple script to a more sophisticated AI-powered solution, and a great example of how modern tools can solve even the most niche of problems. And of course, this simple concept extends to other real-world applications involving multi-page documents and images.