Using AI to refine structured descriptions of products in Shopify to maximize the customer experience.
Building on our previous post’s conceptual framework, this article examines the real-world implementation of a two-stage AI content-processing workflow that cut processing costs by over 90%.
How I saved an e-commerce business $20,000+ in labor costs with $33 in API calls and a couple of Python scripts
A client had over 2,000 products on Shopify, with 1,172 “Active” products needing HTML code restructuring and content updating. After acquiring a 30-year-old business and migrating from Magento to Shopify, they lacked the time and capacity to review and restructure product HTML without investing hundreds of hours. Most pages had sufficient content but needed reformatting and cleanup, while others required additional information and a full revamp that was compatible with their custom CSS code.
For this problem, I used the product title, body HTML code, short-description metadata and, if available, product manuals to prompt the genAI models to generate the target content. The goal was to improve quality and add information useful to the customer, which would hopefully also improve search ranking. Search engines such as Google have made updates to stay ahead of AI spam, so quality over quantity remains essential.
To simplify future edits and empower the client, I used the Matrixify app instead of Shopify’s developer API. The client is already familiar with Matrixify, and it allowed me to deliver results in under 30 days for just one month’s usage ($50). This approach provides structured, reusable data the client can retain or revert to later. Hence, it offered both flexibility and cost savings in the long term.
The content generation focus was aligned with the client’s goals (a sketch of how these rules translate into a prompt follows the list):
Standardize Header Structure for Efficient Crawling:
Normalize use of headers (H3 for main, H4 for sections, H5 for subsections)
Eliminate inconsistent use of H1/H2, which harmed SEO
Ensure header structure integrates with custom CSS for styling consistency
Responsive, Clean HTML Formatting:
Reformat HTML to be compatible with their custom CSS, particularly:
Bullet lists
Image/video/link containers
H3–H5 headers
Avoid inline styles and unnecessary formatting
Fix or Flag Broken Media/Links:
Automatically identify and report any broken images, videos or links
Short Description Integration:
Merge “short description” metadata more fluidly into the full product description, especially when it appears before a proper header
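To give a concrete sense of how these rules can be handed to the model, here is a minimal sketch of a system prompt encoding them. The wording is illustrative only and is not the exact prompt used for the client.

```python
# Illustrative system prompt encoding the formatting rules above (assumption, not the exact prompt used).
prompt_system = """You are an e-commerce content editor working on Shopify product pages.
Rewrite the provided product HTML so that:
- Headers use H3 for the main heading, H4 for sections, and H5 for subsections; never H1 or H2.
- Bullet lists, image/video/link containers, and headers are compatible with the store's custom CSS.
- Inline styles and unnecessary formatting are removed.
- Broken images, videos, or links are flagged in a separate ISSUES note rather than silently dropped.
- Any "short description" metadata appearing before a proper header is merged fluidly into the description.
Return only the cleaned HTML followed by the ISSUES note."""
```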
Model Choice: GPT-4o (I initially tested GPT-4o-mini, but 4o followed instructions better on long HTML pages). I used a lower temperature to reduce “creativity” and make the output more focused, predictable and repeatable (see discussion of temperature in LLMs here).
modify_html_with_gpt(prompt=prompt_user, sys_prompt=prompt_system, model="gpt-4o", max_tokens=8000, temp=.3)
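Only the call signature is shown above, so here is a minimal sketch of how such a wrapper could be implemented with the OpenAI Python SDK; the function body and client setup are my assumptions, not the original implementation.

```python
# Minimal sketch of a modify_html_with_gpt wrapper (assumed implementation, not the original).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def modify_html_with_gpt(prompt, sys_prompt, model="gpt-4o", max_tokens=8000, temp=0.3):
    """Send the product HTML plus instructions to the model and return the rewritten HTML."""
    response = client.chat.completions.create(
        model=model,
        max_tokens=max_tokens,
        temperature=temp,  # lower temperature -> more focused, repeatable output
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```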
I used data-driven cutoffs based on the word-count distribution across products (first quartile at 27 words), with randomized target lengths drawn between the first (27) and second (84) quartiles, as sketched below.
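A short sketch of how those randomized targets can be drawn, assuming `existing_descriptions` is a list of the current product description strings (the variable name is hypothetical):

```python
# Data-driven word-count targets: compute quartiles over existing descriptions,
# then draw a random target between Q1 and Q2 for each product.
import numpy as np

word_counts = [len(d.split()) for d in existing_descriptions]  # existing_descriptions is assumed
q1, q2 = np.percentile(word_counts, [25, 50])  # here: roughly 27 and 84 words

rng = np.random.default_rng(42)
target_word_counts = rng.integers(int(q1), int(q2) + 1, size=len(existing_descriptions))
```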
While LLMs are notoriously bad at counting, my hope was that the model would hit the targets closely enough for the randomization to be useful downstream. As expected, it often overshot the target word count, which limits my ability to evaluate whether randomly assigning word counts alters key KPIs for the client.
The goal of the validation stage was to review and clean up the Stage 1 output.
Model Choice: Claude-3.7-Sonnet for validation
modify_html_with_anthropic(prompt=prompt_user, sys_prompt=prompt_system, model="claude-3-7-sonnet-latest", max_tokens=8000, temp=.3)
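As with the GPT call, only the signature is shown, so here is a minimal sketch of the Anthropic-side wrapper using the official Python SDK; the body is an assumption rather than the original code.

```python
# Minimal sketch of a modify_html_with_anthropic wrapper (assumed implementation, not the original).
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def modify_html_with_anthropic(prompt, sys_prompt, model="claude-3-7-sonnet-latest",
                               max_tokens=8000, temp=0.3):
    """Ask Claude to review and clean up the Stage 1 HTML and return the revised text."""
    message = anthropic_client.messages.create(
        model=model,
        max_tokens=max_tokens,
        temperature=temp,
        system=sys_prompt,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
```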
Using a different model family reduces systematic bias and provides strong analytical capabilities for content evaluation. As others have discussed the relative strengths of OpenAI versus Anthropic models, I also find that Sonnet performs relatively well at identifying factual errors and coding bugs, and tends to be less verbose.
I evaluated several metrics for Stage 1 and Stage 2 (a sketch of how each can be computed follows the list), specifically:
Semantic Similarity: Using sentence-transformers/all-MiniLM-L6-v2 model to measure content meaning preservation through cosine similarity in 384-dimensional space (1 = perfect preservation; 0 = no preservation).
Edit Similarity: Compares word arrays to identify unnecessary changes in already adequate content. For example, “I love pizza” vs “Pizza is great” = 0.33 (1 = perfect similarity, 0 = no similarity).
Content Preservation: This evaluates how much of the unique content is preserved in each description. For example, if the original body contains “the red car is fast” and the new body contains “the blue car is slow”, there are 3 shared words (the, car, is) out of 7 total unique words (the, red, blue, car, is, fast, slow), so the preservation (Jaccard overlap) = 3/7 ≈ 0.43 (1 = perfect preservation, 0 = no preservation).
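Here is a sketch of how these three metrics can be computed, assuming sentence-transformers for the embeddings and Python’s difflib for the word-level comparison; the helper names are mine, not taken from the original scripts.

```python
# Sketch of the three content-quality metrics (helper names are illustrative).
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util

st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_similarity(original: str, revised: str) -> float:
    """Cosine similarity of the 384-dimensional MiniLM embeddings (1 = same meaning)."""
    emb = st_model.encode([original, revised])
    return float(util.cos_sim(emb[0], emb[1]))

def edit_similarity(original: str, revised: str) -> float:
    """Word-sequence similarity; 'I love pizza' vs 'Pizza is great' gives about 0.33."""
    return SequenceMatcher(None, original.lower().split(), revised.lower().split()).ratio()

def content_preservation(original: str, revised: str) -> float:
    """Shared unique words over total unique words (Jaccard overlap)."""
    a, b = set(original.lower().split()), set(revised.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0
```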
As expected, variability in these metrics was greatest during Stage 1. Semantic similarity between the original and GPT-produced text remained high, but given that we wanted to restructure the text and, in some cases, gave the model the product manual, it is not surprising that some wording changed.
In Stage 2, the focus was on “cleaning up”, so I expected only minor changes across each of the three metrics. The calculated metrics affirmed this: while some content was corrected and reorganized, it was to a minimal extent.
The added value was that both Stage 1 and Stage 2 flagged products with dead links (404s) and broken images or videos. Specifically, Stage 1 caught errors and Stage 2 found additional ones, demonstrating the value of a second pass.
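As an optional extra step (my addition, not part of the original two-stage workflow), the AI-flagged URLs can be double-checked programmatically before editing, for example with the requests library:

```python
# Optional double-check of AI-flagged URLs (supplementary step, not part of the original workflow).
import requests

def url_is_alive(url: str, timeout: int = 10) -> bool:
    """Return True if the URL responds with a non-error (< 400) status code."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code == 405:  # some servers reject HEAD; retry with GET
            resp = requests.get(url, stream=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False
```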
>> “It’s the best use of AI that I’ve seen so far”
Based on the client’s rolling 365-day data:
* Conservative estimates: 3% visitor increase + 0.05% conversion improvement = $11,585 additional monthly revenue, easily justifying the investment.
*Hypothetical projection.
This was a first pass by the client to take care of the “low hanging fruit”. Based on the findings highlighted in my report, they intend to revisit and make additional updates using this framework. Given that the code is already written, it can easily be reapplied to the descriptions.
This two-stage AI workflow delivered a 95% reduction in processing time while significantly improving content quality. The key to success lies in treating AI as a collaborative tool, using quality metrics for continuous improvement, and designing scalable systems that maintain quality standards.
As a bonus, I provided SEO title and meta description updates for content pages for ~$0.50 in API costs (see genAI for SEO metadata), demonstrating the multi-faceted utility of these tools.
Want to see how this approach can work for your business and/or products?
Contact us to discuss your specific needs and explore implementation options tailored to your business.