GFX-301a · Module 1
Framework Foundations
3 min read
The PaperBanana paper from Google posed the question: what if we treated image generation like a design agency with a team of specialists?
Researchers found that even though AI could write papers, run experiments, and review literature, it couldn't reliably draw the figures. Their breakthrough insight: stop asking one model to do everything. Instead, build a team — a retriever finds reference images, a planner writes detailed visual descriptions, a stylist applies aesthetic guidelines, and a visualizer generates the image through iterative loops. Each specialist does one thing well.
Five stages form the core pipeline — each handled by a different specialist agent working in sequence.
Stage 1: The retriever scans reference images to understand existing visual patterns.
Stage 2: The planner converts the subject matter into a rich visual description, turning science into imagery.
Stage 3: The stylist applies aesthetic guidelines to that description.
Stage 4: The visualizer generates the image, typically over about three rounds of iteration.
Stage 5: The critic evaluates the result and triggers another round if needed.
The pipeline is sequential by design: each stage's output feeds the next.
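The five stages above can be sketched as a sequential pipeline. This is an illustrative Python sketch, not PaperBanana's actual code: every function name, class, and the toy acceptance rule in `critique` are assumptions made for the example.

```python
# Hypothetical sketch of the five-stage specialist pipeline.
# All names and logic here are illustrative, not from the paper's codebase.
from dataclasses import dataclass

@dataclass
class Draft:
    description: str
    image: str  # stand-in for generated image data

def retrieve(subject: str) -> list[str]:
    """Stage 1: find reference images matching the subject."""
    return [f"ref_image_for:{subject}"]

def plan(subject: str, references: list[str]) -> str:
    """Stage 2: convert subject matter into a rich visual description."""
    return f"visual description of {subject}, informed by {len(references)} reference(s)"

def stylize(description: str) -> str:
    """Stage 3: apply aesthetic guidelines to the description."""
    return description + " [styled: clean palette, consistent labels]"

def visualize(description: str) -> Draft:
    """Stage 4: generate an image from the styled description."""
    return Draft(description=description, image=f"image({description})")

def critique(draft: Draft) -> bool:
    """Stage 5: toy acceptance check; a real critic would score the image."""
    return "[styled" in draft.description

def run_pipeline(subject: str, max_rounds: int = 3) -> Draft:
    references = retrieve(subject)           # Stage 1
    description = plan(subject, references)  # Stage 2
    styled = stylize(description)            # Stage 3
    draft = visualize(styled)                # Stage 4
    for _ in range(max_rounds):              # Stage 5 feedback loop
        if critique(draft):
            break
        draft = visualize(styled)            # regenerate if rejected
    return draft
```

The key design property is that each stage consumes only the previous stage's output, so any one specialist can be swapped out without touching the others.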
Adding multiple rounds of critique from a dedicated agent increased image accuracy by nearly 10 percentage points in the PaperBanana experiments.
Without critique, the pipeline reached 45.1% accuracy in recreating or adapting reference images. Adding 1-3 rounds of critique from a specialized agent pushed accuracy to roughly 55%. The critic evaluates conciseness, aesthetics, visual polish, and faithfulness to the original intent. In the PaperBanana experiments this was the single highest-leverage addition to the pipeline: a feedback loop that catches what the generator missed.
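The critique loop can be sketched as scoring a draft against the four criteria named above and revising only what fails. The criteria names come from the text; the scoring scale, threshold, and revision logic are invented for this sketch.

```python
# Illustrative multi-criterion critique loop. Criteria names are from the
# article; the 0.0-1.0 scores, the 0.7 threshold, and toy_revise are assumptions.
CRITERIA = ["conciseness", "aesthetics", "visual_polish", "faithfulness"]

def score_draft(draft: dict) -> dict:
    """Hypothetical critic: read each criterion's score (0.0-1.0) off the draft."""
    return {c: draft.get(c, 0.0) for c in CRITERIA}

def needs_revision(scores: dict, threshold: float = 0.7) -> list:
    """Return the criteria that fall below the acceptance threshold."""
    return [c for c, s in scores.items() if s < threshold]

def critique_loop(draft: dict, revise, max_rounds: int = 3) -> dict:
    """Run up to max_rounds of critique, revising only the failing criteria."""
    for _ in range(max_rounds):
        failing = needs_revision(score_draft(draft))
        if not failing:
            break  # all criteria pass; stop early
        draft = revise(draft, failing)
    return draft

def toy_revise(draft: dict, failing: list) -> dict:
    """Stand-in for a revision round: nudge each failing score upward."""
    return {**draft, **{c: min(1.0, draft.get(c, 0.0) + 0.3) for c in failing}}

final = critique_loop(
    {"conciseness": 0.9, "aesthetics": 0.2,
     "visual_polish": 0.8, "faithfulness": 0.5},
    toy_revise,
)
```

Capping the loop at a few rounds mirrors the 1-3 critique rounds described above: most of the accuracy gain comes from the first pass, and a hard cap keeps the pipeline from iterating indefinitely.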