GFX-301f · Module 1
Text Rendering in AI Models
4 min read
Text is the hardest element for generative image models. Faces, landscapes, abstract patterns — models handle these well. But ask a model to render "Agent Team Architecture" in Exo 2 Bold at 48px, and the result will be approximately correct at best, illegible at worst. Character forms break down. Kerning is inconsistent. Font identification is unreliable.
The current generation of models (Gemini, DALL-E 3, Midjourney v6) has improved dramatically: short text strings (1-5 words) render correctly roughly 80% of the time. Longer strings degrade, and complex formatting (multiple sizes, weights, and colors in the same image) degrades further. Specific font requests are interpreted, not executed. The model approximates the requested font from training examples, so "Exo 2 Bold" produces something that looks geometric and sans-serif but is not actually Exo 2.
The production implication: for any asset where typography must be pixel-perfect (which is all brand assets), text should be added in post-processing, not generated by the model. The model generates the visual canvas — background, composition, imagery, effects. Text is composited afterward using actual font rendering. This hybrid approach produces visuals with AI-quality imagery and designer-quality type.