GFX-101 · Module 1

How Models See Your Prompt

3 min read

When you type a prompt, the model does not read it the way you do. Your text gets broken into tokens — small word pieces — and each token gets converted into a numerical embedding by a text encoder (usually CLIP or T5). These embeddings live in a high-dimensional space where "sunset" sits near "golden hour" and "warm light" but far from "fluorescent" and "cold." The diffusion model then uses this encoded representation to guide noise removal across dozens of steps, gradually shaping random static into a coherent image.
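The idea that related concepts sit near each other in embedding space can be illustrated with cosine similarity. The vectors below are hand-picked toy values, not real encoder output (actual CLIP or T5 embeddings have hundreds of dimensions), but the geometry works the same way:

```python
import math

# Toy 3-dimensional "embeddings", hand-picked for illustration only.
# Real text encoders produce vectors with hundreds of dimensions.
embeddings = {
    "sunset":      [0.90, 0.80, 0.10],
    "golden hour": [0.85, 0.75, 0.15],
    "fluorescent": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "sunset" points almost the same way as "golden hour"...
print(round(cosine_similarity(embeddings["sunset"], embeddings["golden hour"]), 2))
# ...but in a very different direction from "fluorescent".
print(round(cosine_similarity(embeddings["sunset"], embeddings["fluorescent"]), 2))
```

Nearby vectors mean the model treats the words as near-synonyms when guiding noise removal, which is why swapping "sunset" for "golden hour" changes an image far less than swapping it for "fluorescent".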

This architecture explains several counterintuitive behaviors. Word order matters because early tokens receive stronger attention weighting in most models, so put your most important concept first. Phrasing matters because "a dog sitting on a hill at sunset" activates different concept clusters than "sunset, dog, hill, sitting", even though a human would read both as describing the same scene. Prompt length has diminishing returns because the text encoder has a fixed context window (typically 77 tokens for CLIP), and anything beyond that limit is truncated or compressed.
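The truncation behavior is easy to sketch. This toy version splits on whitespace, which is a simplification (real tokenizers use subword pieces, so a prompt usually produces more tokens than words), but the cutoff logic is the same:

```python
MAX_TOKENS = 77  # CLIP's fixed context window

def naive_tokenize(prompt):
    """Whitespace split; a stand-in for real subword tokenization,
    which typically yields MORE tokens than there are words."""
    return prompt.split()

def encode_with_truncation(prompt, max_tokens=MAX_TOKENS):
    """Return the tokens the encoder keeps and the ones it silently drops."""
    tokens = naive_tokenize(prompt)
    return tokens[:max_tokens], tokens[max_tokens:]

# A 100-word prompt: everything past position 77 never reaches the model.
long_prompt = " ".join(f"word{i}" for i in range(100))
kept, dropped = encode_with_truncation(long_prompt)
print(len(kept), len(dropped))
```

The practical takeaway: detail you add past the context window is not "weakly considered", it is invisible, so front-load what matters.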