CDX-101 · Module 2
Dictation & Multimodal Input
3 min read
Codex supports multimodal input through the -i flag, which lets you pass images (screenshots, diagrams, wireframes) alongside your text prompt. This is transformative for frontend work — take a screenshot of a design mockup or a visual bug, pass it to Codex, and let it see what you see.
The vision capabilities vary by model. GPT-5.x-Codex models have the strongest vision understanding, while GPT-4.1 provides solid baseline image comprehension. For best results, use clear, high-resolution screenshots and always pair them with a text description of what you want done.
```shell
# Pass a screenshot with your prompt
codex -i screenshot.png "the login button is misaligned — fix the CSS"

# Pass a wireframe for implementation
codex -i wireframe.png "implement this dashboard layout using our existing component library"

# Pass multiple images
codex -i before.png -i after.png "make the current page look like the 'after' design"
```
Voice-to-text workflows are an emerging pattern with Codex. Using your OS dictation feature (macOS Dictation, Windows Voice Typing, or a third-party tool), you can speak your prompt naturally and let speech-to-text handle the transcription. This is particularly effective for brain-dump style prompts where you want to describe a problem conversationally.
The pattern works best when you dictate freely, then ask Codex to clean up and confirm the intent before executing: "I just dictated a messy description — summarize what I want in 3 bullet points and confirm before proceeding."
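As a concrete sketch of that confirm-first step, assuming the `codex` CLI is on your PATH, you can wrap the raw transcript in an explicit summarize-and-confirm instruction before handing it over (the transcript text below is an invented example, not real dictation output):

```shell
# Assemble a confirm-first prompt around a raw dictated transcript.
# TRANSCRIPT is a made-up example of messy speech-to-text output.
TRANSCRIPT="so the settings page is broken, the save button does nothing, and the success toast never shows up"
PROMPT="I just dictated a messy description: ${TRANSCRIPT}. Summarize what I want in 3 bullet points and confirm before proceeding."

# Hand the wrapped prompt to Codex:
# codex "$PROMPT"
echo "$PROMPT"
```

Keeping the confirmation request inside the prompt itself means that even a garbled transcription gets restated back to you before Codex touches any files.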
Do This
- Use -i for UI bugs, design implementation, and visual regression fixes
- Combine voice dictation with a "summarize back to me" step for accuracy
- Use high-resolution, tightly cropped screenshots focused on the relevant area
Avoid This
- Sending a full-screen screenshot when the issue is in one small component
- Relying solely on images without text context
- Dictating and executing in full-auto without reviewing the transcription first