CDX-101 · Module 2

Dictation & Multimodal Input

3 min read

Codex supports multimodal input through the -i flag, which lets you pass images (screenshots, diagrams, wireframes) alongside your text prompt. This is especially powerful for frontend work: take a screenshot of a design mockup or a visual bug, pass it to Codex, and let it see what you see.

The vision capabilities vary by model. GPT-5.x-Codex models have the strongest vision understanding, while GPT-4.1 provides solid baseline image comprehension. For best results, use clear, high-resolution screenshots and always pair them with a text description of what you want done.

# Pass a screenshot with your prompt
codex -i screenshot.png "the login button is misaligned — fix the CSS"

# Pass a wireframe for implementation
codex -i wireframe.png "implement this dashboard layout using our existing component library"

# Pass multiple images
codex -i before.png -i after.png "make the current page look like the 'after' design"

Voice-to-text workflows are an emerging pattern with Codex. Using your OS dictation feature (macOS Dictation, Windows Voice Typing, or a third-party tool), you can speak your prompt naturally and let speech-to-text handle the transcription. This is particularly effective for brain-dump style prompts where you want to describe a problem conversationally.

The pattern works best when you dictate freely, then ask Codex to clean up and confirm the intent before executing: "I just dictated a messy description — summarize what I want in 3 bullet points and confirm before proceeding."
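Concretely, a dictated session might look like this. The transcription text below is illustrative, not a real recording; the key part is the trailing instruction that asks Codex to summarize and confirm before it touches anything:

```shell
# Raw dictation pasted straight into Codex, filler words and all.
# The final sentence turns a messy brain-dump into a confirm-first workflow.
codex "okay so on the settings page when you resize the window the sidebar \
kind of overlaps the main content and the save button disappears sometimes \
— I just dictated a messy description, summarize what I want in 3 bullet \
points and confirm before proceeding"
```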

Do This

  • Use -i for UI bugs, design implementation, and visual regression fixes
  • Combine voice dictation with a "summarize back to me" step for accuracy
  • Use high-resolution, tightly cropped screenshots focused on the relevant area

Avoid This

  • Send a full-screen screenshot when the issue is in one small component
  • Rely solely on images without text context
  • Dictate and execute in full-auto without reviewing the transcription first
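The "tightly cropped" advice is easy to put into practice. A minimal sketch for macOS, where `screencapture -i` opens an interactive region selector (the filename and prompt are illustrative):

```shell
# macOS: drag-select just the broken component rather than the full screen
screencapture -i button-crop.png

# Hand the tight crop to Codex with a focused text description
codex -i button-crop.png "the primary button's focus ring is clipped — fix it"
```

On Windows or Linux, any region-capture tool (Snipping Tool, a compositor screenshot utility) serves the same purpose; what matters is that the image contains only the component you want Codex to reason about.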