GC-101 · Module 2

Multimodal Input & Vibe Coding

Gemini CLI is natively multimodal. You can pass images, screenshots, PDFs, diagrams, and even audio files directly into your prompts using @ file references. This removes the need to describe visual content in words: instead of describing a UI bug, screenshot it; instead of explaining a database schema, reference the diagram.

# Pass a screenshot for UI work
@screenshot.png "Match this design but use our color tokens"

# Reference a PDF specification
@docs/api-spec.pdf "Implement the /users endpoint from this spec"

# Paste an image from clipboard (Windows)
Alt+V then describe what you want

# Reference an architecture diagram
@diagrams/system-arch.png "Add a caching layer between the API and database"
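
The prompts above all follow the same shape: one or more @ file references followed by an instruction. As a minimal sketch of that shape, the helper below assembles such a prompt string and rejects files that are not media types. The function name `build_prompt` and the extension list are hypothetical, chosen for illustration; the list is an assumption and not the CLI's official set of supported formats.

```python
from pathlib import Path

# Assumption: an illustrative set of media extensions, not an official list.
ASSUMED_MEDIA_SUFFIXES = {".png", ".jpg", ".jpeg", ".pdf", ".mp3", ".wav"}


def build_prompt(instruction: str, *paths: str, check_exists: bool = False) -> str:
    """Return a prompt made of one @ reference per file, then the quoted instruction."""
    refs = []
    for raw in paths:
        path = Path(raw)
        # Reject anything outside the assumed media types early.
        if path.suffix.lower() not in ASSUMED_MEDIA_SUFFIXES:
            raise ValueError(f"unexpected file type: {raw}")
        # Optionally confirm the file is really on disk before referencing it.
        if check_exists and not path.is_file():
            raise FileNotFoundError(raw)
        refs.append(f"@{path.as_posix()}")
    # Mirror the layout of the examples: references first, instruction in quotes.
    return " ".join(refs + [f'"{instruction}"'])


if __name__ == "__main__":
    print(build_prompt("Implement the /users endpoint from this spec",
                       "docs/api-spec.pdf"))
```

Passing `check_exists=True` catches a mistyped path before the prompt is ever sent, which is usually cheaper than discovering the problem from a confused response.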

"Vibe coding" is the practice of generating code from natural language and visual references while writing little of it by hand. Gemini CLI is well suited to this pattern. Start with a screenshot or sketch, describe the behavior in plain English, and let Gemini generate the implementation. Then iterate conversationally: "make the sidebar collapsible," "add loading states," "the spacing is off on mobile." This workflow is particularly effective for prototyping and UI development.

The sketch-to-code pattern deserves special attention. Take a photo of a whiteboard sketch or hand-drawn wireframe, pass it to Gemini with @sketch.jpg, and ask for a working implementation. Gemini interprets the visual layout, identifies components, infers relationships, and generates structured code. The result won't be pixel-perfect, but it can get you roughly 70-80% of the way there, a substantial head start over building from a blank file.
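
The sketch-to-code step can also be scripted. The sketch below builds the command for a single one-shot request and only runs it if a `gemini` executable is on the PATH. It assumes that the `-p` flag runs one prompt non-interactively and that @ references are resolved in that mode; confirm both with `gemini --help` for your installed version. The function names and the request text are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def sketch_command(sketch_path: str, request: str) -> List[str]:
    """Build the argument list for a one-shot sketch-to-code request."""
    # Assumption: `-p` accepts a single prompt and honors @ file references.
    return ["gemini", "-p", f"@{sketch_path} {request}"]


def run_sketch_to_code(sketch_path: str, request: str) -> Optional[str]:
    """Run the request and return the CLI's output, or None if gemini is absent."""
    if shutil.which("gemini") is None:
        return None  # CLI not installed; nothing to run.
    result = subprocess.run(
        sketch_command(sketch_path, request),
        capture_output=True,
        text=True,
        check=True,  # Raise CalledProcessError on a non-zero exit code.
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command only; call run_sketch_to_code to actually execute it.
    print(sketch_command(
        "sketch.jpg",
        "Build a working HTML and CSS implementation of this wireframe",
    ))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the request itself contains quotes or spaces.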