
Images Produced in Dialogue
Most AI creative tools produce artefacts ontologically similar to those of earlier technologies, often at lower quality, or simply make existing workflows faster. They miss the question: what becomes possible that wasn't before? AI tools carry the totality of human epistemology in their training data, but they are biased toward statistical probability. They give you what is most likely, not what is most yours.
What if image creation were treated as dialogue, not transaction? You speak and draw simultaneously, operating in two registers: voice, which is semantic, and gesture, which is spatial. The tool triangulates meaning from both. The gap between what you meant and what it heard is where the conversation happens. You learn the tool's language. It learns yours. The image is a trace of exchange.
Contribution
Software Research
Team
Solo
Drawing Defines What Gets Created
Geoffrey Bawa's architecture office in Sri Lanka developed what they called "the shaky effect": overscoring clean technical lines with freehand ink pens, rendering flora with botanical authenticity. The drawing tool enforced a different kind of observation. The buildings changed because the drawings changed because the tools changed.
Don Ihde's postphenomenology maps human-technology relations that apply directly here. A pen on paper is embodiment: the tool withdraws from consciousness. iPad vector smoothing is hermeneutic: you read what the algorithm tells you about your gesture. AI-mediated drawing is alterity: you negotiate with a quasi-other that has its own tendencies. And blind drawing, where you only see the AI's interpretation, restructures the feedback loop entirely. Each step changes the relationship, not just the speed.
The Gap
Most AI drawing tools collapse everything into one channel: text prompting. This project keeps two channels separate and lets them inform each other. Voice is semantic: "the sun," "next to," "maybe something like..." It carries intent, emphasis, uncertainty, self-correction in real time. Gesture is spatial: where you draw, how big, the physical relationship between shapes. Vision bridges them. Rather than matching shape metadata, the system uses Claude Vision to see "the sun" on the canvas and understand it semantically.
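A minimal sketch of that bridging step, assuming the Anthropic TypeScript SDK and a lightweight shape summary passed alongside the canvas snapshot; the model id, prompt wording, and helper shapes are illustrative rather than the project's exact configuration:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Hypothetical summary of a canvas shape, passed alongside the snapshot.
interface ShapeSummary {
  id: string;
  type: string; // e.g. "ellipse", "draw"
  x: number;
  y: number;
  color: string;
}

// Ask Claude Vision which shapes a spoken phrase refers to, given a base64 PNG
// snapshot of the canvas and the shape summaries.
async function identifyShapes(
  phrase: string,
  canvasPng: string,
  shapes: ShapeSummary[],
): Promise<string[]> {
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // illustrative model id
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: "image/png", data: canvasPng } },
          {
            type: "text",
            text:
              `Shapes on the canvas: ${JSON.stringify(shapes)}\n` +
              `The user said: "${phrase}". ` +
              `Reply with a JSON array containing the ids of the shapes that phrase refers to.`,
          },
        ],
      },
    ],
  });

  // Expect the first content block to be text containing the JSON array of ids.
  const block = response.content[0];
  return block.type === "text" ? JSON.parse(block.text) : [];
}
```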
The gap between what you meant and what the system heard is where the conversation happens. Sol LeWitt's wall drawings begin with instructions that others execute, deliberately vague so the end result is not completely controlled. The gap between instruction and execution is where something new appears.
The System
Four pipelines run concurrently. The Drawing pipeline captures pen input through tldraw, creating vector shapes. The Voice pipeline handles continuous listening through ElevenLabs, accumulating transcripts and cleaning them for refined prompts. The Render pipeline captures the canvas as a snapshot every ~100ms and passes it through FAL.ai's Latent Consistency Model for real-time image generation. The Transform pipeline uses Claude Vision to semantically identify shapes on the canvas and execute voice-commanded manipulations.
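In outline, the wiring might look like the sketch below. Every helper here (onPenStroke, onTranscript, captureCanvasPng, generateImage, runTransform) is a hypothetical stand-in for the real tldraw, ElevenLabs, FAL.ai, and Claude integrations, and the keyword test used to route commands is an assumption:

```typescript
// Hypothetical stand-ins for the real integrations.
declare function onPenStroke(handler: (strokeId: string) => void): void;      // Drawing: tldraw pen input
declare function onTranscript(handler: (text: string) => void): void;         // Voice: ElevenLabs listening
declare function captureCanvasPng(): Promise<string>;                         // Render: canvas snapshot
declare function generateImage(png: string, prompt: string): Promise<void>;   // Render: LCM via FAL.ai
declare function runTransform(command: string, png: string): Promise<void>;   // Transform: Claude Vision

let refinedPrompt = "";                    // accumulated, cleaned voice transcript
let queuedCommand: string | null = null;   // next voice-commanded manipulation

// Drawing pipeline: vector shapes accumulate on the canvas as you draw.
onPenStroke((strokeId) => {
  console.log("stroke added:", strokeId);
});

// Voice pipeline: continuous listening; commands are queued, descriptions refine the prompt.
onTranscript((text) => {
  if (/^(move|make|turn|delete)\b/i.test(text)) {
    queuedCommand = text;
  } else {
    refinedPrompt = `${refinedPrompt} ${text}`.trim();
  }
});

// Render pipeline: snapshot the canvas roughly every 100ms and regenerate the image.
setInterval(async () => {
  const png = await captureCanvasPng();
  await generateImage(png, refinedPrompt);
}, 100);

// Transform pipeline: when a command is queued, resolve it against the current canvas.
setInterval(async () => {
  if (!queuedCommand) return;
  const command = queuedCommand;
  queuedCommand = null;
  await runTransform(command, await captureCanvasPng());
}, 250);
```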
When you say "move the car to the left," the system runs Claude Vision to identify which of the shapes on the canvas constitute "the car," groups them through hybrid proximity detection, calculates the transformation, and executes it. The voice agent replaces buttons entirely: continuous listening, guided drawing when objects are missing, confirmation before transforms.
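A hedged sketch of that flow against the tldraw Editor API, reusing the hypothetical identifyShapes call above and a groupByProximity helper in the spirit of the grouping described under Process:

```typescript
import type { Editor, TLShape } from "tldraw";

// Hypothetical helpers from the sketches in this piece.
declare function captureCanvasPng(editor: Editor): Promise<string>;
declare function identifyShapes(phrase: string, png: string, shapes: TLShape[]): Promise<string[]>;
declare function groupByProximity(all: TLShape[], seedIds: string[]): TLShape[];

// "Move the car to the left": resolve the phrase to shapes, expand to the full
// group via proximity, then translate every shape in the group.
async function handleMoveCommand(editor: Editor, phrase: string, dx: number, dy: number) {
  const shapes = editor.getCurrentPageShapes();
  const png = await captureCanvasPng(editor);

  const seedIds = await identifyShapes(phrase, png, shapes); // which shapes are "the car"?
  const group = groupByProximity(shapes, seedIds);           // pull in the nearby strokes

  editor.updateShapes(
    group.map((shape) => ({ id: shape.id, type: shape.type, x: shape.x + dx, y: shape.y + dy })),
  );
}
```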
Process
The project evolved through eleven phases over three months. It started as a pattern-matching voice parser that could handle ten exact commands. Phase 2 replaced this with Claude Haiku function calling, enabling flexible natural language. Phase 3 added semantic shape identification: "the sun" maps to a yellow circle through property analysis. Phase 6 integrated Claude Vision, jumping accuracy from 70% to 95% by actually seeing the canvas. Phase 7 introduced hybrid proximity grouping. Phase 9 replaced the two-button interface entirely with an ElevenLabs conversational agent.
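One plausible reading of the Phase 7 grouping, sketched on plain bounding boxes rather than tldraw shapes; the exact "hybrid" criteria aren't spelled out here, so combining box-to-box distance with a size-relative threshold is an assumption:

```typescript
interface Box { id: string; x: number; y: number; w: number; h: number }

// Start from the shapes Claude Vision identified, then repeatedly absorb any shape
// whose bounding box lies within a threshold scaled by the sizes of the shapes involved.
function groupByProximity(all: Box[], seedIds: string[], factor = 0.5): Box[] {
  const group = new Map(all.filter((b) => seedIds.includes(b.id)).map((b) => [b.id, b]));

  let grew = true;
  while (grew) {
    grew = false;
    for (const candidate of all) {
      if (group.has(candidate.id)) continue;
      for (const member of group.values()) {
        const threshold = factor * Math.min(
          Math.max(member.w, member.h),
          Math.max(candidate.w, candidate.h),
        );
        if (boxDistance(member, candidate) <= threshold) {
          group.set(candidate.id, candidate);
          grew = true;
          break;
        }
      }
    }
  }
  return [...group.values()];
}

// Gap between two axis-aligned bounding boxes (0 if they overlap).
function boxDistance(a: Box, b: Box): number {
  const dx = Math.max(0, Math.max(a.x - (b.x + b.w), b.x - (a.x + a.w)));
  const dy = Math.max(0, Math.max(a.y - (b.y + b.h), b.y - (a.y + a.h)));
  return Math.hypot(dx, dy);
}
```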
Each phase revealed something about the relationship between voice, vision, and gesture. The stale closure bug taught that speech callbacks in React need refs, not state. "Move the sun" with no destination reveals the system's need for specificity, which is also the tool teaching you its language.
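The ref pattern behind that fix, in minimal form; the recognition interface is a hypothetical stand-in for the ElevenLabs listener:

```typescript
import { useEffect, useRef, useState } from "react";

// Speech callbacks are registered once, so any state they close over is frozen at
// registration time. Mirroring the state into a ref lets the callback read the
// latest value without re-registering.
function useTranscriptAccumulator(recognition: { onResult: (cb: (text: string) => void) => void }) {
  const [transcript, setTranscript] = useState("");
  const transcriptRef = useRef(transcript);
  transcriptRef.current = transcript; // keep the ref in sync on every render

  useEffect(() => {
    recognition.onResult((text) => {
      // Reading transcriptRef.current sees the latest value; reading `transcript`
      // directly would see the stale closure captured when the callback was registered.
      setTranscript((transcriptRef.current + " " + text).trim());
    });
  }, [recognition]);

  return transcript;
}
```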
What Emerged
The system's misunderstandings act as a mirror for communicative assumptions. "Move this" fails because the system cannot resolve pronouns without visual context. "Move the sun" without a destination produces an arbitrary guess. These are productive failures: the tool teaching you to be specific, to name rather than point, to provide spatial anchors. Over time, you develop fluency.
The image that results carries the history of its making: the negotiation, the corrections, the moments where the gap between intent and interpretation produced something unexpected. New modalities are necessary. Tools designed to teach us their language while learning ours.




