

Images Produced in Dialogue

Most AI creative tools produce artefacts ontologically identical to those of previous technologies, only faster or at lower quality. They miss the more interesting question: what becomes possible that wasn't before? AI tools carry the totality of human epistemology in their training data, but they are biased toward statistical probability. They give you what is most likely, not what is most yours.
What if image creation were treated as dialogue rather than transaction? You speak and draw simultaneously, operating in two registers: voice, which is semantic, and gesture, which is spatial. The tool triangulates meaning from both. The gap between what you meant and what it heard is where the conversation happens. You learn the tool's language. It learns yours. The image is a trace of the exchange.
Contribution: Software Research
Team: Solo
Demo of final prototype (listen with sound on).
Ideas are Fuzzy Dreams
Ideas are fuzzy creative kernels, not fully formed. The moment you send a prompt, the model takes that kernel and crystallises it. The gaps get filled with what is statistically likely. A couple with an umbrella. A dog. Mountains you never specified.
The problem isn't that the output is bad. It's that you hadn't finished thinking. The artefact arrives before the intent does, covered in the model's assumptions rather than yours. What comes back is less what you wanted and more what was most likely.
Drawing in dialogue
Before I was in tech, I was an architect, and the way we produced images there was quite different. We drew on paper while talking through ideas. A pause before you spoke carried weight. Drawing over someone else's line with more pressure was an assertion. Speaking without drawing might mean you were still searching. And drawing without speaking might mean you had found something.
There was meaning in what was said and unsaid, in how you drew and when you didn't. In the end, the image that resulted mattered less than the context that had built up around it. I wanted to build something that captured that familiar feeling: a place where clarity could be arrived at, and fuzzy ideas reconciled, through dialogue with a tool.
Why not just keep prompting?
Yes, you can absolutely prompt back and forth with an image model and eventually get where you want. But the perfect fidelity of that first image isn't always helpful, and it locks in a direction. You end up chipping at the edges, fighting tooth and nail to move diagonally.
Dialogue does something different. It doesn't produce an image, it produces context. The conversation does two things at once. It surfaces my intent by pressing me to articulate what I'm after, and it gives the model a real-time picture of what I've tried, what I've ruled out, and why. When image generation finally happens, it happens on top of that context, not in place of it. The image is a separate step. The conversation is the work.
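As a concrete sketch of what "context" means here, the exchange can be modelled as a small object that travels with every generation request. This is a minimal TypeScript illustration; all field and function names are my own assumptions, not the project's actual schema:

```typescript
// Illustrative shape of the context a dialogue accumulates before any image
// exists. The point: generation consumes the whole exchange, not just the
// last utterance.
interface DialogueContext {
  transcript: string[];   // everything said so far, in order
  attempted: string[];    // directions already tried on the canvas
  ruledOut: string[];     // directions explicitly rejected, and why
  refinedPrompt: string;  // the current distilled statement of intent
}

// Compose the final prompt from the accumulated context rather than
// from the most recent utterance alone.
function composePrompt(ctx: DialogueContext): string {
  return [
    ctx.refinedPrompt,
    ctx.ruledOut.length > 0 ? `Avoid: ${ctx.ruledOut.join("; ")}` : "",
  ]
    .filter(Boolean)
    .join("\n");
}
```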
A better way
An interface that asks questions as you go helps the tool develop its own understanding, while the feedback helps you develop your own idea. Making images this way feels right to me. Dialogue creates surface area for thinking. That accumulated context is what pushes the model away from statistically likely outputs and toward something that is actually mine.
An image made in dialogue carries the trace of that exchange. It holds rich context the final image alone can't. The final image doesn't arrive first. It arrives last, carrying the history of its making.
The System
Four pipelines run concurrently. The Drawing pipeline captures pen input through tldraw, creating vector shapes. The Voice pipeline handles continuous listening through ElevenLabs, accumulating transcripts and cleaning them into refined prompts. The Render pipeline captures the canvas as a snapshot every ~100ms and passes it through FAL.ai's Latent Consistency Model for real-time image generation. The Transform pipeline uses Claude Vision to semantically identify shapes on the canvas and execute voice-commanded manipulations.
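A minimal sketch of that render loop, assuming tldraw's exportToBlob helper and fal.ai's JavaScript client; the model id, option names, and response shape are assumptions to check against the real schemas:

```typescript
import { fal } from "@fal-ai/client";
import { Editor, exportToBlob } from "tldraw";

const SNAPSHOT_INTERVAL_MS = 100;
let inFlight = false;

export function startRenderPipeline(
  editor: Editor,
  getPrompt: () => string,         // latest cleaned transcript from the voice pipeline
  onImage: (url?: string) => void  // hypothetical UI hook for the generated frame
) {
  setInterval(async () => {
    if (inFlight) return; // skip a tick rather than queue overlapping requests
    const ids = [...editor.getCurrentPageShapeIds()];
    if (ids.length === 0) return; // nothing drawn yet

    inFlight = true;
    try {
      // Rasterise the current vector drawing to a PNG snapshot.
      const blob = await exportToBlob({ editor, ids, format: "png" });

      // Hand the snapshot and prompt to the image model. "fal-ai/lcm" is a
      // placeholder model id; input fields vary per model.
      const result: any = await fal.run("fal-ai/lcm", {
        input: {
          prompt: getPrompt(),
          image_url: await blobToDataUrl(blob),
          strength: 0.6, // how strongly the drawing constrains the output
        },
      });
      onImage(result?.data?.images?.[0]?.url);
    } finally {
      inFlight = false;
    }
  }, SNAPSHOT_INTERVAL_MS);
}

function blobToDataUrl(blob: Blob): Promise<string> {
  return new Promise((resolve) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result as string);
    reader.readAsDataURL(blob);
  });
}
```

The in-flight guard matters at this cadence: a ~100ms interval will outpace generation, so dropping ticks keeps the preview current instead of building a backlog.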
When you say "move the car to the left," the system runs Claude Vision to identify which of the shapes on the canvas constitute "the car," groups them through hybrid proximity detection, calculates the transformation, and executes it. The voice agent replaces buttons entirely: continuous listening, guided drawing when objects are missing, confirmation before transforms.
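The transform step might look something like this sketch, using Anthropic's messages API for the vision call. The prompt format, model alias, and the applyTranslation hook are my assumptions, not the project's actual code:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Hypothetical editor hook that moves the matched shapes on the canvas.
declare function applyTranslation(shapeIds: string[], dx: number, dy: number): void;

// Given a canvas snapshot and a command like "move the car to the left",
// ask a vision model which shapes are meant, then apply its proposed move.
async function handleVoiceTransform(
  command: string,
  snapshotBase64: string, // base64-encoded PNG of the current canvas
  shapes: { id: string; x: number; y: number; type: string }[]
) {
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // assumption: any vision-capable model works
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/png", data: snapshotBase64 },
          },
          {
            type: "text",
            text:
              `Shapes on the canvas: ${JSON.stringify(shapes)}\n` +
              `The user said: "${command}"\n` +
              `Reply with JSON only: {"shapeIds": string[], "dx": number, "dy": number}`,
          },
        ],
      },
    ],
  });

  // Parse the model's answer and apply the translation to the matched shapes.
  const block = response.content[0];
  const { shapeIds = [], dx = 0, dy = 0 } =
    JSON.parse(block.type === "text" ? block.text : "{}");
  applyTranslation(shapeIds, dx, dy);
}
```

Proximity grouping would sit between the model's answer and the transform: shapes whose bounding boxes fall within a distance threshold of a matched shape move as one object, which is how a car drawn as several strokes travels together.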






