Building Vibe Draw: combining ElevenLabs with FLUX Kontext for voice-powered image creation

Vibe Draw combines ElevenLabs' voice AI with FLUX Kontext for voice-powered image creation.


Voice interfaces are changing how we communicate with AI. What if creating an image was as easy as describing it out loud?

That’s the idea that led me to build Vibe Draw as a weekend project: a voice-first creative tool that pairs ElevenLabs’ Voice AI with Black Forest Labs’ FLUX Kontext to turn spoken prompts into images.

FLUX Kontext represents a new class of image model. Unlike traditional text-to-image systems, Kontext handles both generation and editing. It can create new images from prompts, modify existing ones, and even merge multiple reference images into a single output.

While models like GPT-4o and Gemini 2 Flash offer multimodal capabilities, FLUX Kontext is purpose-built for high-quality visual manipulation. In testing, I could change individual letters in stylized text or reposition an object — just by describing the change.

That’s when I thought: “Why not do this with voice?” And what better foundation than ElevenLabs’ powerful voice technology?


The technical challenge

Building a voice-driven image system required solving five key problems:

  1. Natural language understanding — Differentiating between new creation and edits
  2. Contextual awareness — Maintaining continuity across interactions
  3. Audio management — Avoiding overlapping responses and managing queues
  4. Visual generation — Seamless transitions between generation and editing
  5. User experience — Making advanced AI interactions feel intuitive

Architecture overview

Vibe Draw runs entirely client-side and integrates the following components:

  • Web Speech API for speech recognition
  • ElevenLabs TTS API for voice responses
  • FLUX Kontext API for image generation and editing
  • Custom intent detection for understanding user input

This approach keeps the prototype lightweight, but production deployments should proxy requests server-side for security.
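As a sketch of that server-side pattern: the helper below builds the upstream ElevenLabs TTS request on the server so the API key never reaches the browser. The route shown in the comment and the variable names are illustrative, not part of the Vibe Draw code:

```javascript
// Builds the upstream ElevenLabs text-to-speech request on the server.
// The key stays server-side; the browser only ever talks to our own endpoint.
function buildTTSRequest(text, voiceId, apiKey) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    options: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey, // never shipped to the client
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        text,
        model_id: "eleven_turbo_v2",
        voice_settings: { stability: 0.5, similarity_boost: 0.75 }
      })
    }
  };
}

// A hypothetical Express handler would simply forward the result:
// app.post("/api/tts", async (req, res) => {
//   const { url, options } = buildTTSRequest(req.body.text, VOICE_ID, process.env.ELEVENLABS_API_KEY);
//   const upstream = await fetch(url, options);
//   res.set("Content-Type", "audio/mpeg").send(Buffer.from(await upstream.arrayBuffer()));
// });
```

Separating request construction from the route handler also makes the security-sensitive part trivially testable.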

Implementing voice with ElevenLabs

Vibe Draw uses ElevenLabs’ text-to-speech API, tuned for conversational responsiveness:

const voiceSettings = {
  model_id: "eleven_turbo_v2",
  voice_settings: {
    stability: 0.5,
    similarity_boost: 0.75
  }
};

To create variety, voice responses are randomly selected from pre-defined templates:

const responses = {
  generating: [
    "Ooh, I love that idea! Let me bring it to life...",
    "That sounds awesome! Creating it now...",
    "Great description! Working on it..."
  ],
  editing: [
    "Got it! Let me tweak that for you...",
    "Sure thing! Making those changes...",
    "No problem! Adjusting it now..."
  ]
};

function getRandomResponse(type) {
  const options = responses[type];
  return options[Math.floor(Math.random() * options.length)];
}

Managing audio playback

Overlapping voice responses break the illusion of conversation. Vibe Draw solves this with an audio queue system:

let audioQueue = [];
let isPlayingAudio = false;

async function queueAudioResponse(text) {
  audioQueue.push(text);
  if (!isPlayingAudio) {
    playNextAudio();
  }
}

Each message plays fully before triggering the next.
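The drain loop behind that behavior can be sketched as a small factory. Here `playFn` stands in for real playback (e.g. resolving on an `Audio` element's `ended` event), which keeps the sequencing logic independent of the browser:

```javascript
// Sequential audio queue: each item finishes before the next starts.
// `playFn(text)` must return a promise that resolves when playback ends.
function createAudioQueue(playFn) {
  const queue = [];
  let playing = false;

  async function drain() {
    playing = true;
    while (queue.length > 0) {
      const text = queue.shift();
      await playFn(text); // wait for this clip to finish before the next
    }
    playing = false;
  }

  return {
    enqueue(text) {
      queue.push(text);
      if (!playing) drain(); // fire-and-forget; drain serializes playback
    }
  };
}
```

This is a sketch rather than the exact Vibe Draw implementation, but it captures the invariant the article describes: one voice response at a time, in arrival order.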

Intent detection and context management

The system uses keyword and context detection to decide whether a user prompt is a new image request or an edit:

const editKeywords = [ ... ];
const contextualEditPhrases = [ ... ];

if (currentImage && (hasEditKeyword || hasContextClue)) {
  await handleEditRequest(text);
} else {
  await handleGenerateRequest(text);
}

This approach ensures edits are only applied when an image already exists and the wording makes the intent clear; anything else falls through to fresh generation.
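The article elides the actual keyword lists, so the sketch below fills them with illustrative examples (the specific words are assumptions, not Vibe Draw's real lists). Word-boundary matching avoids false hits like finding "it" inside "white":

```javascript
// Illustrative intent check; the real keyword lists are elided in the
// write-up, so these entries are examples only.
const editKeywords = ["change", "make it", "add", "remove", "turn the"];
const contextualEditPhrases = ["it", "that", "the image"];

function matchesAny(text, phrases) {
  // \b boundaries so "it" doesn't match inside "white" or "with"
  return phrases.some(p => new RegExp(`\\b${p}\\b`).test(text));
}

function isEditRequest(text, currentImage) {
  if (!currentImage) return false; // nothing to edit yet
  const lower = text.toLowerCase();
  return matchesAny(lower, editKeywords) || matchesAny(lower, contextualEditPhrases);
}
```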

Image generation with FLUX Kontext

Image generated with Flux

Kontext supports two modes: generation and editing.

Generation (text to image)

const response = await fetch('https://fal.run/fal-ai/flux-pro/kontext/text-to-image', {
  ...
  body: JSON.stringify({
    prompt: enhancedPrompt,
    guidance_scale: 3.5,
    num_images: 1,
    safety_tolerance: "2",
    output_format: "jpeg"
  })
});

Editing (contextual transformation)

const response = await fetch('https://fal.run/fal-ai/flux-pro/kontext', {
  ...
  body: JSON.stringify({
    prompt: instruction,
    image_url: currentImage,
    guidance_scale: 3.5,
    num_images: 1
  })
});

Handling complex transformations

Some prompts imply changes that exceed the editing API’s limits. When detected, the system offers a fallback:

if (hasSignificantChange) {
  try {
    const enhanced = instruction + ", maintain composition but apply requested changes";
    await editImage(enhanced);
  } catch {
    queueAudioResponse("That's quite a transformation! Would you like me to create a fresh image instead?");
  }
}
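How `hasSignificantChange` is computed isn't shown in the write-up. One hypothetical heuristic is a simple marker scan over the prompt; the marker list here is an assumption for illustration:

```javascript
// Hypothetical heuristic (not Vibe Draw's actual logic): flag prompts whose
// wording suggests a wholesale transformation rather than a local edit.
const bigChangeMarkers = ["completely", "entirely", "totally different", "start over", "everything"];

function detectSignificantChange(instruction) {
  const lower = instruction.toLowerCase();
  return bigChangeMarkers.some(m => lower.includes(m));
}
```

A production system might instead compare prompt embeddings against the current image description, but a keyword scan is enough to trigger the fallback dialogue above.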

Optimizing the experience

Progressive feedback

UI feedback helps users track the system’s state:

function updateUI(state) {
  switch (state) {
    case 'listening': ...
    case 'processing': ...
    case 'generating': ...
    case 'ready': ...
  }
}

Intelligent timing

Natural conversation requires natural timing:

if (Math.random() > 0.7) {
  setTimeout(() => {
    queueAudioResponse("Want me to change anything about it?");
  }, 3000);
}

Session state

To preserve context, session data is stored:

const saveState = () => { ... };
const restoreState = () => { ... };
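A minimal sketch of those two helpers, with the storage backend injected so the same logic works against `window.localStorage` in the browser or a stub in tests (the key name is illustrative):

```javascript
// Session persistence sketch; key name is illustrative, and `storage` is
// any object with localStorage-style getItem/setItem.
const STATE_KEY = "vibe-draw-session";

function saveState(storage, state) {
  storage.setItem(STATE_KEY, JSON.stringify(state));
}

function restoreState(storage) {
  const raw = storage.getItem(STATE_KEY);
  return raw ? JSON.parse(raw) : null; // null when no session exists yet
}
```

In the browser this would be called as `saveState(window.localStorage, { currentImage, history })` whenever an image is generated or edited.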

Performance considerations

To ensure responsiveness:

  • Lazy loading — Only initialize APIs when needed
  • Debouncing — Limit API requests per interaction
  • Error handling — Recover gracefully from timeouts or failures
  • Resource cleanup — Dispose of audio objects and event listeners properly
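As an example of the debouncing point: a small trailing-edge debounce collapses rapid triggers (such as interim speech-recognition results) into a single API call after a quiet interval. This is a generic sketch, not code from the project:

```javascript
// Trailing-edge debounce: only the last call within `wait` ms actually runs.
function debounce(fn, wait) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);                       // cancel any pending call
    timer = setTimeout(() => fn(...args), wait); // reschedule with latest args
  };
}

// e.g. const requestImage = debounce(handleGenerateRequest, 500);
```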

What’s next

Conversational UIs open the door to new capabilities:

  • Multi-modal input — “Make it look more like this photo.”
  • Collaborative sessions — Multiple users contributing to a single design
  • Style memory — System learns your aesthetic over time
  • Real-time streaming — Stream image updates as the user speaks and integrate Conversational AI to allow for streamed speech.

Key takeaways

Building Vibe Draw revealed several core principles for voice-first tools:

  1. Context is everything — Tracking state makes interactions feel coherent
  2. Timing adds personality — Pacing responses makes AI feel responsive
  3. Fallbacks maintain momentum — When generation fails, offer alternatives
  4. Variety keeps it fresh — Repeating the same phrase breaks immersion

Conclusion

Vibe Draw shows what happens when conversational voice AI meets visual creativity. ElevenLabs’ natural speech synthesis and FLUX Kontext’s image APIs combine into a new way to make images: no clicks, no sliders, just speech.

When creating is as easy as describing, we remove the barriers between imagination and execution.

Try it yourself

The complete source code is available on GitHub. To run your own version:

  1. Clone the repository
  2. Add your ElevenLabs API key
  3. Add your FAL.ai API key
  4. Open vibe-draw-v2.html in a modern browser
  5. Click the microphone and start creating

    Interested in building your own voice-first experience? Explore ElevenLabs Conversational AI or contact us.
