Vibe Draw combines ElevenLabs’ voice AI with FLUX Kontext for voice-powered image creation.
Voice interfaces are changing how we communicate with AI. What if creating an image was as easy as describing it out loud?
That’s the idea behind Vibe Draw, a weekend project: a voice-first creative tool that pairs ElevenLabs’ Voice AI with Black Forest Labs’ FLUX Kontext to turn spoken prompts into images.
FLUX Kontext represents a new class of image model. Unlike traditional text-to-image systems, Kontext handles both generation and editing. It can create new images from prompts, modify existing ones, and even merge multiple reference images into a single output.
While models like GPT-4o and Gemini 2 Flash offer multimodal capabilities, FLUX Kontext is purpose-built for high-quality visual manipulation. In testing, I could change individual letters in stylized text or reposition an object — just by describing the change.
That’s when I thought: “Why not do this with voice?” And what better foundation than ElevenLabs’ powerful voice technology?
Building a voice-driven image system required solving five key problems:
Vibe Draw runs entirely client-side and integrates the following components:
This approach keeps the prototype lightweight, but production deployments should proxy requests server-side for security.
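To illustrate what that server-side proxy might build, here is a minimal sketch. The helper name and overall shape are my own assumptions; the endpoint path and `xi-api-key` header follow the public ElevenLabs text-to-speech API, and the point is simply that the key never reaches the browser:

```javascript
// Hypothetical helper a server-side proxy could use to forward a TTS
// request. The client sends only { text }; the API key stays server-side.
function buildTtsProxyRequest(clientBody, apiKey, voiceId) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    options: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey, // never shipped to the browser
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text: clientBody.text,
        model_id: "eleven_turbo_v2",
      }),
    },
  };
}
```

The server would pass `req.options` to `fetch(req.url, req.options)` and stream the audio back to the client.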
Vibe Draw uses ElevenLabs’ text-to-speech API, tuned for conversational responsiveness:
```javascript
const voiceSettings = {
  model_id: "eleven_turbo_v2",
  voice_settings: {
    stability: 0.5,
    similarity_boost: 0.75
  }
};
```
To create variety, voice responses are randomly selected from pre-defined templates:
```javascript
const responses = {
  generating: [
    "Ooh, I love that idea! Let me bring it to life...",
    "That sounds awesome! Creating it now...",
    "Great description! Working on it..."
  ],
  editing: [
    "Got it! Let me tweak that for you...",
    "Sure thing! Making those changes...",
    "No problem! Adjusting it now..."
  ]
};

function getRandomResponse(type) {
  const options = responses[type];
  return options[Math.floor(Math.random() * options.length)];
}
```
Overlapping voice responses break the illusion of conversation. Vibe Draw solves this with an audio queue system:
```javascript
let audioQueue = [];
let isPlayingAudio = false;

async function queueAudioResponse(text) {
  audioQueue.push(text);
  if (!isPlayingAudio) {
    playNextAudio();
  }
}
```
Each message plays fully before triggering the next.
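The article elides `playNextAudio`; a self-contained sketch of the drain loop could look like the following. Names and shape are assumptions, and `playFn` stands in for "synthesize with ElevenLabs, then play" — injecting it keeps the ordering logic testable outside a browser:

```javascript
// Sketch of the sequential audio queue (assumed shape, not the original code).
const audioQueue = [];
let isPlayingAudio = false;

async function queueAudioResponse(text, playFn) {
  audioQueue.push(text);
  if (!isPlayingAudio) {
    await playNextAudio(playFn);
  }
}

async function playNextAudio(playFn) {
  isPlayingAudio = true;
  while (audioQueue.length > 0) {
    // Each message finishes playing before the next one starts.
    await playFn(audioQueue.shift());
  }
  isPlayingAudio = false;
}
```

Because the flag is set before the first `await`, messages queued mid-playback are appended rather than spawning a second, overlapping drain loop.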
The system uses keyword and context detection to decide whether a user prompt is a new image request or an edit:
```javascript
const editKeywords = [ ... ];
const contextualEditPhrases = [ ... ];

if (currentImage && (hasEditKeyword || hasContextClue)) {
  await handleEditRequest(text);
} else {
  await handleGenerateRequest(text);
}
```
This ensures edits are applied only when an image already exists and the context makes the intent clear.
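A concrete version of that check might look like the sketch below. The keyword and phrase lists here are illustrative placeholders, not the ones Vibe Draw actually ships:

```javascript
// Hypothetical intent classifier; the word lists are examples only.
const editKeywords = ["change", "make it", "add", "remove", "turn"];
const contextualEditPhrases = ["the background", "the color of", "instead of"];

function classifyPrompt(text, hasCurrentImage) {
  const lower = text.toLowerCase();
  const hasEditKeyword = editKeywords.some(k => lower.includes(k));
  const hasContextClue = contextualEditPhrases.some(p => lower.includes(p));
  // Edits only make sense when an image is already on the canvas.
  return hasCurrentImage && (hasEditKeyword || hasContextClue)
    ? "edit"
    : "generate";
}
```

Naive substring matching like this is fragile; a production version would want word-boundary matching or a small language-model classifier.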
Kontext supports two modes: generation and editing. For generation, a text prompt is enough:

```javascript
const response = await fetch('https://fal.run/fal-ai/flux-pro/kontext/text-to-image', {
  ...
  body: JSON.stringify({
    prompt: enhancedPrompt,
    guidance_scale: 3.5,
    num_images: 1,
    safety_tolerance: "2",
    output_format: "jpeg"
  })
});
```
For editing, the current image is passed alongside the instruction:

```javascript
const response = await fetch('https://fal.run/fal-ai/flux-pro/kontext', {
  ...
  body: JSON.stringify({
    prompt: instruction,
    image_url: currentImage,
    guidance_scale: 3.5,
    num_images: 1
  })
});
```
Some prompts imply changes that exceed the editing API’s limits. When detected, the system offers a fallback:
```javascript
if (hasSignificantChange) {
  try {
    const enhanced = instruction + ", maintain composition but apply requested changes";
    await editImage(enhanced);
  } catch {
    queueAudioResponse("That's quite a transformation! Would you like me to create a fresh image instead?");
  }
}
```
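How `hasSignificantChange` gets computed isn't shown; one plausible heuristic is to scan for phrases that imply a full regeneration rather than a tweak. The phrase list below is hypothetical:

```javascript
// Hypothetical heuristic for "this edit is really a regeneration".
const overhaulPhrases = [
  "completely different",
  "start over",
  "totally new",
  "from scratch"
];

function detectSignificantChange(instruction) {
  const lower = instruction.toLowerCase();
  return overhaulPhrases.some(p => lower.includes(p));
}
```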
UI feedback helps users track the system’s state:
```javascript
function updateUI(state) {
  switch (state) {
    case 'listening': ...
    case 'processing': ...
    case 'generating': ...
    case 'ready': ...
  }
}
```
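The elided cases presumably map each state to visible feedback. One way to sketch that mapping (labels and the mic-toggle idea are assumptions; the real `updateUI` likely also toggles CSS classes and animations):

```javascript
// Hypothetical state-to-feedback mapping.
function uiHints(state) {
  switch (state) {
    case "listening":  return { status: "Listening...",           micEnabled: true };
    case "processing": return { status: "Thinking...",            micEnabled: false };
    case "generating": return { status: "Creating your image...", micEnabled: false };
    case "ready":      return { status: "Say something to begin", micEnabled: true };
    default:           return { status: "",                       micEnabled: false };
  }
}
```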
Natural conversation requires natural timing:
```javascript
if (Math.random() > 0.7) {
  setTimeout(() => {
    queueAudioResponse("Want me to change anything about it?");
  }, 3000);
}
```
To preserve context, session data is stored:
```javascript
const saveState = () => { ... };
const restoreState = () => { ... };
```
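Those stubs might flesh out as follows. The stored fields are assumptions, and `storage` is injected so the logic runs outside a browser; in the app it would simply be `localStorage` or `sessionStorage`:

```javascript
// Sketch of session persistence (assumed fields, not the original code).
const SESSION_KEY = "vibeDrawSession";

function saveState(storage, state) {
  storage.setItem(SESSION_KEY, JSON.stringify({
    currentImage: state.currentImage,
    history: state.history,
  }));
}

function restoreState(storage) {
  const raw = storage.getItem(SESSION_KEY);
  return raw ? JSON.parse(raw) : { currentImage: null, history: [] };
}
```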
To ensure responsiveness:
Conversational UIs open the door to new capabilities:
Building Vibe Draw revealed several core principles for voice-first tools:
Vibe Draw shows what happens when conversational voice AI meets visual creativity. ElevenLabs’ natural speech synthesis and FLUX Kontext’s image APIs combine into a new way to create: no clicks, no sliders, just speech.
When creating is as easy as describing, we remove the barriers between imagination and execution.
The complete source code is available on GitHub. To run your own version: