
How I built a full AI-powered tool that turns rough prompts into finished video ads.
I’ve spent two decades creating content — from journalism to product videos. AI makes it possible to unlock entirely new creative workflows. At the intersection of vibe coding and increasingly realistic video generation, I wondered whether I could create a tool that takes a simple prompt and makes a 20-second ad spot.
The concept was straightforward: type in a rough product idea, and get back a fully produced 30-second commercial with AI-generated visuals, voiceover, and sound effects. Here's how I built it using the ElevenLabs TTS and SFX APIs, Google's Gemini, and Google's VEO 2 for video generation. At the time of creation, VEO 3 hadn’t been released.
The final version was created almost entirely with Anthropic’s impressive Claude Opus 4, albeit over a few days as I kept hitting the rate limit.
I chose Node.js with Express for the backend and React for the frontend. Node handles real-time updates as videos generate, while React’s component-based architecture makes the multi-step interface easy to manage and extend.
I've written code on and off since childhood — starting with a robot pen in primary school. But I’ve always been more of a product thinker than a full-time engineer. Tools like Claude Opus 4 changed that. With the right prompts, I could move fast, implement features correctly, and focus on product logic rather than boilerplate.
This isn’t about outsourcing creativity to AI — it’s about building smarter with the right tools.
Creating a commercial for a new product or service, even one that is only 20 seconds long, involves multiple complex steps, so I broke the process down into eight distinct phases.
Each step builds on the previous one, creating a pipeline that transforms a simple idea into a complete commercial. At each stage the human has full control to change any element or regenerate any piece of text, video or audio.
The first challenge was that most people don't start with fully-formed product ideas. They might type something vague like "something for productivity." That's where Gemini comes in.
I used Google's Gemini 2.0 Flash model to enhance rough ideas into concrete product concepts. The prompt engineering here was crucial – I needed Gemini to be specific and concrete, not vague and generic. Instead of accepting "something for fitness," the system transforms it into something like "FitPulse AI: A smart wristband that uses advanced biometrics to create personalized micro-workouts throughout your day."
1 | """Enhance a product idea using Gemini""" |
2 | |
3 | prompt = f""" |
4 | Enhance this product idea to make it more compelling: |
5 | |
6 | Original idea: {idea} |
7 | Target mood: {mood} |
8 | Target audience: {audience} |
9 | |
10 | Make it: |
11 | 1. Clear and specific about the value proposition |
12 | 2. Appeal to {audience} |
13 | 3. Match the {mood.lower()} tone |
14 | 4. Be memorable and marketable |
15 | |
16 | Keep it to 2-3 sentences. |
17 | """ |
Next came script generation. Again using Gemini, I structured the output as four 5-second scenes, each with three components: the voiceover line, a visual prompt for video generation, and a sound effects prompt.
The key was making Gemini understand mood and audience. A "quirky" commercial for millennials needs different language than a "professional" one for enterprise customers.
I spent considerable time refining the prompts to avoid generic AI-speak and create scripts that felt tailored to each product.
1 | """Generate a 4-scene commercial script""" |
2 | |
3 | prompt = f""" |
4 | Create a 30-second commercial script with exactly 4 scenes. |
5 | |
6 | Product: {product_name} |
7 | Audience: {target_audience} |
8 | Key Message: {key_message} |
9 | Mood: {mood} |
10 | |
11 | Return a JSON array with 4 scenes, each with: |
12 | - number: 1-4 |
13 | - duration: 5 |
14 | - script: What the voiceover says |
15 | - videoPrompt: Visual description for video generation |
16 | - sfxPrompt: Sound effects description |
17 | |
18 | Example format: |
19 | [{{"number": 1, "duration": 5, "script": "...", "videoPrompt": "...", "sfxPrompt": "..."}}] |
20 | """ |
21 |
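Because the prompt asks for a JSON array, the reply needs a little cleanup before the rest of the pipeline can use it; Gemini will sometimes wrap JSON in markdown fences. A minimal parsing sketch (the helper name and checks are illustrative, not the original code):

import json

def parse_scenes(gemini_text: str) -> list:
    """Parse the model's reply into a list of four scene dicts."""
    cleaned = gemini_text.strip()
    if cleaned.startswith("```"):
        # Strip markdown fences and an optional "json" language tag
        cleaned = cleaned.strip("`").strip()
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:]
    scenes = json.loads(cleaned)
    if len(scenes) != 4:
        raise ValueError(f"expected 4 scenes, got {len(scenes)}")
    return scenes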
I used FAL.ai’s hosted API for Google’s VEO 2 model. Each scene's video prompt gets sent to FAL.ai, which returns a 5-second video clip. This was one of the trickier integrations – handling long generation times, managing API limits, and providing feedback to users while they wait.
I had originally planned to use Google AI Studio or Vertex AI for the Veo 2 API, as this would have meant I was using the same API key as Gemini, but I couldn’t get Veo 2 to work on my account.
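Through FAL, generating one scene is conceptually a single call. Here's a rough sketch with FAL's Python client; the endpoint id, argument names, and result shape are assumptions to check against FAL's Veo 2 documentation, and my backend makes the equivalent request from Node.

import fal_client

# Illustrative sketch of one scene's video generation via FAL.ai.
# The endpoint id, argument names, and result shape are assumptions --
# verify them against FAL's current Veo 2 docs.
result = fal_client.subscribe(
    "fal-ai/veo2",
    arguments={
        "prompt": scene["videoPrompt"],  # the scene's visual description from Gemini
        "duration": "5s",
        "aspect_ratio": "16:9",
    },
)
video_url = result["video"]["url"]  # assumed result shape; download and cache locally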
I implemented a state management system that saves generated videos locally, so users don't have to regenerate expensive content if they navigate away and come back. When you're hitting rate limits on Claude, the last thing you want is to lose your generated videos because you refreshed the page.
The video content for a 20-second clip, assuming no recuts or regenerations, came to about $10.
Here's where I got creative with ElevenLabs' APIs. While ElevenLabs is primarily known for voice generation, we also have a sound effects API that is very impressive. See the incredible Soundboard example for a sense of the potential use cases.
I used it to generate four variations of sound effects for each scene – upbeat, energetic, calm, and dramatic. Users can preview each option and select what fits their vision.
const response = await elevenLabs.soundGeneration({
  text: modifiedPrompt,
  duration_seconds: duration,
  prompt_influence: 0.3
});
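The modifiedPrompt above is simply each scene's sfxPrompt with a mood descriptor appended. Sketched here in Python for brevity, with descriptors that are illustrative rather than the exact ones I shipped:

# Illustrative: how one scene's sfxPrompt becomes four mood-specific prompts.
MOOD_DESCRIPTORS = {
    "upbeat": "bright, upbeat, positive energy",
    "energetic": "fast-paced, high-energy, driving rhythm",
    "calm": "soft, gentle, relaxed ambience",
    "dramatic": "cinematic, tense, impactful",
}

variations = {
    mood: f"{scene['sfxPrompt']}, {descriptor}"
    for mood, descriptor in MOOD_DESCRIPTORS.items()
}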
With four video clips and four sound effect tracks, I needed to combine them. This meant diving deep into FFmpeg, the Swiss Army knife of video processing. The backend runs FFmpeg commands to stitch the four clips together, layer in each scene's sound effects, and mix the voiceover over the combined track.
Getting FFmpeg commands right took significant debugging. Audio mixing, in particular, requires careful attention to levels and timing. I learned that background audio should be reduced to about 30% volume when mixed with voiceover – any higher and it competes for attention, any lower and it might as well not be there.
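To make that concrete, here's a rough sketch of the kind of FFmpeg invocation involved. Paths, filenames, and exact filter values are illustrative rather than the production command.

import subprocess

# [0:a] is the audio of the concatenated scene clips, [1:a] is the voiceover.
# The background is ducked to roughly 30% volume before the two tracks are mixed;
# exact levels needed plenty of tweaking in practice.
subprocess.run([
    "ffmpeg",
    "-i", "scenes_concatenated.mp4",   # four 5-second clips already joined
    "-i", "voiceover.mp3",             # narration track from ElevenLabs
    "-filter_complex",
    "[0:a]volume=0.3[bg];[bg][1:a]amix=inputs=2:duration=first[aout]",
    "-map", "0:v", "-map", "[aout]",
    "-c:v", "copy", "-c:a", "aac",
    "final_commercial.mp4",
], check=True)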
For the voiceover, I integrated ElevenLabs' text-to-speech API to offer users a selection of voices. The system generates a single coherent voiceover script from all scene scripts, then sends it to ElevenLabs with optimized voice settings:
const voiceSettings = {
  stability: 0.75,
  similarity_boost: 0.75,
  style: 0.0,
  use_speaker_boost: true
};
These settings provide a clear, professional narration that works well for commercials. After experimenting with different configurations, I found this balance delivers consistency without sounding robotic.
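Wiring those settings into the request itself is only a few lines. Here's a minimal sketch with the ElevenLabs Python SDK; the voice id, model id, and file handling are illustrative, and my backend does the equivalent from Node.

from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=ELEVENLABS_API_KEY)  # illustrative key handling

# full_script is the single coherent voiceover assembled from all four scene scripts.
audio = client.text_to_speech.convert(
    voice_id=selected_voice_id,          # whichever voice the user picked
    text=full_script,
    model_id="eleven_multilingual_v2",   # assumed model id
    voice_settings=VoiceSettings(
        stability=0.75,
        similarity_boost=0.75,
        style=0.0,
        use_speaker_boost=True,
    ),
)

with open("voiceover.mp3", "wb") as f:
    for chunk in audio:                  # the SDK streams the audio back in chunks
        f.write(chunk)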
Building with multiple AI APIs means dealing with various failure modes. Rate limits, timeout errors, malformed responses – they all happen. Especially when you're debugging at 2 AM and VEO 2 decides to return something unexpected.
I implemented comprehensive error handling with fallback options throughout the pipeline.
The goal was to ensure users could always complete their commercial, even if some AI services were having a bad day.
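One simple pattern carried a lot of the weight here: retrying flaky calls with exponential backoff before falling back to anything else. A sketch of the idea, not the exact production code:

import time

def with_retries(call, attempts=3, base_delay=2.0):
    """Retry a flaky API call with exponential backoff.

    Illustrative helper; the real app layers service-specific fallbacks
    on top of simple retries.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))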
Generating a commercial involves multiple AI API calls that can take several minutes. To improve the experience, I kept users informed with progress indicators and fast feedback at every step of the generation.
I also implemented a state persistence system. If someone closes their browser mid-generation, they can return and pick up where they left off. This wasn't in my original plan, but after losing my own progress a few times during testing, it became a priority.
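The persistence itself doesn't need to be fancy. A sketch of the idea, with a file layout and field names that are purely illustrative:

import json
from pathlib import Path

def state_path(project_id: str) -> Path:
    # Illustrative layout: one JSON file per project
    return Path("projects") / f"{project_id}.json"

def save_state(project_id: str, state: dict) -> None:
    """Persist the current project (scripts, asset paths, current step) to disk."""
    path = state_path(project_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2))

def load_state(project_id: str) -> dict:
    """Restore a project so the user can pick up where they left off."""
    path = state_path(project_id)
    return json.loads(path.read_text()) if path.exists() else {}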
Building this tool surfaced three key lessons.
First, prompt design is critical. The quality of output from any AI model depends heavily on how you frame the input. I spent as much time refining prompts as writing code.
Second, user experience beats technical complexity. Users don’t care how many AI services are involved — they care that the tool works. Progress indicators, error handling, and fast feedback loops make all the difference.
Third, AI assistants like Claude accelerate development. I focused on product logic while offloading boilerplate and syntax to the model. It’s not about skipping steps — it’s about building smarter.
What began as a weekend project turned into a real, extensible tool. Marketing teams could use it for prototyping, startups for pitch videos, and creators for sponsored content.
The system is flexible by design. You can change video styles by adjusting VEO 2 prompts, modify scene lengths for different formats, or add music via FFmpeg.
The real opportunity lies in orchestrating multiple AI systems. No single model can generate a full commercial — but combined, Gemini, VEO 2, and ElevenLabs can produce something far more powerful than any one of them alone.
This isn’t about AI replacing creators. It’s about giving creators better tools. After 20 years in content, I’ve seen a lot of change — but this shift feels foundational.
If you want to explore how ElevenLabs technology can help deliver new approaches to content and media, get in touch with our sales team.