How I built a text-to-commercial generator using ElevenLabs, Gemini, and VEO 2

How I built a full AI-powered tool that turns rough prompts into finished video ads.


I’ve spent two decades creating content — from journalism to product videos. AI makes it possible to unlock entirely new creative workflows. At the intersection of vibe coding and increasingly realistic video generation, I wondered whether I could build a tool that takes a simple prompt and makes a 20-second ad spot.

The concept was straightforward: type in a rough product idea, and get back a fully produced 20-second commercial with AI-generated visuals, voiceover, and sound effects. Here's how I built it using the ElevenLabs TTS and SFX APIs, Google's Gemini, and Google's VEO 2 for video generation. At the time of creation, VEO 3 hadn’t been released.

The final version was created almost completely with Anthropic’s impressive Claude 4 Opus, albeit over a few days as I kept hitting the rate limit.

Stack selection: Node.js, Express, React, and Claude 4 Opus

A commercial for "finding places to eat lunch in a park"

I chose Node.js with Express for the backend and React for the frontend. Node handles real-time updates as videos generate, while React’s component-based architecture makes the multi-step interface easy to manage and extend.

I've written code on and off since childhood — starting with a robot pen in primary school. But I’ve always been more of a product thinker than a full-time engineer. Tools like Claude 4 Opus changed that. With the right prompts, I could move fast, implement features correctly, and focus on product logic rather than boilerplate. 

This isn’t about outsourcing creativity to AI — it’s about building smarter with the right tools.

Eight-step wizard: From prompt to finished ad


Creating a commercial for a new product or service, even one that is only 20 seconds long, involves multiple complex steps, so I broke the process down into eight distinct phases:

  1. Product Information
  2. Script Generation
  3. Video Creation
  4. Sound Effects
  5. Video Assembly
  6. Voice Over
  7. Final Video
  8. Social Posts

Each step builds on the previous one, creating a pipeline that transforms a simple idea into a complete commercial. At each stage the human has full control to change any element or regenerate any piece of text, video, or audio.

A commercial for "Epoch" matching

Refining ideas with Gemini Flash

The first challenge was that most people don't start with fully formed product ideas. They might type something vague like "something for productivity." That's where Gemini comes in.

I used Google's Gemini 2.0 Flash model to enhance rough ideas into concrete product concepts. The prompt engineering here was crucial – I needed Gemini to be specific and concrete, not vague and generic. Instead of accepting "something for fitness," the system transforms it into something like "FitPulse AI: A smart wristband that uses advanced biometrics to create personalized micro-workouts throughout your day."

1"""Enhance a product idea using Gemini"""
2
3 prompt = f"""
4 Enhance this product idea to make it more compelling:
5
6 Original idea: {idea}
7 Target mood: {mood}
8 Target audience: {audience}
9
10 Make it:
11 1. Clear and specific about the value proposition
12 2. Appeal to {audience}
13 3. Match the {mood.lower()} tone
14 4. Be memorable and marketable
15
16 Keep it to 2-3 sentences.
17 """

Generating non-generic scripts with Gemini

Next came script generation. Again using Gemini, I structured the output as four 5-second scenes, each with three components:

  • The voiceover script
  • A video generation prompt
  • A sound effects description

The key was making Gemini understand mood and audience. A "quirky" commercial for millennials needs different language than a "professional" one for enterprise customers.

I spent considerable time refining the prompts to avoid generic AI-speak and create scripts that felt tailored to each product.

1 """Generate a 4-scene commercial script"""
2
3 prompt = f"""
4 Create a 30-second commercial script with exactly 4 scenes.
5
6 Product: {product_name}
7 Audience: {target_audience}
8 Key Message: {key_message}
9 Mood: {mood}
10
11 Return a JSON array with 4 scenes, each with:
12 - number: 1-4
13 - duration: 5
14 - script: What the voiceover says
15 - videoPrompt: Visual description for video generation
16 - sfxPrompt: Sound effects description
17
18 Example format:
19 [{{"number": 1, "duration": 5, "script": "...", "videoPrompt": "...", "sfxPrompt": "..."}}]
20 """
21

Creating 5-second scenes with VEO 2

I used FAL.ai’s hosted API for Google’s VEO 2 model. Each scene's video prompt gets sent to FAL.ai, which returns a 5-second video clip. This was one of the trickier integrations – handling long generation times, managing API limits, and providing feedback to users while they wait.
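Below is a rough sketch of the per-scene call using FAL's JavaScript client. The endpoint ID and input fields follow FAL's published VEO 2 endpoint, and the scene object and updateProgress helper are illustrative rather than lifted from the actual codebase:

import { fal } from "@fal-ai/client";

// Sketch only: queue a 5-second generation for one scene
const result = await fal.subscribe("fal-ai/veo2", {
  input: {
    prompt: scene.videoPrompt, // the per-scene prompt from the script step
    duration: "5s",
    aspect_ratio: "16:9"
  },
  // Generations queue server-side; this callback drives the progress UI
  onQueueUpdate: (update) => updateProgress(scene.number, update.status)
});

const videoUrl = result.data.video.url; // saved locally so it isn't lost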

I had originally planned to use Google AI Studio or Vertex AI for the VEO 2 API, as this would have meant I was using the same API key as Gemini, but I couldn’t get VEO 2 to work on my account.

I implemented a state management system that saves generated videos locally, so users don't have to regenerate expensive content if they navigate away and come back. When you're hitting rate limits on Claude, the last thing you want is to lose your generated videos because you refreshed the page. 

The video content for a 20-second clip, assuming no recuts or regenerations, came to about $10.

Using ElevenLabs for sound effects and voiceover

Here's where I got creative with ElevenLabs' APIs. While ElevenLabs is primarily known for voice generation, we also have a very capable sound effects API. See the Soundboard example for a demonstration of potential use cases.

I used it to generate four variations of sound effects for each scene – upbeat, energetic, calm, and dramatic. Users can preview each option and select what fits their vision.

// Generate one sound-effect variation for a scene
const response = await elevenLabs.soundGeneration({
  text: modifiedPrompt,        // the scene's SFX prompt, adjusted per mood
  duration_seconds: duration,  // matches the 5-second scene length
  prompt_influence: 0.3        // lower influence leaves room for variation
});
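To produce the four variations, that same call can be fanned out across mood-adjusted prompts. A simplified sketch of the idea, reusing the soundGeneration wrapper above:

// Sketch: fan one scene's SFX prompt out across the four moods
const moods = ["upbeat", "energetic", "calm", "dramatic"];
const variations = await Promise.all(
  moods.map((mood) =>
    elevenLabs.soundGeneration({
      text: `${mood} ${scene.sfxPrompt}`,
      duration_seconds: scene.duration,
      prompt_influence: 0.3
    })
  )
);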

Assembling final videos with FFmpeg

With four video clips and four sound effect tracks, I needed to combine them. This meant diving deep into FFmpeg, the Swiss Army knife of video processing. The backend runs FFmpeg commands to:

  1. Mix sound effects with each video clip
  2. Concatenate all clips into one video
  3. Add the voiceover track to the final video

Getting FFmpeg commands right took significant debugging. Audio mixing, in particular, requires careful attention to levels and timing. I learned that background audio should be reduced to about 30% volume when mixed with voiceover – any higher and it competes for attention, any lower and it might as well not be there.
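As an illustration, the voiceover mix in step 3 boils down to a filtergraph along these lines (file names are placeholders, and the project's exact command differs):

# Duck the background track to 30% volume and mix in the voiceover
ffmpeg -i combined.mp4 -i voiceover.mp3 \
  -filter_complex "[0:a]volume=0.3[bg];[bg][1:a]amix=inputs=2:duration=first[a]" \
  -map 0:v -map "[a]" -c:v copy final.mp4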

Voiceover: Where ElevenLabs truly shines

For the voiceover, I integrated ElevenLabs' text-to-speech API to offer users a selection of voices. The system generates a single coherent voiceover script from all scene scripts, then sends it to ElevenLabs with optimized voice settings:

// Voice settings tuned for commercial narration
const voiceSettings = {
  stability: 0.75,         // keeps delivery consistent across the read
  similarity_boost: 0.75,  // stays close to the selected voice's character
  style: 0.0,              // no added stylistic exaggeration
  use_speaker_boost: true  // boosts clarity and presence
};

These settings provide a clear, professional narration that works well for commercials. After experimenting with different configurations, I found this balance delivers consistency without sounding robotic.
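For reference, here is roughly how those settings travel with the combined script. This sketch assumes the ElevenLabs Node SDK; the voice ID, script variable, and model choice are placeholders:

// Sketch: convert the stitched voiceover script to speech
const audio = await client.textToSpeech.convert(voiceId, {
  text: fullVoiceoverScript,          // one script assembled from all 4 scenes
  model_id: "eleven_multilingual_v2", // assumed model choice
  voice_settings: voiceSettings
});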

Resilient error handling and user experience

Building with multiple AI APIs means dealing with various failure modes. Rate limits, timeout errors, malformed responses – they all happen. Especially when you're debugging at 2 AM and VEO 2 decides to return something unexpected.

I implemented comprehensive error handling with fallback options:

  • If Gemini fails, the system provides intelligent fallback scripts
  • If video generation fails, placeholder videos are available
  • If sound generation fails, basic audio tracks are used

The goal was to ensure users could always complete their commercial, even if some AI services were having a bad day.
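The pattern itself is simple. Here's a stripped-down sketch of the wrapper idea, not the production code:

// Sketch: retry a generator a couple of times, then fall back to a safe default
async function withFallback(generate, fallback, retries = 2) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await generate();
    } catch (err) {
      console.warn(`Attempt ${attempt} failed: ${err.message}`);
    }
  }
  return fallback(); // e.g. a placeholder video or a canned script
}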

Performance considerations

A commercial for "Globetrotter Grocer"

Generating a commercial involves multiple AI API calls that can take several minutes. To improve the experience, I:

  • Process videos in parallel where possible
  • Show real-time progress indicators
  • Save expensive generated content locally
  • Allow users to regenerate individual components

I also implemented a state persistence system. If someone closes their browser mid-generation, they can return and pick up where they left off. This wasn't in my original plan, but after losing my own progress a few times during testing, it became a priority.
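On the client side, a little localStorage plumbing covers most of this. A minimal sketch, with a hypothetical storage key:

// Sketch: persist wizard state so a refresh or crash doesn't lose progress
const STORAGE_KEY = "commercial-generator-state";

function saveState(state) {
  localStorage.setItem(STORAGE_KEY, JSON.stringify(state));
}

function loadState() {
  const saved = localStorage.getItem(STORAGE_KEY);
  return saved ? JSON.parse(saved) : null;
}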

Key takeaways and what’s next

Building this tool surfaced three key lessons.

First, prompt design is critical. The quality of output from any AI model depends heavily on how you frame the input. I spent as much time refining prompts as writing code.

Second, user experience beats technical complexity. Users don’t care how many AI services are involved — they care that the tool works. Progress indicators, error handling, and fast feedback loops make all the difference.

Third, AI assistants like Claude accelerate development. I focused on product logic while offloading boilerplate and syntax to the model. It’s not about skipping steps — it’s about building smarter.

What began as a weekend project turned into a real, extensible tool. Marketing teams could use it for prototyping, startups for pitch videos, and creators for sponsored content.

The system is flexible by design. You can change video styles by adjusting VEO 2 prompts, modify scene lengths for different formats, or add music via FFmpeg.

The real opportunity lies in orchestrating multiple AI systems. No single model can generate a full commercial — but combined, Gemini, VEO 2, and ElevenLabs can produce something far more powerful than any one of them alone.

This isn’t about AI replacing creators. It’s about giving creators better tools. After 20 years in content, I’ve seen a lot of change — but this shift feels foundational.

If you want to explore how ElevenLabs technology can help deliver new approaches to content and media, get in touch with our sales team.
