
Photograph a statue. Identify the figures depicted. Then have a real-time voice conversation with them - each character speaking in a distinct, period-appropriate voice.
That is what you can build with ElevenLabs' Voice Design and Agent APIs. In this post, we walk through the architecture of a mobile web app that combines computer vision with voice generation to turn public monuments into interactive experiences. Everything here is replicable with the APIs and code samples below.
The entire app below was built from a single prompt. In our testing, the prompt one-shot the app successfully in Cursor with Claude Opus 4.5 (high), starting from an empty Next.js project. If you want to skip ahead and build your own, paste this into your editor:
You can also use the ElevenLabs Agent Skills instead of linking to the docs. These are based on the docs and can yield even better results.
The rest of this post breaks down what that prompt produces.
The pipeline has five stages:
When a user photographs a statue, the image is sent to an OpenAI vision-capable model. A structured system prompt extracts the artwork name, location, artist, date, and - critically - a detailed voice description for each character. The system prompt includes the expected JSON output format:
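As a rough illustration of what a structured result like this could look like on the TypeScript side, here is a minimal sketch that types the fields listed above and validates the model's reply before anything is passed on to voice generation. The field names (`artworkName`, `voiceDescription`, and so on) and the `parseStatueAnalysis` helper are our own assumptions for this example, not the exact names used in the app's system prompt.

```typescript
// ASSUMPTION: these field names are illustrative; the real system prompt may
// use different keys for the same information.
interface StatueCharacter {
  name: string;
  voiceDescription: string;
}

interface StatueAnalysis {
  artworkName: string;
  location: string;
  artist: string;
  date: string;
  characters: StatueCharacter[];
}

// Parse the model's raw JSON reply and check that the fields we rely on exist.
function parseStatueAnalysis(raw: string): StatueAnalysis {
  const data = JSON.parse(raw) as Partial<StatueAnalysis>;
  for (const key of ["artworkName", "location", "artist", "date"] as const) {
    if (typeof data[key] !== "string") {
      throw new Error(`Missing or invalid field: ${key}`);
    }
  }
  if (!Array.isArray(data.characters) || data.characters.length === 0) {
    throw new Error("Expected at least one character");
  }
  for (const character of data.characters) {
    if (
      typeof character.name !== "string" ||
      typeof character.voiceDescription !== "string"
    ) {
      throw new Error("Each character needs a name and a voiceDescription");
    }
  }
  return data as StatueAnalysis;
}
```

Validating up front means a malformed reply fails at the vision step rather than later, when a voice is generated from an empty description.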
For a photograph of the Boudica statue on Westminster Bridge, London, the response looks like this:
The quality of the voice description directly determines the quality of the generated voice. The Voice Design prompting guide covers this in detail, but the key attributes to include are: an audio quality marker ("Perfect audio quality."), age and gender, tone/timbre (deep, resonant, gravelly), a precise accent ("thick Celtic British accent" rather than just "British"), and pacing. More descriptive prompts yield more accurate results - "a tired New Yorker in her 60s with a dry sense of humor" will generally outperform "an older female voice".
A few things worth noting from the guide: use "thick" rather than "strong" when describing accent prominence, avoid vague terms like "foreign," and for fictional or historical characters you can suggest real-world accents as inspiration (e.g., "an ancient Celtic queen with a thick British accent, regal and commanding").
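The guidance above can be captured in a small helper. This is a sketch only: `VoiceTraits`, `buildVoiceDescription`, and `findDiscouragedTerms` are hypothetical names of our own, not part of any ElevenLabs API. The helper assembles a description from the attributes listed above and flags words the guide advises against so a human can review them.

```typescript
// Hypothetical helper types; the attribute names are ours, not an API's.
interface VoiceTraits {
  ageAndGender: string; // e.g. "A woman in her 40s"
  tone: string;         // e.g. "deep, resonant"
  accent: string;       // e.g. "thick Celtic British accent"
  pacing: string;       // e.g. "measured, deliberate pacing"
}

// Assemble a description that leads with the audio quality marker.
function buildVoiceDescription(traits: VoiceTraits): string {
  return [
    "Perfect audio quality.",
    `${traits.ageAndGender} with a ${traits.tone} voice and a ${traits.accent}.`,
    `Speaks with ${traits.pacing}.`,
  ].join(" ");
}

// Flag words the guide suggests avoiding ("strong" for accents, "foreign").
// This is a simple word match, so treat hits as hints rather than errors.
function findDiscouragedTerms(description: string): string[] {
  const discouraged = ["strong", "foreign"];
  return discouraged.filter((term) =>
    new RegExp(`\\b${term}\\b`, "i").test(description)
  );
}

const boudicaDescription = buildVoiceDescription({
  ageAndGender: "A woman in her 40s",
  tone: "deep, resonant",
  accent: "thick Celtic British accent",
  pacing: "measured, deliberate pacing",
});
```

Running `findDiscouragedTerms` over descriptions returned by the vision model is a cheap way to catch vague wording before spending a voice generation on it.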
The Voice Design API generates new synthetic voices from text descriptions - no voice samples or cloning required. This makes it well-suited for historical figures where source audio does not exist.
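The process has two steps: generate preview voices from the text description, then save the preview you choose as a permanent voice.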
The text parameter matters. Longer, character-appropriate sample text (50+ words) produces more stable results - match the dialogue to the character rather than using a generic greeting. The Voice Design prompting guide covers this in more detail.
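A minimal sketch of the preview step might look like the following. It only builds the HTTP request and sends nothing. The endpoint path, the `xi-api-key` header, and the `voice_description` and `text` body fields reflect our reading of the ElevenLabs API reference and should be checked against the current docs. The 50-word check enforces the recommendation above more strictly than the API itself requires, and `buildPreviewRequest` is a hypothetical helper name.

```typescript
// Plain description of an HTTP call; this sketch does not send anything.
interface PreviewRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function countWords(text: string): number {
  return text.trim().split(/\s+/).filter((word) => word.length > 0).length;
}

// ASSUMPTION: endpoint path, header name, and body field names follow our
// reading of the API reference; verify them before relying on this.
function buildPreviewRequest(
  apiKey: string,
  voiceDescription: string,
  sampleText: string
): PreviewRequest {
  // Design choice: reject short sample text instead of silently accepting it.
  if (countWords(sampleText) < 50) {
    throw new Error("Sample text should be at least 50 words");
  }
  return {
    url: "https://api.elevenlabs.io/v1/text-to-voice/design",
    method: "POST",
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({
      voice_description: voiceDescription,
      text: sampleText,
    }),
  };
}
```

The returned object can be handed to `fetch` on the server, which keeps the API key out of the browser.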
Once previews are generated, select one and create a permanent voice:
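As a sketch of this second step, the helper below picks one preview and builds the JSON body for the voice-creation call. The snake_case field names (`generated_voice_id`, `voice_name`, `voice_description`) are assumptions based on our reading of the API reference, and `buildCreateVoiceBody` is a hypothetical helper, so verify both before use.

```typescript
// ASSUMPTION: field names mirror our reading of the preview response and the
// create-voice request body; check them against the API reference.
interface VoicePreview {
  generated_voice_id: string;
}

interface CreateVoiceBody {
  voice_name: string;
  voice_description: string;
  generated_voice_id: string;
}

// Select one preview (the first by default) and build the creation payload.
function buildCreateVoiceBody(
  characterName: string,
  voiceDescription: string,
  previews: VoicePreview[],
  chosenIndex = 0
): CreateVoiceBody {
  const chosen = previews[chosenIndex];
  if (chosen === undefined) {
    throw new Error(`No preview at index ${chosenIndex}`);
  }
  return {
    voice_name: characterName,
    voice_description: voiceDescription,
    generated_voice_id: chosen.generated_voice_id,
  };
}
```

Defaulting to the first preview keeps the flow fully automatic. An app that lets users audition previews would pass the index they picked instead.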
For multi-character statues, voice creation runs in parallel. Five characters' voices generate in roughly the same time as one:
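One way to express the parallel step is with `Promise.all`. In this sketch the actual voice-creation call is passed in as a function (`createVoice`), so the names and types here are hypothetical and independent of any particular SDK.

```typescript
interface CharacterVoiceInput {
  name: string;
  voiceDescription: string;
}

interface CreatedVoice {
  name: string;
  voiceId: string;
}

// Hypothetical signature: whatever performs the two-step voice creation for
// one character and resolves with the new voice ID.
type VoiceCreator = (input: CharacterVoiceInput) => Promise<string>;

async function createVoicesInParallel(
  characters: CharacterVoiceInput[],
  createVoice: VoiceCreator
): Promise<CreatedVoice[]> {
  // Start every request before awaiting any of them.
  const voiceIds = await Promise.all(
    characters.map((character) => createVoice(character))
  );
  // Promise.all preserves input order, so indexes line up with characters.
  return characters.map((character, index) => ({
    name: character.name,
    voiceId: voiceIds[index],
  }));
}
```

`Promise.all` rejects as soon as any single creation fails. If a partial cast of characters is acceptable, `Promise.allSettled` is the alternative to reach for.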
With voices created, the next step is configuring an ElevenLabs Agent that can switch between character voices in real time.
The supportedVoices array tells the agent which voices are available. The Agents platform handles voice switching automatically - when the LLM's response indicates a different character is speaking, the TTS engine routes that segment to the correct voice.
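To make this concrete, here is a sketch of how the created voices could be mapped into entries for that array. The entry shape (`label`, `voiceId`, `description`) is an assumption on our part, and `toVoiceLabel` and `buildSupportedVoices` are hypothetical helpers, so check the Agents documentation for the exact field names your SDK version expects.

```typescript
interface CharacterVoice {
  name: string;
  voiceId: string;
  voiceDescription: string;
}

// ASSUMPTION: this entry shape is illustrative; confirm the real field names
// in the Agents documentation.
interface SupportedVoice {
  label: string;
  voiceId: string;
  description: string;
}

// Keep labels to letters and digits so they are easy to reproduce exactly.
function toVoiceLabel(name: string): string {
  return name.replace(/[^A-Za-z0-9]+/g, "");
}

function buildSupportedVoices(characters: CharacterVoice[]): SupportedVoice[] {
  const seen = new Set<string>();
  return characters.map((character) => {
    const label = toVoiceLabel(character.name);
    // Two characters sharing a label would make voice routing ambiguous.
    if (label === "" || seen.has(label)) {
      throw new Error(`Duplicate or empty voice label for ${character.name}`);
    }
    seen.add(label);
    return {
      label,
      voiceId: character.voiceId,
      description: character.voiceDescription,
    };
  });
}
```

Rejecting duplicate labels early matters for multi-figure monuments, where two characters can easily end up with the same display name.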
Making multiple characters feel like a real group - rather than a sequential Q&A - requires deliberate prompt design:
The final piece is the client connection. ElevenLabs Agents support WebRTC for low-latency voice conversations - noticeably faster than WebSocket-based connections, which matters for natural turn-taking.
The useConversation hook handles audio capture, streaming, voice activity detection, and playback.
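A rough sketch of the pieces around that hook: the server requests a short-lived conversation token so the API key never reaches the browser, and the client passes that token to the session-start call with WebRTC selected. The token endpoint path and the `conversationToken` and `connectionType` option names are our understanding of the current API and SDK, so treat them as assumptions to verify. `buildTokenRequest` and `buildSessionOptions` are hypothetical helpers.

```typescript
// Plain description of the server-side token call; nothing is sent here.
interface TokenRequest {
  url: string;
  method: "GET";
  headers: Record<string, string>;
}

// ASSUMPTION: endpoint path and header name follow our reading of the API
// reference; verify them before use.
function buildTokenRequest(apiKey: string, agentId: string): TokenRequest {
  const query = `agent_id=${encodeURIComponent(agentId)}`;
  return {
    url: `https://api.elevenlabs.io/v1/convai/conversation/token?${query}`,
    method: "GET",
    headers: { "xi-api-key": apiKey },
  };
}

// ASSUMPTION: option names as we understand the client SDK's session options.
interface SessionOptions {
  conversationToken: string;
  connectionType: "webrtc";
}

// Options the client would pass when starting the conversation session.
function buildSessionOptions(conversationToken: string): SessionOptions {
  return { conversationToken, connectionType: "webrtc" };
}
```

In a Next.js app the token request would typically live in a route handler, with the client fetching the token from that route just before the user taps to start talking.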
For users who want more historical context before starting a conversation, you can add an enhanced research mode using OpenAI's web search tool:
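As an illustration, a research request could be assembled along these lines. This sketch assumes OpenAI's Responses API with a `web_search` tool entry; the model name is left as a parameter, and the prompt wording and the `buildResearchRequest` helper are our own hypothetical choices. Confirm the tool type and request shape against OpenAI's documentation.

```typescript
// Plain description of an HTTP call; this sketch does not send anything.
interface ResearchRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// ASSUMPTION: the endpoint and the "web_search" tool type follow our reading
// of OpenAI's Responses API docs; verify them before use.
function buildResearchRequest(
  apiKey: string,
  model: string,
  artworkName: string,
  location: string
): ResearchRequest {
  // Hypothetical prompt wording; tune it to the context your agent needs.
  const input =
    `Research the statue "${artworkName}" located at ${location}. ` +
    "Summarize who is depicted, the historical background, and notable facts.";
  return {
    url: "https://api.openai.com/v1/responses",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      tools: [{ type: "web_search" }],
      input,
    }),
  };
}
```

The summary that comes back can be appended to the agent's system prompt so the characters can draw on it during the conversation.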
This project shows that combining different AI modalities - text, research, vision, and audio - makes it possible to build experiences that span the digital and physical worlds. Multi-modal agents have a lot of untapped potential, and we'd love to see more people explore them for education, work, and fun.
The APIs used in this project - Voice Design, ElevenAgents, and OpenAI - are all available today.



