
Transform your voice into another character with full control over emotions, timing, and delivery.
Voice Changer was originally called speech-to-speech. In the context of AI voice agents, "speech-to-speech" also refers to fused architectures where a single model handles audio input and output directly. ElevenAgents uses an advanced cascaded architecture for its platform. Learn more: Cascaded vs Fused models.
We have added Voice Changer to Speech Synthesis. Voice Changer is a voice conversion tool that takes a recording of one voice and makes it sound as if spoken by another - while preserving the original performance. That means the emotions, timing, pacing, and pronunciation you put into the recording carry over to the output voice.
This gives you a level of control that text-to-speech prompts alone cannot always achieve. There are two primary ways to use it.
Extract more emotion from a voice. Not all voices respond equally well to emotional direction through text-to-speech prompts. Voice Changer lets you record or upload highly expressive speech and replicate that emotional range in a different voice. If you need a professional narrator to sound warmer, or a children's book character to sound more playful, you can perform the delivery yourself and have Voice Changer transfer it to the target voice.
Fine-tune speech delivery. Our text-to-speech models typically handle intonation well out of the box. But when you need precise control over how a specific phrase is delivered - where the emphasis falls, how a pause lands, what the cadence feels like - Voice Changer lets you demonstrate it with your own voice and then apply that delivery to any voice you choose. This will become even more useful once we integrate Voice Changer directly into Studio, but the goal is the same: give you precise control over your output.
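If you prefer to work programmatically, the same conversion can be sketched as an API call. The snippet below is a minimal illustration, assuming the v1 speech-to-speech endpoint accepts a multipart audio upload and returns the converted audio; the model ID, voice ID, and file names are placeholders:

```python
import requests

API_KEY = "YOUR_XI_API_KEY"    # placeholder: xi-api-key header auth assumed
VOICE_ID = "TARGET_VOICE_ID"   # the voice you want the performance rendered in

# Minimal sketch of a Voice Changer request, assuming the v1
# speech-to-speech endpoint takes a multipart "audio" upload.
url = f"https://api.elevenlabs.io/v1/speech-to-speech/{VOICE_ID}"

with open("my_performance.wav", "rb") as f:
    response = requests.post(
        url,
        headers={"xi-api-key": API_KEY},
        files={"audio": f},
        data={"model_id": "eleven_english_sts_v2"},  # assumed model name
    )
response.raise_for_status()

# The emotions, timing, and pacing of my_performance.wav should carry
# over into the target voice in the returned audio.
with open("converted.mp3", "wb") as out:
    out.write(response.content)
```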
Here is a walkthrough from one of our community members:
To convert source speech into target speech, we need to express source speech content with target speech characteristics. A useful analogy is face-swapping: apps that take two faces and blend them together, rendering one person's features inside the mapped structure of another's.
The way to go about this is to take the image of a face and map its attributes with markers - boundaries that define the limits inside which the other face is rendered.
The trick in voice conversion is to render source speech content using target speech phonemes. But there's a tradeoff here, much as there is in face-swapping: the more markers you use to map one face's attributes, the more constraints you impose on the face you render inside them. Fewer markers means fewer constraints.
The same is true of voice conversion. The more preference we give to the target speech, the more we risk falling out of sync with the source speech. But if we don't give it enough preference, we risk losing much of what makes the target voice distinctive. For example, if we were to render a recording of somebody shouting angrily in a whispery voice, we'd be in trouble. Give too much preference to the source speech's emotion and we lose the impression that a whispery voice is speaking. Put too much emphasis on the whispery speech pattern and we lose the emotional charge of the source speech.
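To make that tradeoff concrete, here is a deliberately simplified toy sketch, not the actual model: a single weight decides how much of the source performance survives versus how much of the target voice takes over. The feature vectors and numbers are invented for illustration.

```python
import numpy as np

# Toy model of the conversion tradeoff: one weight trades off
# source-performance preservation against target-voice character.
# All names and values here are hypothetical.

def toy_convert(source_style, target_timbre, target_weight):
    """Blend the two; weight 0 keeps the shout, weight 1 keeps the whisper."""
    return (1.0 - target_weight) * source_style + target_weight * target_timbre

angry_shout = np.array([0.95, 0.90])     # [loudness, intensity] of the source
whispery_voice = np.array([0.10, 0.15])  # the same features for the target

for w in (0.2, 0.5, 0.8):
    # Low weights keep the anger but lose the whisper; high weights
    # keep the whisper but flatten the anger.
    print(w, toy_convert(angry_shout, whispery_voice, w))
```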
We are making changes to the default voices available in Speech Synthesis. We will be retiring a few voices and replacing them with new ones, with over 20 additions planned in the coming weeks.
We will also start providing UI information on how long each voice is expected to remain available. Throughout December, we will revamp our voice sharing and usage compensation features to improve voice variety. More details soon.
Turbo v2 is the result of months of research from our team. It is designed for real-time interactions but works well for any use case. It also supports the standard μ-law (u-law) 8 kHz format used by IVR systems.
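As a rough illustration, requesting that telephony format might look like the snippet below, assuming an output_format query parameter that accepts a ulaw_8000 value; the voice ID and text are placeholders:

```python
import requests

API_KEY = "YOUR_XI_API_KEY"   # placeholder credentials
VOICE_ID = "SOME_VOICE_ID"    # placeholder voice

# Sketch: requesting Turbo v2 audio as u-law 8 kHz for an IVR pipeline,
# assuming an output_format query parameter with a "ulaw_8000" value.
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
response = requests.post(
    url,
    params={"output_format": "ulaw_8000"},
    headers={"xi-api-key": API_KEY},
    json={
        "text": "Thanks for calling. How can I help?",
        "model_id": "eleven_turbo_v2",
    },
)
response.raise_for_status()

with open("ivr_prompt.ulaw", "wb") as out:
    out.write(response.content)
```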
Studio now supports industry-standard audiobook submission guidelines, including gain adjustment and dynamic compression. You can also embed metadata (ISBN, author, and title) directly in your Studio project.
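Studio handles all of this for you, but for the curious, the embedded metadata is roughly equivalent to writing ID3 tags by hand, for example with the mutagen library. The custom ISBN frame name below is our own illustrative choice, not a fixed standard:

```python
from mutagen.id3 import ID3, ID3NoHeaderError, TIT2, TPE1, TXXX

# Sketch of the kind of metadata Studio can now embed automatically.
path = "audiobook_chapter_01.mp3"
try:
    tags = ID3(path)           # load existing tags if present
except ID3NoHeaderError:
    tags = ID3()               # otherwise start with an empty tag

tags.add(TIT2(encoding=3, text="Example Title"))    # title
tags.add(TPE1(encoding=3, text="Example Author"))   # author
# ISBN has no dedicated ID3 frame; a user-defined TXXX frame is one option.
tags.add(TXXX(encoding=3, desc="ISBN", text="978-0-00-000000-0"))
tags.save(path)
```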
This has been one of our most requested features. Last month, we added SSML tag support for specifying pronunciation using the IPA and CMU dictionaries with our English models. We have now released pronunciation dictionary support in the Studio UI, allowing you to upload a file specifying pronunciation using IPA, CMU, or word substitutions (aliases). Dictionary files use the industry-standard open .PLS lexicon file format.
IPA and CMU are currently supported by Turbo v2 English. Word substitutions are supported by all models and languages. Full documentation is available here.
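To give a feel for the format, here is a small Python snippet that writes out a minimal .PLS dictionary combining one IPA pronunciation with one word substitution (alias); the entries themselves are invented examples:

```python
# A minimal .PLS (W3C Pronunciation Lexicon) dictionary with one IPA
# pronunciation and one alias. The entries are invented for illustration.
PLS = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>t\u0259\u02c8m\u0251\u02d0to\u028a</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>UI</grapheme>
    <alias>user interface</alias>
  </lexeme>
</lexicon>
"""

with open("my_dictionary.pls", "w", encoding="utf-8") as f:
    f.write(PLS)
```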
If you have any feedback, don't hesitate to reach out to us on Discord!
Say it how you want and hear it delivered in a completely different voice, with full control over the performance. Capture whispers, laughs, accents, and subtle emotional cues.



