Overview
A guide on how to generate voiceovers using your voice on ElevenLabs.
Our Text to Speech technology is the backbone of ElevenLabs. Many of the features we offer are built around it, and it powers numerous excellent services around the web wherever the highest-quality AI-generated speech is needed.
The speech model takes text and converts it into extremely realistic speech. On the surface, it's a fairly simple concept, but the execution is anything but. There are a few things to keep in mind to achieve the best possible results, and we will try to cover most of them here.
We are constantly working on improving our service and technology, adding new features and settings. Therefore, it can be helpful to check back periodically to ensure you have the latest information and are following the most recent guidelines.
There are three main factors that we emphasize as being of utmost importance to ensure the best possible experience when using our Text to Speech:
Voice Selection
Model Selection
Voice Settings
![Text to Speech voice selection](https://files.buildwithfern.com/https://elevenlabs.docs.buildwithfern.com/docs/2025-01-23T08:52:19.137Z/product/speech-synthesis/images/tts_voices.webp)
Familiarizing yourself with these different settings and options is very important for getting the best possible results. For Text to Speech, there are three main selections you need to make.
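For illustration, here is a minimal sketch of a Text to Speech request against the public REST endpoint, showing where each of the three selections (voice, model, settings) goes. The API key and voice ID are placeholders you would substitute with your own:

```python
# Minimal sketch of a Text to Speech request via the public REST API.
# API_KEY and VOICE_ID are placeholders; substitute your own values.
import requests

API_KEY = "YOUR_XI_API_KEY"  # from your ElevenLabs profile
VOICE_ID = "YOUR_VOICE_ID"   # 1. the voice selection
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

payload = {
    "text": "Hello! This is a test of ElevenLabs Text to Speech.",
    "model_id": "eleven_multilingual_v2",  # 2. the model selection
    "voice_settings": {                    # 3. the voice settings
        "stability": 0.5,                  # UI slider value 50 maps to 0.5
        "similarity_boost": 0.75,          # UI slider value 75 maps to 0.75
    },
}

response = requests.post(url, json=payload, headers={"xi-api-key": API_KEY})
response.raise_for_status()

# The response body is audio (MP3 by default); write it to disk.
with open("output.mp3", "wb") as f:
    f.write(response.content)
```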
Voices
We offer many types of voices:
- Default Voices, which have been specifically curated to be the highest quality.
- Completely synthetic voices created using our Voice Design tool.
- Your own collection of cloned voices, created using our two technologies: Instant Voice Clones and Professional Voice Clones.
- The Voice Library, where you can browse to find the perfect voice for your production.
Not all voices are equal, and a lot depends on the source audio used to create that voice. Some voices will perform better than others, while some will be more stable than others. Additionally, certain voices will be more easily cloned by the AI than others, and some voices may work better with one model and one language compared to another. All of these factors are important to consider when selecting your voice.
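If you are working programmatically, a quick way to explore the voices available to your account is the voices endpoint. A minimal sketch (the API key is a placeholder):

```python
# Sketch: list the voices available to your account to pick a voice_id.
import requests

API_KEY = "YOUR_XI_API_KEY"
response = requests.get(
    "https://api.elevenlabs.io/v1/voices",
    headers={"xi-api-key": API_KEY},
)
response.raise_for_status()

for voice in response.json()["voices"]:
    # Each entry includes the voice_id used in Text to Speech requests.
    print(voice["voice_id"], voice["name"], voice.get("category"))
```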
Models
As of December 2024, ElevenLabs offers two families of models: standard (high-quality) models and Flash models, which are optimized for low latency. Each family includes both English-only and multilingual models, tailored for specific use cases with strengths in speed, accuracy, or language diversity.
- Standard models (Multilingual v2, Multilingual v1, English v1) are optimized for quality and accuracy, ideal for content creation. These models offer the best quality and stability but have higher latency.
- Flash models (Flash v2, Flash v2.5) are designed for low-latency applications like real-time conversational AI. They deliver great performance with faster processing speeds, though with a slight trade-off in accuracy and stability.
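As a rough illustration of how this choice might look in code, here is a sketch mapping the trade-off to model IDs. The identifiers `eleven_multilingual_v2`, `eleven_flash_v2_5`, and `eleven_flash_v2` are the public model IDs at the time of writing; verify current IDs in the models documentation:

```python
# Sketch: pick a model_id based on whether latency or quality matters more.
def pick_model(realtime: bool, multilingual: bool = True) -> str:
    if realtime:
        # Flash models trade a little accuracy/stability for speed.
        return "eleven_flash_v2_5" if multilingual else "eleven_flash_v2"
    # Standard models favor quality and stability over latency.
    return "eleven_multilingual_v2"

print(pick_model(realtime=False))  # content creation -> eleven_multilingual_v2
print(pick_model(realtime=True))   # conversational AI -> eleven_flash_v2_5
```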
If you want to find more detailed specifications about which languages each model offers, you can find all that information in our help article here.
For advice on how to deal with issues that might arise, please see our guide to troubleshooting.
Settings
Our users have found different workflows that work for them. The one you'll see most often is setting stability around 50 and similarity near 75, with minimal changes thereafter. Of course, this all depends on the original voice and the style of performance you're aiming for.
It's important to note that the AI is non-deterministic; setting the sliders to specific values won't guarantee the same results every time. Instead, the sliders function more as a range, determining how wide the randomization can be between each generation. Setting stability low means a wider range of randomization, often resulting in a more emotive performance, but this is also highly dependent on the voice itself.
For a more lively and dramatic performance, it is recommended to set the stability slider lower and generate a few times until you find a performance you like.
On the other hand, if you want a more serious performance, even bordering on monotone at very high values, it is recommended to set the stability slider higher. And since the output is more consistent and stable, you usually don't need as many generations to get what you are looking for. Experiment to find what works best for you!
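As a sketch of that workflow, the snippet below generates a few takes at a low stability value so you can audition them and keep the one you like. Note that the API expresses the sliders on a 0 to 1 scale, so UI values of 50 and 75 correspond to 0.5 and 0.75; the API key and voice ID are placeholders:

```python
# Sketch: generate several takes at low stability, then pick the best one.
import requests

API_KEY = "YOUR_XI_API_KEY"
VOICE_ID = "YOUR_VOICE_ID"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

payload = {
    "text": "I can't believe we actually made it!",
    "model_id": "eleven_multilingual_v2",
    "voice_settings": {
        "stability": 0.3,         # low stability: wider, more emotive range
        "similarity_boost": 0.75,
    },
}

for take in range(3):
    audio = requests.post(url, json=payload, headers={"xi-api-key": API_KEY})
    audio.raise_for_status()
    with open(f"take_{take}.mp3", "wb") as f:
        f.write(audio.content)  # audition the takes and keep your favorite
```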
Good to know
Good input equals good output
The first factor, and one of the most important, is that good, high-quality, and consistent input will result in good, high-quality, and consistent output.
If you provide the AI with audio that is less than ideal (for example, audio with a lot of noise or reverb over the speech, multiple speakers, or inconsistency in volume, performance, or delivery), the AI will become more unstable, and the output will be more unpredictable.
If you plan on cloning your own voice, we strongly recommend that you go through our guidelines in the documentation for creating proper voice clones, as this will provide you with the best possible foundation to start from. Even if you intend to use only Instant Voice Clones, it is advisable to read the Professional Voice Cloning section as well. This section contains valuable information about creating voice clones, even though the requirements for these two technologies are slightly different.
Use the right voice
The second factor to consider is that the voice you select will have a tremendous effect on the output. Not only, as mentioned in the first factor, is the quality and consistency of the samples used to create that specific clone extremely important, but also the language and tonality of the voice.
If you want a voice that sounds happy and cheerful, you should use a voice that has been cloned using happy and cheerful samples. Conversely, if you desire a voice that sounds introspective and brooding, you should select a voice with those characteristics.
However, it is also crucial to use a voice that has been trained in the correct language. For example, all of the professional voice clones we offer as default voices are English voices and have been trained on English samples. Therefore, if you have them speak other languages, their performance in those languages can be unpredictable. It is essential to use a voice that has been cloned from samples where the voice was speaking the language you want the AI to then speak.
Use proper formatting
This may seem slightly trivial, but it can make a big difference. The AI tries to understand how to read something based on the context of the text itself, which means not only the words used but also how they are put together, how punctuation is applied, the grammar, and the general formatting of the text.
This can have a subtle but meaningful influence on the AI's delivery. If you misspell a word, the AI won't correct it and will try to read it as written.
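As a small illustration, the sketch below synthesizes the same words twice, once unformatted and once with proper punctuation, so you can hear the difference context makes. It reuses the request pattern from the earlier examples, with the same placeholders:

```python
# Sketch: identical words, different punctuation -> different delivery.
import requests

API_KEY = "YOUR_XI_API_KEY"
VOICE_ID = "YOUR_VOICE_ID"

def tts(text: str, out_path: str) -> None:
    r = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        headers={"xi-api-key": API_KEY},
    )
    r.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(r.content)

# The punctuation changes pacing and intonation; the words are the same.
tts("are you sure about this i dont know", "flat.mp3")  # "dont" is read as written
tts("Are you sure about this? I... don't know.", "formatted.mp3")
```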
Nondeterministic
The settings of the AI are nondeterministic, meaning that even with the same initial conditions (voice, settings, model), it will give you slightly different output, similar to how a voice actor will deliver a slightly different performance each time.
This variability stems from the factors mentioned earlier: voice, settings, and model. Generally, the breadth of that variability is controlled by the stability slider: a lower stability setting means a wider range of variability between generations, and it also allows more variation within each generation, where the AI can be a bit more performative.
Wider variability is often desirable, as setting the stability too high can make certain voices sound monotone, since it doesn't give the AI the leeway to generate more variable content. However, setting the stability too low can introduce other issues, where the generations become unstable, especially with voices that were cloned from less-than-ideal audio.
The default setting of 50 is generally a great starting point for most applications.
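To see the nondeterminism for yourself, the sketch below sends the identical request twice and compares the resulting audio bytes; the hashes will almost always differ. Placeholders as in the earlier examples:

```python
# Sketch: two generations with identical text, voice, model, and settings
# still differ, because generation is nondeterministic.
import hashlib
import requests

API_KEY = "YOUR_XI_API_KEY"
VOICE_ID = "YOUR_VOICE_ID"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
payload = {
    "text": "Every take is a slightly different performance.",
    "model_id": "eleven_multilingual_v2",
    "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
}

digests = []
for _ in range(2):
    r = requests.post(url, json=payload, headers={"xi-api-key": API_KEY})
    r.raise_for_status()
    digests.append(hashlib.sha256(r.content).hexdigest())

print(digests[0] == digests[1])  # almost always False
```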