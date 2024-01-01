Our Text to Speech technology is the backbone of ElevenLabs. Many of the features we offer are built around it, and it powers numerous excellent services around the web that need the highest quality AI-generated speech.

The speech model takes text and converts it into extremely realistic speech. On the surface, it’s a fairly simple concept, but the execution is anything but. There are a few things to keep in mind to achieve the best possible results, and we will try to cover most of them.

We are constantly working on improving our service and technology, adding new features and settings. Therefore, it can be helpful to check back periodically to ensure you have the latest information and are following the most recent guidelines.

There are two main factors that we emphasize as being of utmost importance to ensure the best possible experience when using our Text to Speech.

Familiarizing yourself with these settings and options is important for getting the best possible result. For Text to Speech, there are three main selections you need to make: the voice, the model, and the settings.

Voices

We offer many types of voices: Default Voices, which have been specifically curated to be the highest quality; completely synthetic voices created using our Voice Design tool; and cloned voices, which you can collect using our two technologies, Instant Voice Clones and Professional Voice Clones. You can also browse through our voice library to find the perfect voice for your production.

Not all voices are equal, and a lot depends on the source audio used to create that voice. Some voices will perform better than others, while some will be more stable than others. Additionally, certain voices will be more easily cloned by the AI than others, and some voices may work better with one model and one language compared to another. All of these factors are important to consider when selecting your voice.

Models

As of September 2024, ElevenLabs offers two families of models: standard (high-quality) models and Turbo models, which are optimized for low latency. Each family includes both English-only and multilingual models, tailored for specific use cases with strengths in speed, accuracy, or language diversity.

Standard models (Multilingual v2, Multilingual v1, English v1) are optimized for quality and accuracy, ideal for content creation. These models offer the best quality and stability but have higher latency.

Turbo models (Turbo v2, Turbo v2.5) are designed for low-latency applications like real-time conversational AI. They deliver great performance with faster processing speeds, though with a slight trade-off in accuracy and stability.

For detailed specifications about which languages each model offers, see our help article. For advice on how to deal with issues that might arise, please see our guide to troubleshooting.

Settings

Our users have found different workflows that work for them. The one you’ll see most often is setting stability around 50 and similarity near 75, with minimal changes thereafter. Of course, this all depends on the original voice and the style of performance you’re aiming for.

It’s important to note that the AI is non-deterministic; setting the sliders to specific values won’t guarantee the same results every time. Instead, the sliders function more as a range, determining how wide the randomization can be between each generation. Setting stability low means a wider range of randomization, often resulting in a more emotive performance, but this is also highly dependent on the voice itself.

For a more lively and dramatic performance, set the stability slider lower and generate a few times until you find a performance you like.

On the other hand, if you want a more serious performance, even bordering on monotone at very high values, set the stability slider higher. Since the output is more consistent and stable, you usually don’t need as many generations to get what you are looking for. Experiment to find what works best for you!
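
The three selections described above (voice, model, and settings) can be sketched as a single request. The following is a minimal Python sketch under stated assumptions, not an official client: the endpoint URL, the `xi-api-key` header, the model IDs, and the `stability` / `similarity_boost` field names are assumptions about the public HTTP API, and the helper names (`pick_model`, `slider_to_api`, `build_payload`, `synthesize`, `generate_takes`) are hypothetical. It also assumes the 0 to 100 slider values map to 0.0 to 1.0 in the request. Check the current API reference before relying on any of it.

```python
import json
import urllib.request

# Assumed endpoint; verify against the current API reference.
API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def pick_model(low_latency: bool) -> str:
    """Pick a Turbo model for real-time use, a standard model otherwise.

    The model IDs below are assumptions, not taken from this article.
    """
    return "eleven_turbo_v2_5" if low_latency else "eleven_multilingual_v2"


def slider_to_api(value: int) -> float:
    """Convert a 0-100 slider position to the assumed 0.0-1.0 API scale."""
    if not 0 <= value <= 100:
        raise ValueError("slider value must be between 0 and 100")
    return value / 100


def build_payload(text: str, *, stability: int = 50, similarity: int = 75,
                  low_latency: bool = False) -> dict:
    """Build the request body using the commonly used 50 / 75 defaults."""
    return {
        "text": text,
        "model_id": pick_model(low_latency),
        "voice_settings": {
            "stability": slider_to_api(stability),
            "similarity_boost": slider_to_api(similarity),
        },
    }


def synthesize(voice_id: str, api_key: str, payload: dict) -> bytes:
    """Send one generation request and return the raw audio bytes."""
    request = urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def generate_takes(voice_id: str, api_key: str, text: str, takes: int = 3,
                   **settings) -> list[bytes]:
    """Generate several takes with identical settings.

    Output is non-deterministic, so each take can differ; lower stability
    widens the range between takes.
    """
    payload = build_payload(text, **settings)
    return [synthesize(voice_id, api_key, payload) for _ in range(takes)]
```

For a livelier read, something like `generate_takes(voice_id, api_key, text, takes=5, stability=30)` would request five takes at lower stability to choose from; at higher stability a single take is usually enough.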

