- Controls
- Pauses
- Pronunciation
- Phoneme Tags
- Alias Tags
- Pronunciation Dictionaries
- Pronunciation Dictionary examples
- Emotion
- Pace
- Tips
- Creative control
- Text normalization
- Why do models read out inputs differently?
- Common examples
- Mitigation
- Use trained models
- Apply normalization in LLM prompts
- Putting it all together
- Use Regular Expressions for preprocessing
- Prompting Eleven v3 (alpha)
- Voice selection
- Settings
- Stability
- Audio tags
- Voice-related
- Sound effects
- Unique and special
- Punctuation
- Single speaker examples
- Multi-speaker dialogue
- Enhancing input
- Tips
Best practices
This guide provides techniques to enhance text-to-speech outputs using ElevenLabs models. Experiment with these methods to discover what works best for your needs.
Controls
We are actively working on Director’s Mode to give you even greater control over outputs.
These techniques provide a practical way to achieve nuanced results until advanced features like Director’s Mode are rolled out.
Pauses
Use <break time="x.xs" /> for natural pauses up to 3 seconds.
Using too many break tags in a single generation can cause instability. The AI might speed up, or introduce additional noises or audio artifacts. We are working on resolving this.
- Consistency: Use <break> tags consistently to maintain natural speech flow. Excessive use can lead to instability.
- Voice-Specific Behavior: Different voices may handle pauses differently, especially those trained with filler sounds like “uh” or “ah.”
Alternatives to <break> include dashes (- or —) for short pauses or ellipses (…) for hesitant tones. However, these are less consistent.
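As an illustration, a short pause between two sentences could be written like this:

```text
"Give me one second to think about it." <break time="1.0s" /> "Yes, that works."
```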
Pronunciation
Phoneme Tags
Specify pronunciation using SSML phoneme tags. Supported alphabets include CMU Arpabet and the International Phonetic Alphabet (IPA).
Phoneme tags are only compatible with “Eleven Flash v2”, “Eleven Turbo v2” and “Eleven English v1” models.
We recommend using CMU Arpabet for consistent and predictable results with current AI models. While IPA can be effective, CMU Arpabet generally offers more reliable performance.
Phoneme tags only work for individual words. For example, if you want a first and last name pronounced a certain way, you will need to create a separate phoneme tag for each word.
Ensure correct stress marking for multi-syllable words to maintain accurate pronunciation. For example:
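A hedged sketch using the standard SSML phoneme element (the exact alphabet value accepted for CMU Arpabet, shown here as "cmu-arpabet", is an assumption to verify against the current docs); the digits mark primary stress (1) and unstressed syllables (0):

```xml
<phoneme alphabet="cmu-arpabet" ph="M AE1 D IH0 S AH0 N">Madison</phoneme>
```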
Alias Tags
For models that don’t support phoneme tags, you can try writing words more phonetically. You can also employ various tricks such as capital letters, dashes, apostrophes, or even single quotation marks around a single letter or letters.
For example, a word like “trapezii” could be spelled “trapezIi” to put more emphasis on the “ii” at the end of the word.
You can either replace the word directly in your text, or if you want to specify pronunciation using other words or phrases when using a pronunciation dictionary, you can use alias tags for this. This can be useful if you’re generating using Multilingual v2 or Turbo v2.5, which don’t support phoneme tags. You can use pronunciation dictionaries with Studio, Dubbing Studio and Speech Synthesis via the API.
For example, if your text includes a name that has an unusual pronunciation that the AI might struggle with, you could use an alias tag to specify how you would like it to be pronounced:
If you want to make sure that an acronym is always delivered in a certain way whenever it is encountered in your text, you can use an alias tag to specify this:
Pronunciation Dictionaries
Some of our tools, such as Studio and Dubbing Studio, allow you to create and upload a pronunciation dictionary. These allow you to specify the pronunciation of certain words, such as character or brand names, or to specify how acronyms should be read.
Pronunciation dictionaries allow this functionality by enabling you to upload a lexicon or dictionary file that specifies pairs of words and how they should be pronounced, either using a phonetic alphabet or word substitutions.
Whenever one of these words is encountered in a project, the AI model will pronounce the word using the specified replacement.
To provide a pronunciation dictionary file, open the settings for a project and upload a file in either .txt or .pls format. When a dictionary is added to a project, it will automatically recalculate which pieces of the project need to be re-converted using the new dictionary file and mark these as unconverted.
Currently we only support pronunciation dictionaries that specify replacements using phoneme or alias tags.
Both phoneme and alias rules specify a word or phrase to look for, referred to as a grapheme, and what it should be replaced with. Please note that searches are case sensitive. When checking for a replacement word, the dictionary is read from start to end and only the first matching rule is applied.
Pronunciation Dictionary examples
Here are examples of pronunciation dictionaries in both CMU Arpabet and IPA, including a phoneme to specify the pronunciation of “Apple” and an alias to replace “UN” with “United Nations”:
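A minimal .pls lexicon along these lines might look as follows (the alphabet attribute value for CMU Arpabet is an assumption; check the current documentation for the exact string):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="cmu-arpabet" xml:lang="en-US">
  <lexeme>
    <grapheme>Apple</grapheme>
    <phoneme>AE1 P AH0 L</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>UN</grapheme>
    <alias>United Nations</alias>
  </lexeme>
</lexicon>
```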
To generate a pronunciation dictionary .pls file, there are a few open source tools available:
- Sequitur G2P - Open-source tool that learns pronunciation rules from data and can generate phonetic transcriptions.
- Phonetisaurus - Open-source G2P system trained on existing dictionaries like CMUdict.
- eSpeak - Speech synthesizer that can generate phoneme transcriptions from text.
- CMU Pronouncing Dictionary - A pre-built English dictionary with phonetic transcriptions.
Emotion
Convey emotions through narrative context or explicit dialogue tags. This approach helps the AI understand the tone and emotion to emulate.
Explicit dialogue tags yield more predictable results than relying solely on context; however, the model will still read the emotional delivery guides aloud. These can be removed in post-production using an audio editor if unwanted.
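As a hedged illustration, an explicit delivery guide embedded alongside the dialogue might look like this (the guide itself will be spoken and can be trimmed afterwards):

```text
"I can't believe you did that," she said, her voice shaking with anger.
```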
Pace
The pacing of the audio is highly influenced by the audio used to create the voice. When creating your voice, we recommend using longer, continuous samples to avoid pacing issues like unnaturally fast speech.
For control over the speed of the generated audio, you can use the speed setting, which lets you speed up or slow down the generated speech. The speed setting is available in Text to Speech via the website and API, as well as in Studio and Agents Platform, and can be found in the voice settings.
The default value is 1.0, which means that the speed is not adjusted. Values below 1.0 will slow the voice down, to a minimum of 0.7. Values above 1.0 will speed up the voice, to a maximum of 1.2. Extreme values may affect the quality of the generated speech.
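When calling the API directly, the speed override lives in the voice settings of the request body. A minimal sketch, assuming the field names ("voice_settings", "speed") used by the public ElevenLabs REST API at the time of writing — verify them against the current API reference:

```python
def build_tts_request(text: str, speed: float = 1.0) -> dict:
    """Build a Text to Speech request body with a speed override."""
    # Documented range: 0.7 (slowest) to 1.2 (fastest). Clamp rather
    # than send out-of-range values, which can degrade quality or be
    # rejected by the API.
    speed = max(0.7, min(1.2, speed))
    return {
        "text": text,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {"speed": speed},
    }

build_tts_request("Hello there.", speed=1.5)["voice_settings"]["speed"]  # clamped to 1.2
```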
Pacing can also be controlled by writing in a natural, narrative style.
Tips
Common Issues
- Inconsistent pauses: Ensure the <break time="x.xs" /> syntax is used for pauses.
- Pronunciation errors: Use CMU Arpabet or IPA phoneme tags for precise pronunciation.
- Emotion mismatch: Add narrative context or explicit tags to guide emotion. Remember to remove any emotional guidance text in post-production.
Tips for Improving Output
Experiment with alternative phrasing to achieve desired pacing or emotion. For complex sound effects, break prompts into smaller, sequential elements and combine results manually.
Creative control
While we are actively developing a “Director’s Mode” to give users even greater control over outputs, here are some interim techniques to maximize creativity and precision:
Narrative styling
Write prompts in a narrative style, similar to scriptwriting, to guide tone and pacing effectively.
Layered outputs
Generate sound effects or speech in segments and layer them together using audio editing software for more complex compositions.
Phonetic experimentation
If pronunciation isn’t perfect, experiment with alternate spellings or phonetic approximations to achieve desired results.
Text normalization
When using Text to Speech with complex items like phone numbers, zip codes, and emails, these may be mispronounced. This is often because such items do not appear in the training set and smaller models fail to generalize how they should be pronounced. This guide clarifies when those discrepancies happen and how to have such items pronounced correctly.
Normalization is enabled by default for all TTS models to help improve pronunciation of numbers, dates, and other complex text elements.
Why do models read out inputs differently?
Certain models are trained to read out numbers and phrases in a more human way. For instance, the phrase “$1,000,000” is correctly read out as “one million dollars” by the Eleven Multilingual v2 model. However, the same phrase is read out as “one thousand thousand dollars” by the Eleven Flash v2.5 model.
The reason for this is that the Multilingual v2 model is a larger model and can better generalize the reading out of numbers in a way that is more natural for human listeners, whereas the Flash v2.5 model is a much smaller model and so cannot.
Common examples
Text to Speech models can struggle with the following:
- Phone numbers (“123-456-7890”)
- Currencies (“$47,345.67”)
- Calendar events (“2024-01-01”)
- Time (“9:23 AM”)
- Addresses (“123 Main St, Anytown, USA”)
- URLs (“example.com/link/to/resource”)
- Abbreviations for units (“TB” instead of “Terabyte”)
- Shortcuts (“Ctrl + Z”)
Mitigation
Use trained models
The simplest way to mitigate this is to use a TTS model that is trained to read out numbers and phrases in a more human way, such as the Eleven Multilingual v2 model. However, this might not always be possible, for instance when low latency is critical (e.g. conversational agents).
Apply normalization in LLM prompts
In the case of using an LLM to generate the text for TTS, you can add normalization instructions to the prompt.
Use clear and explicit prompts
LLMs respond best to structured and explicit instructions. Your prompt should clearly specify that you want text converted into a readable format for speech.
Handle different number formats
Not all numbers are read out in the same way. Consider how different number types should be spoken:
- Cardinal numbers: 123 → “one hundred twenty-three”
- Ordinal numbers: 2nd → “second”
- Monetary values: $45.67 → “forty-five dollars and sixty-seven cents”
- Phone numbers: “123-456-7890” → “one two three, four five six, seven eight nine zero”
- Decimals & Fractions: “3.5” → “three point five”, “⅔” → “two-thirds”
- Roman numerals: “XIV” → “fourteen” (or “the fourteenth” if a title)
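As a small illustration of the phone number case above, a hand-rolled helper can spell out digit groups before the text ever reaches the model (a sketch, not a full normalizer):

```python
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def speak_phone_number(raw: str) -> str:
    """Spell out a hyphen-separated phone number digit by digit,
    inserting a comma (a natural pause) between groups."""
    groups = raw.split("-")
    spoken = [" ".join(DIGITS[d] for d in group) for group in groups]
    return ", ".join(spoken)

speak_phone_number("123-456-7890")
# → "one two three, four five six, seven eight nine zero"
```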
Remove or expand abbreviations
Common abbreviations should be expanded for clarity:
- “Dr.” → “Doctor”
- “Ave.” → “Avenue”
- “St.” → “Street” (but “St. Patrick” should remain)
You can request explicit expansion in your prompt:
Expand all abbreviations to their full spoken forms.
Alphanumeric normalization
Not all normalization is about numbers; certain alphanumeric phrases should also be normalized for clarity:
- Shortcuts: “Ctrl + Z” → “control z”
- Abbreviations for units: “100km” → “one hundred kilometers”
- Symbols: “100%” → “one hundred percent”
- URLs: “elevenlabs.io/docs” → “eleven labs dot io slash docs”
- Calendar events: “2024-01-01” → “January first, two-thousand twenty-four”
Consider edge cases
Different contexts might require different conversions:
- Dates: “01/02/2023” → “January second, twenty twenty-three” or “the first of February, twenty twenty-three” (depending on locale)
- Time: “14:30” → “two thirty PM”
If you need a specific format, explicitly state it in the prompt.
Putting it all together
This prompt will act as a good starting point for most use cases:
Use Regular Expressions for preprocessing
If using code to prompt an LLM, you can use regular expressions to normalize the text before providing it to the model. This is a more advanced technique and requires some knowledge of regular expressions. Here are some simple examples:
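A sketch of a few such rules, covering patterns mentioned earlier in this guide (percentages, unit abbreviations, honorifics, phone numbers); these are illustrative, not exhaustive, and locale-specific formats will need their own rules:

```python
import re

def normalize_for_tts(text: str) -> str:
    """Expand a few common patterns before sending text to TTS."""
    # "100%"  -> "100 percent"
    text = re.sub(r"(\d+)%", r"\1 percent", text)
    # "100km" -> "100 kilometers" (unit abbreviations)
    text = re.sub(r"(\d+)\s*km\b", r"\1 kilometers", text)
    # "Dr."   -> "Doctor" (the trailing period keeps "Drive" untouched)
    text = re.sub(r"\bDr\.", "Doctor", text)
    # "123-456-7890" -> spaced digit groups separated by commas, so the
    # model reads a phone number rather than attempting arithmetic
    text = re.sub(
        r"\b(\d{3})-(\d{3})-(\d{4})\b",
        lambda m: ", ".join(" ".join(g) for g in m.groups()),
        text,
    )
    return text

normalize_for_tts("Dr. Smith ran 5km and hit 100% of target.")
# → "Doctor Smith ran 5 kilometers and hit 100 percent of target."
```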
Prompting Eleven v3 (alpha)
This guide provides the most effective tags and techniques for prompting Eleven v3, including voice selection, changes in capitalization, punctuation, audio tags and multi-speaker dialogue. Experiment with these methods to discover what works best for your specific voice and use case.
Eleven v3 is in alpha. Very short prompts are more likely to cause inconsistent outputs. We encourage you to experiment with prompts greater than 250 characters.
Voice selection
The most important parameter for Eleven v3 is the voice you choose. It needs to be similar enough to the desired delivery. For example, if the voice is shouting and you use the audio tag [whispering], it likely won’t work well.
When creating IVCs, you should include a broader emotional range than before. As a result, voices in the voice library may produce more variable results compared to the v2 and v2.5 models. We’ve compiled over 22 excellent voices for V3 here.
Choose voices strategically based on your intended use:
Emotionally diverse
For expressive IVC voices, vary emotional tones across the recording—include both neutral and dynamic samples.
Targeted niche
For specific use cases like sports commentary, maintain consistent emotion throughout the dataset.
Neutral
Neutral voices tend to be more stable across languages and styles, providing reliable baseline performance.
Professional Voice Clones (PVCs) are currently not fully optimized for Eleven v3, resulting in potentially lower clone quality compared to earlier models. During this research preview stage it would be best to find an Instant Voice Clone (IVC) or designed voice for your project if you need to use v3 features.
Settings
Stability
The stability slider is the most important setting in v3, controlling how closely the generated voice adheres to the original reference audio.

- Creative: More emotional and expressive, but prone to hallucinations.
- Natural: Closest to the original voice recording—balanced and neutral.
- Robust: Highly stable and consistent, similar to v2, but less responsive to directional prompts.
For maximum expressiveness with audio tags, use Creative or Natural settings. Robust reduces responsiveness to directional prompts.
Audio tags
Eleven v3 introduces emotional control through audio tags. You can direct voices to laugh, whisper, act sarcastic, or express curiosity among many other styles. Speed is also controlled through audio tags.
The voice you choose and its training samples will affect tag effectiveness. Some tags work well
with certain voices while others may not. Don’t expect a whispering voice to suddenly shout with a
[shout] tag.
Voice-related
These tags control vocal delivery and emotional expression:
[laughs], [laughs harder], [starts laughing], [wheezing], [whispers], [sighs], [exhales], [sarcastic], [curious], [excited], [crying], [snorts], [mischievously]
Sound effects
Add environmental sounds and effects:
[gunshot], [applause], [clapping], [explosion], [swallows], [gulps]
Unique and special
Experimental tags for creative applications:
[strong X accent] (replace X with the desired accent), [sings], [woo], [fart]
Some experimental tags may be less consistent across different voices. Test thoroughly before production use.
Punctuation
Punctuation significantly affects delivery in v3:
- Ellipses (…) add pauses and weight
- Capitalization increases emphasis
- Standard punctuation provides natural speech rhythm
Single speaker examples
Use tags intentionally and match them to the voice’s character. A meditative voice shouldn’t shout; a hyped voice won’t whisper convincingly.
Expressive monologue
Dynamic and humorous
Customer service simulation
Multi-speaker dialogue
v3 can handle multi-voice prompts effectively. Assign distinct voices from your Voice Library for each speaker to create realistic conversations.
Dialogue showcase
Glitch comedy
Overlapping timing
Enhancing input
In the ElevenLabs UI, you can automatically generate relevant audio tags for your input text by clicking the “Enhance” button. Behind the scenes this uses an LLM to enhance your input text with the following prompt:
Tips
Tag combinations
You can combine multiple audio tags for complex emotional delivery. Experiment with different combinations to find what works best for your voice.
Voice matching
Match tags to your voice’s character and training data. A serious, professional voice may not
respond well to playful tags like [giggles] or [mischievously].
Text structure
Text structure strongly influences output with v3. Use natural speech patterns, proper punctuation, and clear emotional context for best results.
Experimentation
There are likely many more effective tags beyond this list. Experiment with descriptive emotional states and actions to discover what works for your specific use case.