Prompting
Learn how to control the delivery, pronunciation, and emotion of text-to-speech output.
We are actively working on Director’s Mode to give you even greater control over outputs.
This guide provides techniques to enhance text-to-speech outputs using ElevenLabs models. Experiment with these methods to discover what works best for your needs. These techniques provide a practical way to achieve nuanced results until advanced features like Director’s Mode are rolled out.
Pauses
Use <break time="x.xs" /> for natural pauses of up to 3 seconds. Avoid excessive use to prevent instability.
- Consistency: Use <break> tags consistently to maintain natural speech flow. Excessive use can lead to instability.
- Voice-Specific Behavior: Different voices may handle pauses differently, especially those trained with filler sounds like “uh” or “ah.”
Alternatives to <break> include dashes (- or —) for short pauses or ellipses (…) for hesitant tones. However, these are less consistent.
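As a rough illustration of the pause guidance above, the following Python sketch builds <break> tags and clamps the duration to the 3-second limit mentioned in this guide. The helper names (break_tag, join_with_pauses) are hypothetical and not part of any ElevenLabs library; the code only prepares the text you would then send to the model.

```python
MAX_BREAK_SECONDS = 3.0  # upper limit stated in this guide


def break_tag(seconds: float) -> str:
    """Return a <break> tag, clamping the duration to the 0-3 second range."""
    clamped = max(0.0, min(seconds, MAX_BREAK_SECONDS))
    return f'<break time="{clamped:.1f}s" />'


def join_with_pauses(sentences: list[str], seconds: float) -> str:
    """Join sentences, placing exactly one pause between each adjacent pair."""
    return f" {break_tag(seconds)} ".join(sentences)


text = join_with_pauses(
    ["Hold on, let me think.", "Alright, I've got it."],
    seconds=1.5,
)
print(text)
```

Placing a single tag between sentences, rather than stacking several, keeps usage sparse, which is in line with the advice to avoid excessive breaks.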
Pronunciation
Specify pronunciation using SSML phoneme tags. Supported alphabets include IPA and CMU Arpabet.
Note: This feature is only compatible with “Eleven English V1” and “Eleven Turbo V2” models.
We recommend using CMU Arpabet for consistent and predictable results with current AI models. While IPA can be effective, CMU Arpabet generally offers more reliable performance.
For more advanced control over pronunciation, explore Pronunciation Dictionaries to customize word pronunciations.
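Ensure correct stress marking for multi-syllable words to maintain accurate pronunciation. In CMU Arpabet, stress is marked with a digit after each vowel phoneme: 1 for primary stress, 2 for secondary stress, and 0 for no stress.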
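A minimal sketch of building a phoneme tag in Python is shown here. The tag shape follows the SSML phoneme element; the alphabet value "cmu-arpabet", the helper names, and the transcription of "Madison" are illustrative assumptions, so check them against the model you are using. The small stress check relies on the CMU Arpabet convention that a vowel ending in 1 carries primary stress.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(word: str, phonemes: str, alphabet: str = "cmu-arpabet") -> str:
    """Wrap a word in an SSML-style phoneme tag (alphabet name is an assumption)."""
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(phonemes)}>"
        f"{escape(word)}</phoneme>"
    )


def has_primary_stress(arpabet: str) -> bool:
    """True if at least one Arpabet phoneme carries the primary-stress digit 1."""
    return any(token.endswith("1") for token in arpabet.split())


# Illustrative transcription: primary stress on the first syllable.
madison = "M AE1 D AH0 S AH0 N"
assert has_primary_stress(madison)

tag = phoneme_tag("Madison", madison)
print(tag)
```

Checking for a primary-stress marker before building the tag catches the common mistake of omitting stress digits on a multi-syllable word.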
For models that don’t support phoneme tags, try writing words more phonetically. You can also employ various tricks such as capital letters, dashes, apostrophes, or even single quotation marks around a single letter or letters.
As an example, a word like “trapezii” could be spelt “trapezIi” to put more emphasis on the “ii” of the word.
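One way to keep such respellings manageable is to store them in a small lookup table and apply them to the text before it is sent for synthesis. The sketch below is an assumption about workflow, not an ElevenLabs feature: the RESPELLINGS table and apply_respellings helper are hypothetical, and the only entry is the “trapezii” example from this guide.

```python
import re

# Lowercase source word -> phonetic respelling to send to the model.
RESPELLINGS = {"trapezii": "trapezIi"}


def apply_respellings(text: str, respellings: dict[str, str]) -> str:
    """Replace whole-word matches (case-insensitive) with their respellings."""
    if not respellings:
        return text
    alternatives = "|".join(re.escape(word) for word in respellings)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return pattern.sub(lambda m: respellings[m.group(0).lower()], text)


prompt = apply_respellings("The trapezii are paired muscles.", RESPELLINGS)
print(prompt)
```

Keeping the respellings in one table makes it easy to experiment with alternate spellings without editing the original script.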
Emotion
Convey emotions through narrative context or explicit dialogue tags. This approach helps the AI understand the tone and emotion to emulate.
Explicit dialogue tags yield more predictable results than relying solely on context. However, the model will still speak the emotional delivery guides aloud. If unwanted, these can be removed in post-production using an audio editor.
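To make the dialogue-tag approach concrete, here is a small Python sketch that appends an explicit tag to a spoken line. The with_dialogue_tag helper and its wording are hypothetical; the point is only that the emotional cue is written as ordinary narrative text after the quoted line, which the model will also read aloud.

```python
def with_dialogue_tag(line: str, delivery: str, speaker: str = "she") -> str:
    """Quote a line and append a narrative tag describing how it is delivered."""
    return f'"{line}" {speaker} said, {delivery}.'


prompt = with_dialogue_tag("Are you sure about that?", "sounding confused")
print(prompt)
```

Putting the tag after the quoted line keeps the spoken guidance at the end of the clip, where it is simpler to trim in an audio editor.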
Pace
Pacing can be controlled by writing in a natural, narrative style. For voice cloning, longer, continuous samples are recommended to avoid pacing issues like unnaturally fast speech.
- Sample Length: Use longer, continuous samples for voice cloning to avoid pacing issues.
- Narrative Style: Write in a narrative style to naturally control pacing and emotion, similar to scriptwriting.
Tips
Common Issues
- Inconsistent pauses: Ensure <break time="x.xs" /> syntax is used for pauses.
- Pronunciation errors: Use CMU Arpabet or IPA phoneme tags for precise pronunciation.
- Emotion mismatch: Add narrative context or explicit tags to guide emotion. Remember to remove any emotional guidance text in post-production.
Tips for Improving Output
Experiment with alternative phrasing to achieve desired pacing or emotion. For complex sound effects, break prompts into smaller, sequential elements and combine results manually.
Creative control
While we are actively developing a “Director’s Mode” to give users even greater control over outputs, here are some interim techniques to maximize creativity and precision:
Narrative styling
Write prompts in a narrative style, similar to scriptwriting, to guide tone and pacing effectively.
Layered outputs
Generate sound effects or speech in segments and layer them together using audio editing software for more complex compositions.
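As one possible way to prepare segments for separate generation, the Python sketch below splits a script at sentence boundaries into chunks no longer than a chosen length. The split_into_segments helper and the max_chars value are hypothetical choices for illustration, not limits imposed by any model; each resulting segment would be generated on its own and then layered in an audio editor.

```python
import re


def split_into_segments(script: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own segment.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


segments = split_into_segments("One. Two. Three.", max_chars=9)
print(segments)
```

Splitting only at sentence boundaries avoids cutting a phrase in half, which would otherwise make the seams between generated clips more audible.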
Phonetic experimentation
If pronunciation isn’t perfect, experiment with alternate spellings or phonetic approximations to achieve desired results.