Introduction

The ElevenLabs Python library is designed to help developers and app builders turn written text into lifelike speech. The results are impressive, with convincing human intonation.

Here is a brief rundown of the benefits of using ElevenLabs in your application:

  • Advanced Deep Learning – ElevenLabs applies deep learning algorithms that capture the nuances, emotion, and richness of human speech.
  • Voice Cloning in Minutes – You can replicate any voice by uploading a short snippet of sample audio. Whether creating a game voiceover or personalizing UX, our voice cloning will bring your project to life.
  • Seamless Integration – Introduce AI-powered speech into your use case with ease. The Python library and API use concise, clean code.

Here are some key features of the ElevenLabs library:

  • Text-to-speech in 28 languages with diverse voice options
  • Detailed voice customization features
  • Real-time voice synthesis with our streaming API
  • Advanced speech controls through SSML support

Getting Started

Before you install the ElevenLabs Python library, there are a few prerequisites to make your system ready:

  • Install Python 3.6 or higher
  • Access to the pip package manager
  • Register with ElevenLabs
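
You can quickly verify the first two prerequisites from a terminal (use python/pip or python3/pip3 depending on your system):

python3 --version
pip3 --version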

The ElevenLabs library also requires the following components for audio playback:

  • ffmpeg – For Mac: install using brew install ffmpeg. For Linux and Windows: refer to https://ffmpeg.org/
  • mpv – For Mac: install using brew install mpv. For Linux and Windows: refer to https://mpv.io/

Install Latest Python Library Version

To install the latest version, enter the following command:

pip install elevenlabs

This will download and install the package from PyPI.

For development versions, you can install directly from GitHub:

pip install git+https://github.com/elevenlabs/elevenlabs-python

Now the library is ready to import into your code:

import elevenlabs

Authenticate ElevenLabs API

The next step is to authenticate with your API key, which provides additional usage quota over anonymous access.

You can sign up for free on the ElevenLabs dashboard to obtain a key.

Next, you need to set the key in your Python code:

from elevenlabs import set_api_key

set_api_key("YOUR_API_KEY")
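
Hard-coding keys is risky; a common alternative is to read the key from an environment variable. (Recent 0.x releases also check the ELEVEN_API_KEY variable on their own; treat that as an assumption to verify against your installed version.)

import os
from elevenlabs import set_api_key

# Expects the key in the ELEVEN_API_KEY environment variable
set_api_key(os.environ["ELEVEN_API_KEY"])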

With everything set up, you’re ready to start using the ElevenLabs library. 

Generating Your First Text-to-Speech

With the ElevenLabs library imported and authenticated, we can now generate our first text-to-speech audio output.

The generate() function handles all text-to-speech generation. It takes the text to synthesize, along with options to specify the voice, model, and other parameters.

For starters, let’s use it with just the text:

from elevenlabs import generate 

audio = generate("Hello world! This is my first text-to-speech using ElevenLabs.")

This will use the default voice and settings to produce an audio object containing synthesized speech.

We can now play back audio using the play() function:

from elevenlabs import play

play(audio)

Within seconds, we hear the input text spoken in a natural voice.

The default voice uses ElevenLabs’ proprietary expressive English voice model. We can also customize the voice by passing a voice parameter to generate():

audio = generate("Welcome to ElevenLabs!", voice="Amy")

Here is all the code used in this section combined for convenience:

# In a shell, not Python:
pip install elevenlabs

from elevenlabs import set_api_key, generate, play

set_api_key("YOUR_API_KEY")

audio = generate("Hello world! This is my first text-to-speech using ElevenLabs.")
play(audio)

audio = generate("Welcome to ElevenLabs!", voice="Amy")

There are dozens of natural voices to choose from. That’s just the beginning of the customizations available, which we’ll explore more in the next sections.

Core Features

The ElevenLabs Python library offers two primary capabilities: high-quality text-to-speech and voice cloning.

Text-to-Speech

The generate() function is used for text-to-speech conversion. Supply it with text, and it provides an audio object with synthesized speech.

To generate speech in various languages, use the model parameter:

# Multiple languages  
audio_en = generate("Hello world!", model="eleven_multilingual_v2")  
# English
audio_mono = generate("Hello world!", model="eleven_monolingual_v1")

Numerous voices are accessible for each language. You can specify a voice using its ID or name:

audio = generate("Hi there", voice="Amy")

You can also fine-tune voice characteristics with VoiceSettings. Note that the ElevenLabs API exposes stability and similarity controls rather than direct pitch or speaking-rate parameters, and settings are attached to a Voice object passed to generate():

from elevenlabs import generate, Voice, VoiceSettings

settings = VoiceSettings(stability=0.7, similarity_boost=0.75)
audio = generate(
    "Welcome to our podcast!",
    voice=Voice(
        voice_id="YOUR_VOICE_ID",  # placeholder; look up real IDs with voices()
        settings=settings,
    ),
)

For real-time applications, utilize stream=True to receive the audio in fragments:

audio_stream = generate("Hello world", stream=True)

Here is all the code used in this section combined for convenience:

# Multiple languages
audio_en = generate("Hello world!", model="eleven_multilingual_v2")
# English
audio_mono = generate("Hello world!", model="eleven_monolingual_v1")

audio = generate("Hi there", voice="Amy")

from elevenlabs import generate, Voice, VoiceSettings

settings = VoiceSettings(stability=0.7, similarity_boost=0.75)
audio = generate(
    "Welcome to our podcast!",
    voice=Voice(voice_id="YOUR_VOICE_ID", settings=settings),  # placeholder ID
)

audio_stream = generate("Hello world", stream=True)

Voice Cloning

ElevenLabs lets you clone voices with a minimum of 30 seconds of audio samples. Provide sample files to the clone() function:

from elevenlabs import clone
voice = clone(
    name="My Cloned Voice",
    files=["sample1.mp3", "sample2.mp3"],
)

Then generate speech using the cloned voice:

audio = generate("Hello world!", voice=voice)

With just these two capabilities - high-fidelity text-to-speech and voice cloning - many applications are possible with ElevenLabs. 

Cloning from Samples

For optimal cloning results, deliver at least 30 seconds of diverse high-quality audio samples, covering styles like storytelling, casual talk, and more. Ensure there are pauses and tonal variations.

Save these as individual files, preferably in WAV or FLAC format:

files = [  
  "clip1.wav",  
  "clip2.flac",  
  "clip3.wav"  
]

Pass the samples to clone():

voice = clone(  
  name="Bob",  
  description="A friendly voice",  
  files=files  
)

This will upload the samples and kick off voice cloning to create a synthetic version.

Once the voice is cloned, it is assigned a unique ID. You can save this ID for future reference, so you won't need to clone the voice again:

voice_id = voice.voice_id

Customizing Cloned Voices

While the cloned voice mirrors the provided samples, you can further fine-tune its voice settings. As noted earlier, ElevenLabs settings control stability and similarity rather than pitch or speaking rate, and they are attached to the Voice object:

from elevenlabs import VoiceSettings

voice = clone(name="Bob", files=files)
voice.settings = VoiceSettings(stability=0.6, similarity_boost=0.8)

Using Cloned Voices

Employ the cloned voice with the generate() function:

audio = generate("Hello!", voice=voice)

If you want to use the saved voice_id instead:

audio = generate("Hello!", voice=voice_id)

This cloned voice works anywhere a built-in voice does, and its saved ID lets you reuse it across multiple projects.

Streaming Audio

With the ElevenLabs package, you can efficiently stream audio in real-time.

To generate audio in streaming mode, use the stream=True parameter with the generate function:

from elevenlabs import generate

audio_stream = generate("Hello world", stream=True)

This returns a generator that yields audio chunks as they become available.

To play streamed audio, pass the generator to the stream() helper, which plays each chunk as it arrives:

from elevenlabs import generate, stream

audio_stream = generate("Hello world", stream=True)
stream(audio_stream)

This allows playing audio with minimal latency, useful for real-time applications. 

ElevenLabs usually delivers latency of less than one second. Latency is as low as 500 ms in some cases.
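
If you'd rather handle the chunks yourself than play them, for example to save the stream to disk, iterate over the generator directly (a minimal sketch; the file name is arbitrary):

from elevenlabs import generate

audio_stream = generate("Hello world", stream=True)

with open("stream.mp3", "wb") as f:
    for chunk in audio_stream:
        if chunk:
            f.write(chunk)  # each chunk is a piece of the MP3 byte stream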

Advanced Usage

In this section, we’ll explore advanced functionalities such as controlling speech parameters, leveraging SSML, and handling extensive text, among other aspects.

Controlling Voice and Model Parameters

Let's look at how to achieve nuanced control over speech synthesis with the generate() function. Here's how to customize specific attributes:

Voice: Determine the tone of the output. You can specify a voice name, voice_id, or even use a Voice object. The Voice object allows fine-tuning attributes like stability and similarity boost.

voice = "Bella"   # Using voice name  
audio = generate("Hello!", voice=voice)

Model: Select a speech synthesis model. Specify either a model name or directly pass a Model object.

model = "eleven_monolingual_v1"   # Using model name  
audio = generate("Hello world!", model=model)

Handling Large Texts

When dealing with extensive texts, streaming can be beneficial. Using the stream=True parameter in the generate function returns audio incrementally, which is particularly effective for long reads or when latency is a concern. The latency parameter tunes the streaming optimizations (higher values trade some quality for speed), but the roughly 5,000-character limit per request still applies.

audio_stream = generate("A long piece of text...", stream=True, latency=2)  
for chunk in audio_stream:  
    # process or play each chunk
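
For inputs above the per-request limit, one workable approach is to split the text at sentence boundaries and synthesize each piece separately. This is a sketch, not a library feature; chunk_text and the 5,000-character limit value are assumptions for illustration:

from elevenlabs import generate

def chunk_text(text, limit=5000):
    """Split text into pieces under `limit` characters, breaking at sentence ends."""
    chunks, current = [], ""
    for sentence in text.split(". "):
        if current and len(current) + len(sentence) + 2 > limit:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current}. {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks

# Synthesize each piece and concatenate the raw MP3 bytes
long_text = "A very long article. " * 2000
audio = b"".join(generate(piece) for piece in chunk_text(long_text))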

Monitoring Usage and Costs

The User API object helps monitor your subscription details, like the remaining character quota. Keep tabs on your usage to manage costs efficiently.

from elevenlabs.api import User  
user = User.from_api()  
print(user)
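
To check the remaining quota programmatically, the subscription object carries character counters; the field names below mirror the underlying /user/subscription API response and are worth verifying against your installed version:

from elevenlabs.api import User

user = User.from_api()
sub = user.subscription
print(f"Used {sub.character_count} of {sub.character_limit} characters")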

Optimization

To minimize costs while maintaining quality:

API Key Management: Ensure you set your API key; it grants higher quotas and more capabilities than anonymous access.

from elevenlabs import set_api_key
set_api_key("YOUR_API_KEY")

Voice Selection: Optimize by selecting voices and models apt for your project rather than using defaults.

Stream for Lengthy Content: For longer content, use the streaming feature to conserve resources.

Adding Breaks and Pauses

If you want to insert a pause, place a break tag such as <break time="1.5s" /> directly in the text where you want the break to happen. The AI can handle breaks of up to 3 seconds in length, and these are not just cuts in the audio; the model renders them as proper natural pauses.

"Give me one second to think about it. <break time="1.0s" /> Yes, that would work."

Troubleshooting

Working with advanced voice generation tools like the ElevenLabs Python library can sometimes lead to unexpected issues or errors. Here’s a compilation of common pitfalls, errors, and their solutions to ensure a smooth experience with the library.

Installation Issues

Error: pip is not recognized

Solution: Ensure you have the pip package manager installed. For Python 3, you might need to use pip3 instead of pip.

Error: Conflicting dependencies

Solution: This could be due to other packages requiring different versions. Try creating a new virtual environment using virtualenv or venv and then install the ElevenLabs library there.
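
For example, to create and use an isolated environment with venv:

python3 -m venv .venv
source .venv/bin/activate    # On Windows: .venv\Scripts\activate
pip install elevenlabs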

Authentication Errors

Error: API key not set or invalid

Solution: Double-check your API key for any typos. If you’re copying it from a website or document, ensure there are no leading or trailing whitespaces. If the error persists, verify the API key’s validity in the ElevenLabs dashboard.

Text-to-Speech Issues

Error: Unsupported language or voice

Solution: Confirm that the language or voice ID you provided is supported by ElevenLabs. Refer to their documentation for a list of supported languages and voice options.

Error: Text length exceeds limit

Solution: The generate() function might have a character limit for individual requests. Break the text into smaller chunks and call the function multiple times if needed.

Voice Cloning Issues

Error: Audio sample too short

Solution: Ensure your voice samples are at least 30 seconds long. The more diverse and high-quality the samples, the better the cloning results will be.

Error: Unsupported audio format

Solution: Make sure your audio samples are in a supported format, preferably WAV or FLAC. Convert any unsupported formats to a supported one using audio editing tools or online converters.

Streaming Issues

Error: Stream buffer underflow

Solution: This can occur if the playback consumes audio faster than it’s being generated. Adjust the latency parameter or implement buffer controls to handle this.

Error: Audio is choppy during playback

Solution: Ensure that the network connection is stable. If you’re processing audio chunks, ensure that the processing isn’t causing delays. Try reducing the complexity of any real-time processing or increasing buffer sizes.

General Issues

Error: Library functions not working after an update

Solution: Check the ElevenLabs documentation or changelog for any breaking changes. Update your code accordingly or consider reverting to a previous version of the library.

Error: Unexpected costs or usage

Solution: Monitor your usage regularly using the User API. Adjust your API calls and workflows to fit within your budget or subscription limits.

Real-World Examples

Building a Text-to-Speech Narrator for a Video Game

Modern video games often rely on intricate storylines and rich narratives to engage players. Utilizing text-to-speech for in-game narrations can be both cost-effective and innovative.

Challenges:

  • Dynamic Scenarios: A game might have multiple paths or outcomes based on player choices. The narration should adjust accordingly.
  • Emotional Resonance: Gameplay can range from intense battles to emotional storytelling. The voice should resonate with the mood.

Solutions:

  • Scenario Mapping: Integrate the ElevenLabs library with the game’s decision-making logic. As gameplay evolves, select relevant text strings for narration.
  • Voice Modulation: Use the library's voice selection and settings controls (such as stability and similarity) to match the in-game mood. For instance, during a battle scene, opt for a voice and delivery that sound faster and more intense.

Developing a Voice Assistant with a Cloned Voice

Personal voice assistants like Siri or Alexa are common. If brands, or even individuals, want a unique voice, ElevenLabs offers an ideal solution.

Challenges:

  • Voice Uniqueness: Creating a voice that doesn’t sound generic or robotic.
  • Consistency: Ensuring the cloned voice maintains consistent quality across diverse inputs.

Solutions:

  • Cloning: Use ElevenLabs’ voice cloning feature. With just 30 seconds of sample audio, create a voice that’s distinctive.
  • Quality Checks: Regularly test the cloned voice with diverse scripts to ensure consistent quality. Adjust voice settings or provide more diverse training data if required.

Creating Audio Versions of Articles

With the rise of podcasts and audio content, there’s growing demand for article-to-audio conversions, making content consumption more accessible and flexible.

Challenges:

  • Diverse Formats: Articles can be interviews, op-eds, research pieces, or narratives. Each format has its unique tone and structure.
  • Maintaining Engagement: Ensuring the audio version retains the engagement of the written content.

Solutions:

  • Format Recognition: Use simple text analysis to identify the format. For interviews, detect dialogues. For research pieces, recognize data or structured points.
  • Voice Customization:
      ◦ Interviews: Use different voice models for different speakers to mimic an actual conversation.
      ◦ Op-eds: Opt for a more assertive and confident voice model.
      ◦ Research pieces: A neutral and clear voice can ensure complex data is easily understood.
  • Enhancements: Incorporate short pauses or musical interludes for longer pieces to break monotony and retain listener interest.

Case Studies

GPT-Voice-Conversation-Chatbot (GPT-VCC)

GPT-VCC is a versatile chatbot that allows users to interact with ChatGPT or GPT-4 either vocally using a microphone or via typing. 

Key features include:

  • Memory Retention: The bot can remember conversations during a session, overcoming GPT’s token limitations.
  • API Integration: A valid OpenAI API key is essential. The bot uses OpenAI’s moderation tools and the NLTK to align with OpenAI’s content policies.
  • User Interface: Choose between a graphical or command-line interface.
  • Customization: Adjust the bot’s voice (using tools like Google’s TTS and ElevenLabs), name, and creativity level.
  • Voice Commands: A suite of commands allows users to modify the interaction experience, such as changing the bot’s response style or toggling between GPT models.
  • Use Cases: Ideal for casual chats, language practice, writing assistance, and programming help.

AIUI: Voice Interface for AI Overview

AIUI transforms AI interactions, emphasizing voice rather than point-and-click interfaces.

  • Platform Support: Works on both desktop and mobile browsers.
  • AI Model Support: Compatible with GPT-4 and GPT-3.5, with plans for open model integration.
  • How it Works: Navigate to AIUI on your browser, start speaking, and get a vocal AI response, enabling fluid conversations.

Local Deployment:

  • Clone: git clone git@github.com:lspahija/AIUI.git
  • Use Docker for setup.
  • Navigate to localhost:8000.

Customizations:

  • Default AI Model: gpt-3.5-turbo (changeable with AI_COMPLETION_MODEL variable).
  • Adjustable audio speed via AUDIO_SPEED.
  • Configurable languages with LANGUAGE variables (defaults to English). Some limitations apply based on the TTS provider.

RealChar: Realtime AI Character Overview

At RealChar, you can create, tailor, and converse with your AI companion on the spot.

Platform: RealChar.ai (Also beta-testing an iOS app)

Key Highlights:

  • User-Friendly: Create AI characters without coding.
  • Tailorable: Adjust your AI character’s traits, history, and voice.
  • Real-Time Interaction: Engage with your AI character instantly.
  • Multi-Platform: Available on web, terminal, and mobile (open-source mobile app).
  • Cutting-Edge AI: Incorporates latest AI tech like OpenAI, Anthropic Claude 2, etc.
  • Modularity: Swap modules for flexibility. A great starting point for AI enthusiasts.

Feedback Loop: Sharing and Growing Together

We believe that when creative minds come together, everyone benefits.

Sharing your experience of using our library helps to enrich our community and inspire further advancements. Here’s how you can be an integral part of the feedback loop:

  1. Share Your Projects: Built something cool using our library? Don’t keep it to yourself. Show it off to the community. Whether it’s a small tweak or a grand application, every project has a story to tell.
  2. Contribute to Discussions: Engage in meaningful conversations on our forum or discussion boards. Your insights, challenges, and solutions can be a beacon for others navigating similar waters.
  3. Write About It: If you’ve explored unique implementations or discovered novel solutions, consider penning a blog post or article. Written resources can be immensely beneficial for newcomers and veterans alike.
  4. Collaborate: Open to teaming up? Find like-minded developers in the community to collaborate on bigger projects or to merge innovative ideas.
  5. Feedback & Suggestions: Continuous improvement is at the heart of development. If you have suggestions, critiques, or ideas for the library, we’re all ears.

Summary

Our library offers an array of features tailored to meet the diverse needs of developers looking for text-to-speech capabilities. From intuitive user interfaces to robust backend integrations, it stands out as a comprehensive tool for various applications.

Limitations: In the spirit of transparency, it’s crucial to acknowledge that no tool is without its limitations. Our library may require a learning curve for newcomers, and there might be specific niche scenarios where it may not be the ideal fit. We work hard to address these issues and greatly appreciate user feedback that helps us improve.

Community and Support: Check out our Discord channel where developers can interact and collaborate on projects. 

Continuous Learning: The tech world is ever-evolving, and so is our library. To ensure our users are always ahead of the curve, we offer a wide range of resources. From advanced tutorials to enlightening webinars and in-depth courses, you can find what you need in our resources section.