# Add Chapter To A Project
post /v1/projects/{project_id}/chapters/add
Creates a new chapter, either blank or from a URL. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Add Project
post /v1/projects/add
Creates a new project. It can be initialized blank, from a document, or from a URL. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Add Sharing Voice
post /v1/voices/add/{public_user_id}/{voice_id}
Add a sharing voice to your collection of voices in VoiceLab.

# Add Voice
post /v1/voices/add
Add a new voice to your collection of voices in VoiceLab.

## Voice Cloning API Usage
If you provide a list of file paths to audio recordings intended for voice cloning, you confirm that you have all necessary rights or consents to upload and clone the voice samples they contain and that you will not use the platform-generated content for any illegal, fraudulent, or harmful purpose. You reaffirm your obligation to abide by ElevenLabs’ [Terms of Service](https://elevenlabs.io/terms-of-use), [Prohibited Use Policy](https://elevenlabs.io/use-policy) and [Privacy Policy](https://elevenlabs.io/privacy-policy).

# Audio Isolation
post /v1/audio-isolation
Removes background noise from audio.

## Pricing
The API is charged at 1000 characters per minute of audio.

## Removing background noise with our Python SDK
Our Audio Isolation API is what powers our Voice Isolator, which removes background noise from audio and leaves you with crystal-clear dialogue. To get started, here's an example you can follow using our [Python SDK](https://github.com/elevenlabs/elevenlabs-python).

```python
from elevenlabs.client import ElevenLabs

# Initialize the client with your API key
client = ElevenLabs(api_key="your api key")

# Path to the audio file you want to isolate
audio_file_path = "sample_file.mp3"

with open(audio_file_path, "rb") as audio_file:
    # Perform audio isolation
    isolated_audio_iterator = client.audio_isolation.audio_isolation(audio=audio_file)

    # Save the isolated audio to a new file
    output_file_path = "cleaned_file.mp3"
    with open(output_file_path, "wb") as output_file:
        for chunk in isolated_audio_iterator:
            output_file.write(chunk)

    print(f"Isolated audio saved to {output_file_path}")
```

# Audio Isolation Stream
post /v1/audio-isolation/stream
Removes background noise from audio and streams the result.

## Pricing
The API is charged at 1000 characters per minute of audio.

# Convert Chapter
post /v1/projects/{project_id}/chapters/{chapter_id}/convert
Starts conversion of a specific chapter. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Convert Project
post /v1/projects/{project_id}/convert
Starts conversion of a project and all of its chapters. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Dub A Video Or An Audio File
post /v1/dubbing
Dubs a provided audio or video file into the given language (see the example request sketched below, after this list of endpoints).

# Creates AudioNative Enabled Project
post /v1/audio-native
Creates an AudioNative enabled project, optionally starts conversion, and returns the project ID and an embeddable HTML snippet.

# Delete Chapter
delete /v1/projects/{project_id}/chapters/{chapter_id}
Delete a chapter by its chapter_id. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Delete Dubbing Project
delete /v1/dubbing/{dubbing_id}
Deletes a dubbing project.
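To start a dub from code, the `POST /v1/dubbing` endpoint above accepts a multipart upload. The following is a minimal sketch using `requests`; the form field names (`file`, `source_lang`, `target_lang`) and the `dubbing_id` response field are assumptions for illustration, so confirm them against the endpoint reference before relying on them.

```python
import requests

XI_API_KEY = "API_KEY_HERE"

# Assumed multipart form fields; confirm against the /v1/dubbing endpoint reference.
with open("interview.mp4", "rb") as source_file:
    response = requests.post(
        "https://api.elevenlabs.io/v1/dubbing",
        headers={"xi-api-key": XI_API_KEY},
        files={"file": ("interview.mp4", source_file, "video/mp4")},
        data={
            "source_lang": "en",  # language of the uploaded file
            "target_lang": "es",  # language to dub into
        },
    )

response.raise_for_status()
# The response is assumed to include the new dubbing project's ID.
print("Dubbing started:", response.json().get("dubbing_id"))
```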
# Delete History Item
delete /v1/history/{history_item_id}
Delete a history item by its ID.

# Delete Project
delete /v1/projects/{project_id}
Delete a project by its project_id. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Delete Sample
delete /v1/voices/{voice_id}/samples/{sample_id}
Removes a sample by its ID.

# Delete Voice
delete /v1/voices/{voice_id}
Deletes a voice by its ID.

# Download History Items
post /v1/history/download
Download one or more history items. If one history item ID is provided, we will return a single audio file. If more than one history item ID is provided, the history items are packed into a .zip file.

# Edit Voice
post /v1/voices/{voice_id}/edit
Edit a voice created by you.

# Edit Voice Settings
post /v1/voices/{voice_id}/settings/edit
Edit your settings for a specific voice. "similarity_boost" corresponds to "Clarity + Similarity Enhancement" in the web app and "stability" corresponds to the "Stability" slider in the web app.

# Generate A Random Voice
post /v1/voice-generation/generate-voice
Generate a random voice based on parameters. This method returns a generated_voice_id in the response header and a sample of the voice in the body. If you like the generated voice, call /v1/voice-generation/create-voice with the generated_voice_id to create the voice. This API is deprecated. Please use the new [Text to Voice API](/api-reference/ttv-create-previews).

# Voice Generation Parameters
get /v1/voice-generation/generate-voice/parameters
Get possible parameters for the /v1/voice-generation/generate-voice endpoint. This API is deprecated. Please use the new [Text to Voice API](/api-reference/ttv-create-previews).

# Get Audio From History Item
get /v1/history/{history_item_id}/audio
Returns the audio of a history item.

# Get Audio From Sample
get /v1/voices/{voice_id}/samples/{sample_id}/audio
Returns the audio corresponding to a sample attached to a voice.

# Get Chapter By Id
get /v1/projects/{project_id}/chapters/{chapter_id}
Returns information about a specific chapter. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Get Chapter Snapshots
get /v1/projects/{project_id}/chapters/{chapter_id}/snapshots
Gets information about all the snapshots of a chapter; each snapshot can be downloaded as audio. Whenever a chapter is converted, a snapshot is automatically created. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Get Chapters
get /v1/projects/{project_id}/chapters
Returns a list of a project's chapters together with their metadata. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Get Default Voice Settings
get /v1/voices/settings/default
Gets the default settings for voices. "similarity_boost" corresponds to "Clarity + Similarity Enhancement" in the web app and "stability" corresponds to the "Stability" slider in the web app.

# Get Dubbed File
get /v1/dubbing/{dubbing_id}/audio/{language_code}
Returns the dubbed file as a streamed file. Videos are returned in MP4 format and audio-only dubs are returned in MP3.

# Get Dubbing Project Metadata
get /v1/dubbing/{dubbing_id}
Returns metadata about a dubbing project, including whether it is still in progress.

# Get Transcript For Dub
get /v1/dubbing/{dubbing_id}/transcript/{language_code}
Returns the transcript for the dub as an SRT file.
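Once a dubbing project has finished, the two `GET` endpoints above can be combined to fetch the dubbed media and its SRT transcript. Below is a minimal sketch, assuming a completed project and a placeholder `dubbing_id`; the output filenames are illustrative only.

```python
import requests

XI_API_KEY = "API_KEY_HERE"
DUBBING_ID = "DUBBING_ID_HERE"  # ID returned when the dub was created
LANGUAGE_CODE = "es"            # the target language used when dubbing

headers = {"xi-api-key": XI_API_KEY}
base = "https://api.elevenlabs.io/v1/dubbing"

# Stream the dubbed media (MP4 for video sources, MP3 for audio-only dubs).
audio = requests.get(f"{base}/{DUBBING_ID}/audio/{LANGUAGE_CODE}", headers=headers, stream=True)
audio.raise_for_status()
with open(f"dubbed_{LANGUAGE_CODE}.mp4", "wb") as f:
    for chunk in audio.iter_content(chunk_size=1024):
        f.write(chunk)

# Fetch the transcript as an SRT file.
transcript = requests.get(f"{base}/{DUBBING_ID}/transcript/{LANGUAGE_CODE}", headers=headers)
transcript.raise_for_status()
with open(f"dubbed_{LANGUAGE_CODE}.srt", "w", encoding="utf-8") as f:
    f.write(transcript.text)
```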
# Get Generated Items
get /v1/history
Returns metadata about all your generated audio.

# Get History Item By Id
get /v1/history/{history_item_id}
Returns information about a history item by its ID.

# Get Models
get /v1/models
Gets a list of available models.

# Get Project By Id
get /v1/projects/{project_id}
Returns information about a specific project. This endpoint returns more detailed information about a project than GET api.elevenlabs.io/v1/projects. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Get Project Snapshots
get /v1/projects/{project_id}/snapshots
Gets the snapshots of a project. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Get Projects
get /v1/projects
Returns a list of your projects together with their metadata. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Get User Info
get /v1/user
Gets information about the user.

# Get User Subscription Info
get /v1/user/subscription
Gets extended information about the user's subscription.

# Get Voice
get /v1/voices/{voice_id}
Returns metadata about a specific voice.

# Get Voice Settings
get /v1/voices/{voice_id}/settings
Returns the settings for a specific voice. "similarity_boost" corresponds to "Clarity + Similarity Enhancement" in the web app and "stability" corresponds to the "Stability" slider in the web app.

# Get Voices
get /v1/voices
Gets a list of all available voices for a user.

# API Reference Overview
Overview of ElevenLabs API endpoints and capabilities:

* Convert text into lifelike speech with industry-leading quality and latency
* Clone voices from audio while preserving emotion and intonation
* Generate AI-powered sound effects and audio for any use case
* Separate speech from background noise in audio files
* Create interactive AI voice experiences with WebSocket agents
* Access and manage your generated audio history
* Access subscription information and user details
* Access and manage your custom AI voice collection
* Generate custom voices from text descriptions
* Browse and use our collection of shared voices
* Organize and manage your audio generation projects
* Create and manage custom pronunciation rules
* Access our selection of AI voice models
* Create and manage audio-enabled web projects
* Automatically translate and dub audio content
* Manage team members and workspace settings
* Monitor character and API usage metrics

All API endpoints require authentication using your API key. Click through to each section for detailed endpoint documentation.

# Add from file
post /v1/pronunciation-dictionaries/add-from-file
Creates a new pronunciation dictionary from a lexicon .PLS file.

## Adding a pronunciation dictionary
Here is some example code for uploading a [pronunciation dictionary](https://elevenlabs.io/docs/projects/overview#pronunciation-dictionaries) and printing the `pronunciation_dictionary_id` and `version_id` from the response. You will need these identifiers in the request body if you intend to use `pronunciation_dictionary_locators`. All you need to do is replace `API_KEY_HERE` with your actual API key and `PATH_HERE` with the path to the PLS file you want to upload. If you need help formatting a PLS pronunciation dictionary, please refer to the guide [here](https://elevenlabs.io/docs/projects/overview#pronunciation-dictionaries). There is currently no way to fetch or update previously uploaded dictionaries.
Therefore, you will need to keep track of the identifiers. If you need to update the dictionary, you will have to upload a new one. ```python import requests import os # Define your API key and the base URL for the Eleven Labs API XI_API_KEY = "API_KEY_HERE" BASE_URL = "https://api.elevenlabs.io/v1" # Setup the headers for HTTP requests to include the API key and accept JSON responses headers = { "Accept": "application/json", "xi-api-key": XI_API_KEY } def upload_pronunciation_dictionary(file_path, name, description): """ Uploads a pronunciation dictionary file to the Eleven Labs API and returns its ID and version ID. Parameters: - file_path: The local path to the pronunciation dictionary file. - name: A name for the pronunciation dictionary. - description: A description of the pronunciation dictionary. Returns: A tuple containing the pronunciation dictionary ID and version ID if successful, None otherwise. """ # Construct the URL for adding a pronunciation dictionary from a file url = f"{BASE_URL}/pronunciation-dictionaries/add-from-file" # Prepare the file and data to be sent in the request files = {'file': open(file_path, 'rb')} data = {'name': name, 'description': description} # Make the POST request to upload the dictionary response = requests.post(url, headers=headers, files=files, data=data) # Handle the response if response.status_code == 200: # Parse the response JSON to get the pronunciation dictionary and version IDs data = response.json() pronunciation_dictionary_id = data.get('id') version_id = data.get('version_id') # Return the IDs return pronunciation_dictionary_id, version_id else: # Print an error message if the request failed print("Error:", response.status_code) return None, None def main(): """ The main function to upload a pronunciation dictionary. """ # Define the path to your pronunciation dictionary file and its metadata file_path = r"PATH_HERE" name = "Your Pronunciation Dictionary" description = "My custom pronunciation dictionary" # Upload the pronunciation dictionary and receive its ID and version ID pronunciation_dictionary_id, version_id = upload_pronunciation_dictionary(file_path, name, description) # Check if the upload was successful if pronunciation_dictionary_id and version_id: print("Pronunciation Dictionary Uploaded Successfully!") print("Pronunciation Dictionary ID:", pronunciation_dictionary_id) print("Version ID:", version_id) else: print("Failed to upload pronunciation dictionary.") # Ensure this script block runs only when executed as a script, not when imported if __name__ == "__main__": main() ``` ## Using a pronunciation dictionary Here is some example code on how to use these identifiers or locators in your text-to-speech call. ```python import requests # Set your API key and base URL XI_API_KEY = "API_KEY_HERE" BASE_URL = "https://api.elevenlabs.io/v1" VOICE_ID = "TxGEqnHWrfWFTfGW9XjX" # Headers for the request headers = { "Accept": "application/json", "xi-api-key": XI_API_KEY } def text_to_speech(text, pronunciation_dictionary_id, version_id): """ Sends a text to speech request using a pronunciation dictionary. Returns: An audio file. 
""" # Define the URL for the text-to-speech endpoint url = f"{BASE_URL}/text-to-speech/{VOICE_ID}" # Payload for the request payload = { "model_id": "eleven_monolingual_v1", "pronunciation_dictionary_locators": [ { "pronunciation_dictionary_id": pronunciation_dictionary_id, "version_id": version_id } ], "text": text, "voice_settings": { "stability": 0.5, "similarity_boost": 0.8, "style": 0.0, "use_speaker_boost": True } } # Make the POST request response = requests.post(url, json=payload, headers=headers) # Check the response status if response.status_code == 200: # Here you can save the audio response to a file if needed print("Audio file generated successfully.") # Save the audio to a file with open("output_audio.mp3", "wb") as audio_file: audio_file.write(response.content) else: print("Error:", response.status_code) def main(): # Example text and dictionary IDs (replace with actual values) text = "Hello, world! I can now use pronunciation dictionaries." pronunciation_dictionary_id = "PD_ID_HERE" version_id = "VERSION_ID_HERE" # Call the text to speech function text_to_speech(text, pronunciation_dictionary_id, version_id) if __name__ == "__main__": main() ``` # Get dictionary by id get /v1/pronunciation-dictionaries/{pronunciation_dictionary_id}/ Get metadata for a pronunciation dictionary # Add rules post /v1/pronunciation-dictionaries/{pronunciation_dictionary_id}/add-rules Add rules to the pronunciation dictionary # Remove rules post /v1/pronunciation-dictionaries/{pronunciation_dictionary_id}/remove-rules Remove rules from the pronunciation dictionary # Download version by id get /v1/pronunciation-dictionaries/{dictionary_id}/{version_id}/download Get PLS file with a pronunciation dictionary version rules # Get dictionaries get /v1/pronunciation-dictionaries/ Get a list of the pronunciation dictionaries you have access to and their metadata # Get Voices get /v1/shared-voices Gets a list of shared voices. # Node Library # Python Library # Sound Generation post /v1/sound-generation API that converts text into sounds & uses the most advanced AI audio model ever. Create sound effects for your videos, voice-overs or video games. ## Pricing The API is charged at 100 characters per generation with automatic duration or 25 characters per second with set duration. # Speech To Speech post /v1/speech-to-speech/{voice_id} Use Speech to Speech API to transform uploaded speech so it sounds like it was spoken by another voice. STS gives you full control over the emotions, timing and delivery. ## Audio generation Generating speech-to-speech involves a similar process to text-to-speech, but with some adjustments in the API parameters. Instead of providing text when calling the API, you provide the path to an audio file that you would like to convert from one voice to another. 
Here’s a modified version of the text-to-speech code that illustrates how to generate speech-to-speech using the API:

```python
# Import necessary libraries
import requests  # Used for making HTTP requests
import json  # Used for working with JSON data

# Define constants for the script
CHUNK_SIZE = 1024  # Size of chunks to read/write at a time
XI_API_KEY = ""  # Your API key for authentication
VOICE_ID = ""  # ID of the voice model to use
AUDIO_FILE_PATH = ""  # Path to the input audio file
OUTPUT_PATH = "output.mp3"  # Path to save the output audio file

# Construct the URL for the Speech-to-Speech API request
sts_url = f"https://api.elevenlabs.io/v1/speech-to-speech/{VOICE_ID}/stream"

# Set up headers for the API request, including the API key for authentication
headers = {
    "Accept": "application/json",
    "xi-api-key": XI_API_KEY
}

# Set up the data payload for the API request, including model ID and voice settings
# Note: voice settings are converted to a JSON string
data = {
    "model_id": "eleven_english_sts_v2",
    "voice_settings": json.dumps({
        "stability": 0.5,
        "similarity_boost": 0.8,
        "style": 0.0,
        "use_speaker_boost": True
    })
}

# Set up the files to send with the request, including the input audio file
files = {
    "audio": open(AUDIO_FILE_PATH, "rb")
}

# Make the POST request to the STS API with headers, data, and files, enabling streaming response
response = requests.post(sts_url, headers=headers, data=data, files=files, stream=True)

# Check if the request was successful
if response.ok:
    # Open the output file in write-binary mode
    with open(OUTPUT_PATH, "wb") as f:
        # Read the response in chunks and write to the file
        for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
            f.write(chunk)
    # Inform the user of success
    print("Audio stream saved successfully.")
else:
    # Print the error message if the request was not successful
    print(response.text)
```

## Voices
We offer 1000s of voices in 29 languages. Visit the [Voice Lab](https://elevenlabs.io/voice-lab) to explore our pre-made voices or [clone your own](https://elevenlabs.io/voice-cloning). Visit the [Voices Library](https://elevenlabs.io/voice-library) to see voices generated by ElevenLabs users.

## Supported languages
Our STS API is multilingual and currently supports the following languages:
`Chinese, Korean, Dutch, Turkish, Swedish, Indonesian, Filipino, Japanese, Ukrainian, Greek, Czech, Finnish, Romanian, Russian, Danish, Bulgarian, Malay, Slovak, Croatian, Classic Arabic, Tamil, English, Polish, German, Spanish, French, Italian, Hindi and Portuguese`.
To use them, simply provide the input audio in the language of your choice.

***

# Streaming
post /v1/speech-to-speech/{voice_id}/stream
Creates speech by combining the content and emotion of the uploaded audio with a voice of your choice, and returns an audio stream.

# Stream Chapter Audio
post /v1/projects/{project_id}/chapters/{chapter_id}/snapshots/{chapter_snapshot_id}/stream
Stream the audio from a chapter snapshot. Use `GET /v1/projects/{project_id}/chapters/{chapter_id}/snapshots` to return the chapter snapshots of a chapter. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).

# Stream Project Audio
post /v1/projects/{project_id}/snapshots/{project_snapshot_id}/stream
Stream the audio from a project snapshot. The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales).
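For the two Projects streaming endpoints above, a request might look like the following sketch. It assumes you already have a `project_id` and a `project_snapshot_id` (for example from `GET /v1/projects/{project_id}/snapshots`); the empty JSON body is an assumption, so check the endpoint reference for optional parameters.

```python
import requests

XI_API_KEY = "API_KEY_HERE"
PROJECT_ID = "PROJECT_ID_HERE"            # placeholder project ID
PROJECT_SNAPSHOT_ID = "SNAPSHOT_ID_HERE"  # placeholder snapshot ID

url = (
    "https://api.elevenlabs.io/v1/projects/"
    f"{PROJECT_ID}/snapshots/{PROJECT_SNAPSHOT_ID}/stream"
)

response = requests.post(
    url,
    headers={"xi-api-key": XI_API_KEY},
    json={},      # assumed empty body; optional parameters may exist
    stream=True,
)
response.raise_for_status()

# Write the streamed audio to disk as it arrives.
with open("project_snapshot.mp3", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
```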
# Text To Speech Streaming post /v1/text-to-speech/{voice_id}/stream Converts text into speech using a voice of your choice and returns audio as an audio stream. # Text To Speech Streaming With Timestamps post /v1/text-to-speech/{voice_id}/stream/with-timestamps Converts text into audio together with timestamps on when which word was spoken in a streaming way. ## Audio generation You can generate audio together with information on when which character was spoken in a streaming way using the following script: ```python import requests import json import base64 VOICE_ID = "21m00Tcm4TlvDq8ikWAM" # Rachel YOUR_XI_API_KEY = "ENTER_YOUR_API_KEY_HERE" url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream/with-timestamps" headers = { "Content-Type": "application/json", "xi-api-key": YOUR_XI_API_KEY } data = { "text": ( "Born and raised in the charming south, " "I can add a touch of sweet southern hospitality " "to your audiobooks and podcasts" ), "model_id": "eleven_multilingual_v2", "voice_settings": { "stability": 0.5, "similarity_boost": 0.75 } } response = requests.post( url, json=data, headers=headers, stream=True ) if response.status_code != 200: print(f"Error encountered, status: {response.status_code}, " f"content: {response.text}") quit() audio_bytes = b"" characters = [] character_start_times_seconds = [] character_end_times_seconds = [] for line in response.iter_lines(): if line: # filter out keep-alive new line # convert the response which contains bytes into a JSON string from utf-8 encoding json_string = line.decode("utf-8") # parse the JSON string and load the data as a dictionary response_dict = json.loads(json_string) # the "audio_base64" entry in the dictionary contains the audio as a base64 encoded string, # we need to decode it into bytes in order to save the audio as a file audio_bytes_chunk = base64.b64decode(response_dict["audio_base64"]) audio_bytes += audio_bytes_chunk if response_dict["alignment"] is not None: characters.extend(response_dict["alignment"]["characters"]) character_start_times_seconds.extend(response_dict["alignment"]["character_start_times_seconds"]) character_end_times_seconds.extend(response_dict["alignment"]["character_end_times_seconds"]) with open('output.mp3', 'wb') as f: f.write(audio_bytes) print({ "characters": characters, "character_start_times_seconds": character_start_times_seconds, "character_end_times_seconds": character_end_times_seconds }) ``` This prints out a dictionary like: ```python { 'characters': ['B', 'o', 'r', 'n', ' ', 'a', 'n', 'd', ' ', 'r', 'a', 'i', 's', 'e', 'd', ' ', 'i', 'n', ' ', 't', 'h', 'e', ' ', 'c', 'h', 'a', 'r', 'm', 'i', 'n', 'g', ' ', 's', 'o', 'u', 't', 'h', ',', ' ', 'I', ' ', 'c', 'a', 'n', ' ', 'a', 'd', 'd', ' ', 'a', ' ', 't', 'o', 'u', 'c', 'h', ' ', 'o', 'f', ' ', 's', 'w', 'e', 'e', 't', ' ', 's', 'o', 'u', 't', 'h', 'e', 'r', 'n', ' ', 'h', 'o', 's', 'p', 'i', 't', 'a', 'l', 'i', 't', 'y', ' ', 't', 'o', ' ', 'y', 'o', 'u', 'r', ' ', 'a', 'u', 'd', 'i', 'o', 'b', 'o', 'o', 'k', 's', ' ', 'a', 'n', 'd', ' ', 'p', 'o', 'd', 'c', 'a', 's', 't', 's'], 'character_start_times_seconds': [0.0, 0.186, 0.279, 0.348, 0.406, 0.441, 0.476, 0.499, 0.522, 0.58, 0.65, 0.72, 0.778, 0.824, 0.882, 0.906, 0.952, 0.975, 1.01, 1.045, 1.068, 1.091, 1.115, 1.149, 1.196, 1.254, 1.3, 1.358, 1.416, 1.474, 1.498, 1.521, 1.602, 1.66, 1.811, 1.869, 1.927, 1.974, 2.009, 2.043, 2.067, 2.136, 2.183, 2.218, 2.252, 2.287, 2.322, 2.357, 2.392, 2.426, 2.45, 2.508, 2.531, 2.589, 2.635, 2.682, 2.717, 2.763, 2.786, 2.81, 2.879, 
2.937, 3.007, 3.065, 3.123, 3.17, 3.239, 3.286, 3.367, 3.402, 3.437, 3.46, 3.483, 3.529, 3.564, 3.599, 3.634, 3.68, 3.75, 3.82, 3.889, 3.971, 4.087, 4.168, 4.214, 4.272, 4.331, 4.389, 4.412, 4.447, 4.528, 4.551, 4.574, 4.609, 4.644, 4.702, 4.748, 4.807, 4.865, 4.923, 5.016, 5.074, 5.12, 5.155, 5.201, 5.248, 5.283, 5.306, 5.329, 5.352, 5.41, 5.457, 5.573, 5.654, 5.735, 5.886, 5.944, 6.06], 'character_end_times_seconds': [0.186, 0.279, 0.348, 0.406, 0.441, 0.476, 0.499, 0.522, 0.58, 0.65, 0.72, 0.778, 0.824, 0.882, 0.906, 0.952, 0.975, 1.01, 1.045, 1.068, 1.091, 1.115, 1.149, 1.196, 1.254, 1.3, 1.358, 1.416, 1.474, 1.498, 1.521, 1.602, 1.66, 1.811, 1.869, 1.927, 1.974, 2.009, 2.043, 2.067, 2.136, 2.183, 2.218, 2.252, 2.287, 2.322, 2.357, 2.392, 2.426, 2.45, 2.508, 2.531, 2.589, 2.635, 2.682, 2.717, 2.763, 2.786, 2.81, 2.879, 2.937, 3.007, 3.065, 3.123, 3.17, 3.239, 3.286, 3.367, 3.402, 3.437, 3.46, 3.483, 3.529, 3.564, 3.599, 3.634, 3.68, 3.75, 3.82, 3.889, 3.971, 4.087, 4.168, 4.214, 4.272, 4.331, 4.389, 4.412, 4.447, 4.528, 4.551, 4.574, 4.609, 4.644, 4.702, 4.748, 4.807, 4.865, 4.923, 5.016, 5.074, 5.12, 5.155, 5.201, 5.248, 5.283, 5.306, 5.329, 5.352, 5.41, 5.457, 5.573, 5.654, 5.735, 5.886, 5.944, 6.06, 6.548] } ``` As you can see this dictionary contains three lists of the same size. For example response\_dict\['alignment']\['characters']\[3] contains the fourth character in the text you provided 'n', response\_dict\['alignment']\['character\_start\_times\_seconds']\[3] and response\_dict\['alignment']\['character\_end\_times\_seconds']\[3] contain its start (0.348 seconds) and end (0.406 seconds) timestamps. # Text To Speech post /v1/text-to-speech/{voice_id} API that converts text into lifelike speech with best-in-class latency & uses the most advanced AI audio model ever. Create voiceovers for your videos, audiobooks, or create AI chatbots for free. *** # Introduction Our AI model produces the highest-quality AI voices in the industry. Our [text to speech](https://elevenlabs.io/text-to-speech) [API](https://elevenlabs.io/api) allows you to convert text into audio in 32 languages and 1000s of voices. Integrate our realistic text to speech voices into your react app, use our Python library or our websockets guide to get started. ### API Features 1000s of voices, in 32 languages, for every use-case, at 128kbps As low as \~300ms (+ network latency) audio generation times with our Turbo model. Understands text nuances for appropriate intonation and resonance. *** # Quick Start ## Audio generation Generate spoken audio from text with a simple request like the following Python example: ```python import requests CHUNK_SIZE = 1024 url = "https://api.elevenlabs.io/v1/text-to-speech/" headers = { "Accept": "audio/mpeg", "Content-Type": "application/json", "xi-api-key": "" } data = { "text": "Born and raised in the charming south, I can add a touch of sweet southern hospitality to your audiobooks and podcasts", "model_id": "eleven_monolingual_v1", "voice_settings": { "stability": 0.5, "similarity_boost": 0.5 } } response = requests.post(url, json=data, headers=headers) with open('output.mp3', 'wb') as f: for chunk in response.iter_content(chunk_size=CHUNK_SIZE): if chunk: f.write(chunk) ``` ## Voices We offer 1000s of voices in 29 languages. Visit the [Voice Lab](https://elevenlabs.io/voice-lab) to explore our pre-made voices or [clone your own](https://elevenlabs.io/voice-cloning). Visit the [Voices Library](https://elevenlabs.io/voice-library) to see voices generated by ElevenLabs users. 
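To find a `voice_id` to append to the text-to-speech URL in the quick-start example above, you can list the voices available to your account with `GET /v1/voices` (documented earlier in this reference). A minimal sketch, assuming the response's `voices` array exposes `name` and `voice_id` fields:

```python
import requests

XI_API_KEY = "API_KEY_HERE"

response = requests.get(
    "https://api.elevenlabs.io/v1/voices",
    headers={"xi-api-key": XI_API_KEY},
)
response.raise_for_status()

# Print each voice's name and ID so you can pick one for text-to-speech calls.
for voice in response.json()["voices"]:
    print(voice["name"], voice["voice_id"])
```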
## Generation & Concurrency Limits All our models support up to 10k characters (\~10 minutes of audio) in a single request. To achieve consistency over long form audio, try [request stitching](https://elevenlabs.io/docs/api-reference/how-to-use-request-stitching). The concurrency limit (the maximum number of concurrent requests you can run in parallel) depends on the tier you are on. * Free: 2 * Starter: 3 * Creator: 5 * Pro: 10 * Scale: 15 * Business: 15 If you need a higher limit, reach out to our [Enterprise team](https://elevenlabs.io/enterprise) to discuss a custom plan. ## Supported languages Our TTS API is multilingual and currently supports the following languages: `Chinese, Korean, Dutch, Turkish, Swedish, Indonesian, Filipino, Japanese, Ukrainian, Greek, Czech, Finnish, Romanian, Russian, Danish, Bulgarian, Malay, Slovak, Croatian, Classic Arabic, Tamil, English, Polish, German, Spanish, French, Italian, Hindi, Portuguese, Hungarian, Vietnamese and Norwegian`. To use them, simply provide the input text in the language of your choice. Dig into the details of using the ElevenLabs TTS API. Learn how to use our API with websockets. A great place to ask questions and get help from the community. Learn how to integrate ElevenLabs into your workflow. *** # Text To Speech With Timestamps post /v1/text-to-speech/{voice_id}/with-timestamps Converts text into audio together with timestamps on when which word was spoken. *** ## Audio generation You can generate audio together with information on when which character was spoken using the following script: ```python import requests import json import base64 VOICE_ID = "21m00Tcm4TlvDq8ikWAM" # Rachel YOUR_XI_API_KEY = "ENTER_YOUR_API_KEY_HERE" url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/with-timestamps" headers = { "Content-Type": "application/json", "xi-api-key": YOUR_XI_API_KEY } data = { "text": ( "Born and raised in the charming south, " "I can add a touch of sweet southern hospitality " "to your audiobooks and podcasts" ), "model_id": "eleven_multilingual_v2", "voice_settings": { "stability": 0.5, "similarity_boost": 0.75 } } response = requests.post( url, json=data, headers=headers, ) if response.status_code != 200: print(f"Error encountered, status: {response.status_code}, " f"content: {response.text}") quit() # convert the response which contains bytes into a JSON string from utf-8 encoding json_string = response.content.decode("utf-8") # parse the JSON string and load the data as a dictionary response_dict = json.loads(json_string) # the "audio_base64" entry in the dictionary contains the audio as a base64 encoded string, # we need to decode it into bytes in order to save the audio as a file audio_bytes = base64.b64decode(response_dict["audio_base64"]) with open('output.mp3', 'wb') as f: f.write(audio_bytes) # the 'alignment' entry contains the mapping between input characters and their timestamps print(response_dict['alignment']) ``` This prints out a dictionary like: ```python { 'characters': ['B', 'o', 'r', 'n', ' ', 'a', 'n', 'd', ' ', 'r', 'a', 'i', 's', 'e', 'd', ' ', 'i', 'n', ' ', 't', 'h', 'e', ' ', 'c', 'h', 'a', 'r', 'm', 'i', 'n', 'g', ' ', 's', 'o', 'u', 't', 'h', ',', ' ', 'I', ' ', 'c', 'a', 'n', ' ', 'a', 'd', 'd', ' ', 'a', ' ', 't', 'o', 'u', 'c', 'h', ' ', 'o', 'f', ' ', 's', 'w', 'e', 'e', 't', ' ', 's', 'o', 'u', 't', 'h', 'e', 'r', 'n', ' ', 'h', 'o', 's', 'p', 'i', 't', 'a', 'l', 'i', 't', 'y', ' ', 't', 'o', ' ', 'y', 'o', 'u', 'r', ' ', 'a', 'u', 'd', 'i', 'o', 'b', 'o', 'o', 'k', 's', ' ', 
'a', 'n', 'd', ' ', 'p', 'o', 'd', 'c', 'a', 's', 't', 's'], 'character_start_times_seconds': [0.0, 0.186, 0.279, 0.348, 0.406, 0.441, 0.476, 0.499, 0.522, 0.58, 0.65, 0.72, 0.778, 0.824, 0.882, 0.906, 0.952, 0.975, 1.01, 1.045, 1.068, 1.091, 1.115, 1.149, 1.196, 1.254, 1.3, 1.358, 1.416, 1.474, 1.498, 1.521, 1.602, 1.66, 1.811, 1.869, 1.927, 1.974, 2.009, 2.043, 2.067, 2.136, 2.183, 2.218, 2.252, 2.287, 2.322, 2.357, 2.392, 2.426, 2.45, 2.508, 2.531, 2.589, 2.635, 2.682, 2.717, 2.763, 2.786, 2.81, 2.879, 2.937, 3.007, 3.065, 3.123, 3.17, 3.239, 3.286, 3.367, 3.402, 3.437, 3.46, 3.483, 3.529, 3.564, 3.599, 3.634, 3.68, 3.75, 3.82, 3.889, 3.971, 4.087, 4.168, 4.214, 4.272, 4.331, 4.389, 4.412, 4.447, 4.528, 4.551, 4.574, 4.609, 4.644, 4.702, 4.748, 4.807, 4.865, 4.923, 5.016, 5.074, 5.12, 5.155, 5.201, 5.248, 5.283, 5.306, 5.329, 5.352, 5.41, 5.457, 5.573, 5.654, 5.735, 5.886, 5.944, 6.06], 'character_end_times_seconds': [0.186, 0.279, 0.348, 0.406, 0.441, 0.476, 0.499, 0.522, 0.58, 0.65, 0.72, 0.778, 0.824, 0.882, 0.906, 0.952, 0.975, 1.01, 1.045, 1.068, 1.091, 1.115, 1.149, 1.196, 1.254, 1.3, 1.358, 1.416, 1.474, 1.498, 1.521, 1.602, 1.66, 1.811, 1.869, 1.927, 1.974, 2.009, 2.043, 2.067, 2.136, 2.183, 2.218, 2.252, 2.287, 2.322, 2.357, 2.392, 2.426, 2.45, 2.508, 2.531, 2.589, 2.635, 2.682, 2.717, 2.763, 2.786, 2.81, 2.879, 2.937, 3.007, 3.065, 3.123, 3.17, 3.239, 3.286, 3.367, 3.402, 3.437, 3.46, 3.483, 3.529, 3.564, 3.599, 3.634, 3.68, 3.75, 3.82, 3.889, 3.971, 4.087, 4.168, 4.214, 4.272, 4.331, 4.389, 4.412, 4.447, 4.528, 4.551, 4.574, 4.609, 4.644, 4.702, 4.748, 4.807, 4.865, 4.923, 5.016, 5.074, 5.12, 5.155, 5.201, 5.248, 5.283, 5.306, 5.329, 5.352, 5.41, 5.457, 5.573, 5.654, 5.735, 5.886, 5.944, 6.06, 6.548] } ``` As you can see this dictionary contains three lists of the same size. For example response\_dict\['alignment']\['characters']\[3] contains the fourth character in the text you provided 'n', response\_dict\['alignment']\['character\_start\_times\_seconds']\[3] and response\_dict\['alignment']\['character\_end\_times\_seconds']\[3] contain its start (0.348 seconds) and end (0.406 seconds) timestamps. # Generate Voice Previews From Description post /v1/text-to-voice/create-previews Generate custom voice previews based on provided voice description. The response includes a list of voice previews, each containing an id and a sample of the voice audio. If you like the voice preview and want to create a permanent voice, call `/v1/text-to-voice/create-voice-from-preview` with the corresponding voice id. Follow our [Voice Design Prompt Guide](/product/voices/voice-lab/voice-design#voice-design-prompt-guide) for best results. When you hit generate, we'll create three voice previews. You will be charged credits equal to the length of the text you submit (you are charged this amount once per call, even though you receive three voice previews). "Text" should be no less than 100 characters and no more than 1k characters. # Create Voice From Voice Preview post /v1/text-to-voice/create-voice-from-preview Create a new voice from previously generated voice preview. This endpoint should be called after you fetched a `generated_voice_id` using `/v1/text-to-voice/create-previews`. # Update Pronunciation Dictionaries post /v1/projects/{project_id}/update-pronunciation-dictionaries Updates the set of pronunciation dictionaries acting on a project. This will automatically mark text within this project as requiring reconverting where the new dictionary would apply or the old one no longer does. 
The Projects API is available upon request. To get access, [contact sales](https://elevenlabs.io/contact-sales). You can use the Pronunciation Dictionaries API to add a pronunciation dictionary from a file in order to get a valid ID.

# Get Characters Usage Metrics
get /v1/usage/character-stats
Returns the credit usage metrics for the current user or the entire workspace they are part of. The response will return a time axis with Unix timestamps for each day and daily usage along that axis. The usage will be broken down by the specified breakdown type. For example, breakdown type "voice" will return the usage of each voice along the time axis.

# Websockets
This API provides real-time [text-to-speech](https://elevenlabs.io/text-to-speech) conversion using WebSockets. This allows you to send a text message and receive audio data back in real time.

Endpoint:
`wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}`
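A minimal connection sketch for this endpoint is shown below; fuller, runnable examples follow later on this page. It assumes the `websockets` Python package and follows the protocol described in the sections below: an initial single-space message carrying the API key, the text to synthesize, then the empty-string end-of-sequence message.

```python
import asyncio
import json
import websockets

async def minimal_tts_stream(voice_id: str, model_id: str, api_key: str, text: str):
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model_id}"
    async with websockets.connect(uri) as ws:
        # First message: a single space plus the API key (see "Streaming input text" below).
        await ws.send(json.dumps({"text": " ", "xi_api_key": api_key}))
        # Send the text to synthesize (note the trailing space), then the empty-string EOS message.
        await ws.send(json.dumps({"text": text + " "}))
        await ws.send(json.dumps({"text": ""}))
        # Read messages until the server marks the generation final.
        async for message in ws:
            data = json.loads(message)
            if data.get("audio"):
                print("received audio chunk")
            if data.get("isFinal"):
                break

# Example usage (placeholder IDs and key):
# asyncio.run(minimal_tts_stream("VOICE_ID_HERE", "eleven_turbo_v2_5", "API_KEY_HERE", "Hello world"))
```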
# When to use The Text-to-Speech Websockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the Websockets API isn't a one-size-fits-all solution. It's well-suited for scenarios where: * The input text is being streamed or generated in chunks. * Word-to-audio alignment information is required. For a practical demonstration in a real world application, refer to the [Example of voice streaming using ElevenLabs and OpenAI](#example-voice-streaming-using-elevenlabs-and-openai) section. # When not to use However, it may not be the best choice when: * The entire input text is available upfront. Given that the generations are partial, some buffering is involved, which could potentially result in slightly higher latency compared to a standard HTTP request. * You want to quickly experiment or prototype. Working with Websockets can be harder and more complex than using a standard HTTP API, which might slow down rapid development and testing. In these cases, use the [Text to Speech API](/api-reference/text-to-speech) instead. # Protocol The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects. # Streaming input text The client can send messages with text input to the server. The messages can contain the following fields: ```json { "text": "This is a sample text ", "voice_settings": { "stability": 0.8, "similarity_boost": 0.8 }, "generation_config": { "chunk_length_schedule": [120, 160, 250, 290] }, "xi_api_key": "", "authorization": "Bearer " } ``` Should always end with a single space string `" "`. In the first message, the text should be a space `" "`. This is an advanced setting that most users shouldn't need to use. It relates to our generation schedule explained [here](#understanding-how-our-websockets-buffer-text). Use this to attempt to immediately trigger the generation of audio, overriding the `chunk_length_schedule`. Unlike flush, `try_trigger_generation` will only generate audio if our [buffer](#understanding-how-our-websockets-buffer-text) contains more than a minimum threshold of characters, this is to ensure a higher quality response from our model. Note that overriding the chunk schedule to generate small amounts of text may result in lower quality audio, therefore, only use this parameter if you really need text to be processed immediately. We generally recommend keeping the default value of `false` and adjusting the `chunk_length_schedule` in the `generation_config` instead. This property should only be provided in the first message you send. Defines the stability for voice settings. Defines the similarity boost for voice settings. Defines the style for voice settings. This parameter is available on V2+ models. Defines the use speaker boost for voice settings. This parameter is available on V2+ models. This property should only be provided in the first message you send. This is an advanced setting that most users shouldn't need to use. It relates to our generation schedule explained [here](#understanding-how-our-websockets-buffer-text). Determines the minimum amount of text that needs to be sent and present in our buffer before audio starts being generated. This is to maximise the amount of context available to the model to improve audio quality, whilst balancing latency of the returned audio chunks. The default value is: \[120, 160, 250, 290]. 
This means that the first chunk of audio will not be generated until you send text that totals at least 120 characters long. The next chunk of audio will only be generated once a further 160 characters have been sent. The third audio chunk will be generated after the next 250 characters. Then the fourth, and beyond, will be generated in sets of at least 290 characters. Customize this array to suit your needs. If you want to generate audio more frequently to optimise latency, you can reduce the values in the array. Note that setting the values too low may result in lower quality audio. Please test and adjust as needed. Each item should be in the range 50-500. Flush forces the generation of audio. Set this value to `true` when you have finished sending text, but want to keep the websocket connection open. This is useful when you want to ensure that the last chunk of audio is generated even when the length of text sent is smaller than the value set in `chunk_length_schedule` (e.g. 120 or 50). To understand more about how our websockets buffer text before audio is generated, please refer to [this](#understanding-how-our-websockets-buffer-text) section. Provide the XI API Key in the first message if it's not in the header. Authorization bearer token. Should be provided only in the first message if not present in the header and the XI API Key is not provided. For best latency we recommend streaming word-by-word, this way we will start generating as soon as we reach the predefined number of un-generated characters. ## Close connection In order to close the connection, the client should send an End of Sequence (EOS) message. The EOS message should always be an empty string: ```json End of Sequence (EOS) message { "text": "" } ``` Should always be an empty string `""`. ## Streaming output audio The server will always respond with a message containing the following fields: ```json Response message { "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZSA6KQ==", "isFinal": false, "normalizedAlignment": { "charStartTimesMs": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21], "charDurationsMs": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3], "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"] }, "alignment": { "charStartTimesMs": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21], "charDurationsMs": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3], "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"] } } ``` A generated partial audio chunk, encoded using the selected output\_format, by default this is MP3 encoded as a base64 string. Indicates if the generation is complete. If set to `True`, `audio` will be null. Alignment information for the generated audio given the input normalized text sequence. A list of starting times (in milliseconds) for each character in the normalized text as it corresponds to the audio. For instance, the character 'H' starts at time 0 ms in the audio. Note these times are relative to the returned chunk from the model, and not the full audio response. See an example [here](#example-getting-word-start-times-using-alignment-values) for how to use this. A list providing the duration (in milliseconds) for each character's pronunciation in the audio. For instance, the character 'H' has a pronunciation duration of 3 ms. The list of characters in the normalized text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations. Alignment information for the generated audio given the original text sequence. 
A list of starting times (in milliseconds) for each character in the original text as it corresponds to the audio. For instance, the character 'H' starts at time 0 ms in the audio. Note these times are relative to the returned chunk from the model, and not the full audio response. See an example [here](#example-getting-word-start-times-using-alignment-values) for how to use this. A list providing the duration (in milliseconds) for each character's pronunciation in the audio. For instance, the character 'H' has a pronunciation duration of 3 ms. The list of characters in the original text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations. ## Path parameters Voice ID to be used, you can use [Get Voices](/api-reference/get-voices) to list all the available voices. ## Query parameters Identifier of the model that will be used, you can query them using [Get Models](/api-reference/get-models). Language code (ISO 639-1) used to enforce a language for the model. Currently only Turbo v2.5 supports language enforcement. For other models, an error will be returned if language code is provided. Whether to enable request logging, if disabled the request will not be present in history nor bigtable. Enabled by default. Note: simple logging (aka printing) to stdout/stderr is always enabled. Whether to enable/disable parsing of SSML tags within the provided text. For best results, we recommend sending SSML tags as fully contained messages to the websockets endpoint, otherwise this may result in additional latency. Please note that rendered text, in normalizedAlignment, will be altered in support of SSML tags. The rendered text will use a . as a placeholder for breaks, and syllables will be reported using the CMU arpabet alphabet where SSML phoneme tags are used to specify pronunciation. Disabled by default. You can turn on latency optimizations at some cost of quality. The best possible final latency varies by model. Possible values: | Value | Description | | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | | 0 | default mode (no latency optimizations) | | 1 | normal latency optimizations (about 50% of possible latency improvement of option 3) | | 2 | strong latency optimizations (about 75% of possible latency improvement of option 3) | | 3 | max latency optimizations | | 4 | max latency optimizations, but also with text normalizer turned off for even more latency savings (best latency, but can mispronounce eg numbers and dates). | Defaults to `0` Output format of the generated audio. Must be one of: | Value | Description | | ---------- | ------------------------------------------------------------------------------------------------------------- | | mp3\_44100 | default output format, mp3 with 44.1kHz sample rate | | pcm\_16000 | PCM format (S16LE) with 16kHz sample rate | | pcm\_22050 | PCM format (S16LE) with 22.05kHz sample rate | | pcm\_24000 | PCM format (S16LE) with 24kHz sample rate | | pcm\_44100 | PCM format (S16LE) with 44.1kHz sample rate | | ulaw\_8000 | μ-law format (mulaw) with 8kHz sample rate. (Note that this format is commonly used for Twilio audio inputs.) | Defaults to `mp3_44100` The number of seconds that the connection can be inactive before it is automatically closed. Defaults to `20` seconds, with a maximum allowed value of `180` seconds. 
The audio for each text sequence is delivered in multiple chunks. By default when it's set to false, you'll receive all timing data (alignment information) with the first chunk only. However, if you enable this option, you'll get the timing data with every audio chunk instead. This can help you precisely match each audio segment with its corresponding text. # Example - Voice streaming using ElevenLabs and OpenAI The following example demonstrates how to leverage the ElevenLabs Websockets API to stream input from OpenAI's GPT model, while the answer is being generated, thereby minimizing the overall latency of the operation. ```python import asyncio import websockets import json import base64 import shutil import os import subprocess from openai import AsyncOpenAI # Define API keys and voice ID OPENAI_API_KEY = '' ELEVENLABS_API_KEY = '' VOICE_ID = '21m00Tcm4TlvDq8ikWAM' # Set OpenAI API key aclient = AsyncOpenAI(api_key=OPENAI_API_KEY) def is_installed(lib_name): return shutil.which(lib_name) is not None async def text_chunker(chunks): """Split text into chunks, ensuring to not break sentences.""" splitters = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ") buffer = "" async for text in chunks: if buffer.endswith(splitters): yield buffer + " " buffer = text elif text.startswith(splitters): yield buffer + text[0] + " " buffer = text[1:] else: buffer += text if buffer: yield buffer + " " async def stream(audio_stream): """Stream audio data using mpv player.""" if not is_installed("mpv"): raise ValueError( "mpv not found, necessary to stream audio. " "Install instructions: https://mpv.io/installation/" ) mpv_process = subprocess.Popen( ["mpv", "--no-cache", "--no-terminal", "--", "fd://0"], stdin=subprocess.PIPE, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, ) print("Started streaming audio") async for chunk in audio_stream: if chunk: mpv_process.stdin.write(chunk) mpv_process.stdin.flush() if mpv_process.stdin: mpv_process.stdin.close() mpv_process.wait() async def text_to_speech_input_streaming(voice_id, text_iterator): """Send text to ElevenLabs API and stream the returned audio.""" uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_turbo_v2_5" async with websockets.connect(uri) as websocket: await websocket.send(json.dumps({ "text": " ", "voice_settings": {"stability": 0.5, "similarity_boost": 0.8}, "xi_api_key": ELEVENLABS_API_KEY, })) async def listen(): """Listen to the websocket for audio data and stream it.""" while True: try: message = await websocket.recv() data = json.loads(message) if data.get("audio"): yield base64.b64decode(data["audio"]) elif data.get('isFinal'): break except websockets.exceptions.ConnectionClosed: print("Connection closed") break listen_task = asyncio.create_task(stream(listen())) async for text in text_chunker(text_iterator): await websocket.send(json.dumps({"text": text})) await websocket.send(json.dumps({"text": ""})) await listen_task async def chat_completion(query): """Retrieve text from OpenAI and pass it to the text-to-speech function.""" response = await aclient.chat.completions.create(model='gpt-4', messages=[{'role': 'user', 'content': query}], temperature=1, stream=True) async def text_iterator(): async for chunk in response: delta = chunk.choices[0].delta yield delta.content await text_to_speech_input_streaming(VOICE_ID, text_iterator()) # Main execution if __name__ == "__main__": user_query = "Hello, tell me a very long story." 
asyncio.run(chat_completion(user_query)) ``` # Example - Other examples for interacting with our Websocket API Some examples for interacting with the Websocket API in different ways are provided below ```python Python websockets and asyncio import asyncio import websockets import json import base64 async def text_to_speech(voice_id): model = 'eleven_turbo_v2_5' uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}" async with websockets.connect(uri) as websocket: # Initialize the connection bos_message = { "text": " ", "voice_settings": { "stability": 0.5, "similarity_boost": 0.8 }, "xi_api_key": "api_key_here", # Replace with your API key } await websocket.send(json.dumps(bos_message)) # Send "Hello World" input input_message = { "text": "Hello World " } await websocket.send(json.dumps(input_message)) # Send EOS message with an empty string instead of a single space # as mentioned in the documentation eos_message = { "text": "" } await websocket.send(json.dumps(eos_message)) # Added a loop to handle server responses and print the data received while True: try: response = await websocket.recv() data = json.loads(response) print("Server response:", data) if data["audio"]: chunk = base64.b64decode(data["audio"]) print("Received audio chunk") else: print("No audio data in the response") break except websockets.exceptions.ConnectionClosed: print("Connection closed") break asyncio.get_event_loop().run_until_complete(text_to_speech("voice_id_here")) ``` ```javascript Javascript websockets const voiceId = "voice_id_here"; // replace with your voice_id const model = 'eleven_turbo_v2_5'; const wsUrl = `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input?model_id=${model}`; const socket = new WebSocket(wsUrl); // 2. Initialize the connection by sending the BOS message socket.onopen = function (event) { const bosMessage = { "text": " ", "voice_settings": { "stability": 0.5, "similarity_boost": 0.8 }, "xi_api_key": "api_key_here", // replace with your API key }; socket.send(JSON.stringify(bosMessage)); // 3. Send the input text message ("Hello World") const textMessage = { "text": "Hello World " }; socket.send(JSON.stringify(textMessage)); // 4. Send the EOS message with an empty string const eosMessage = { "text": "" }; socket.send(JSON.stringify(eosMessage)); }; // 5. 
Handle server responses socket.onmessage = function (event) { const response = JSON.parse(event.data); console.log("Server response:", response); if (response.audio) { // decode and handle the audio data (e.g., play it) const audioChunk = atob(response.audio); // decode base64 console.log("Received audio chunk"); } else { console.log("No audio data in the response"); } if (response.isFinal) { // the generation is complete } if (response.normalizedAlignment) { // use the alignment info if needed } }; // Handle errors socket.onerror = function (error) { console.error(`WebSocket Error: ${error}`); }; // Handle socket closing socket.onclose = function (event) { if (event.wasClean) { console.info(`Connection closed cleanly, code=${event.code}, reason=${event.reason}`); } else { console.warn('Connection died'); } }; ``` ```python elevenlabs-python from elevenlabs import generate, stream def text_stream(): yield "Hi there, I'm Eleven " yield "I'm a text to speech API " audio_stream = generate( text=text_stream(), voice="Nicole", model="eleven_turbo_v2_5", stream=True ) stream(audio_stream) ``` # Example - Getting word start times using alignment values This code example shows how the start times of words can be retrieved using the alignment values returned from our API. ```python import asyncio import websockets import json import base64 # Define API keys and voice ID ELEVENLABS_API_KEY = "INSERT HERE" <- INSERT YOUR API KEY HERE VOICE_ID = 'nPczCjzI2devNBz1zQrb' #Brian def calculate_word_start_times(alignment_info): # Alignment start times are indexed from the start of the audio chunk that generated them # In order to analyse runtime over the entire response we keep a cumulative count of played audio full_alignment = {'chars': [], 'charStartTimesMs': [], 'charDurationsMs': []} cumulative_run_time = 0 for old_dict in alignment_info: full_alignment['chars'].extend([" "] + old_dict['chars']) full_alignment['charDurationsMs'].extend([old_dict['charStartTimesMs'][0]] + old_dict['charDurationsMs']) full_alignment['charStartTimesMs'].extend([0] + [time+cumulative_run_time for time in old_dict['charStartTimesMs']]) cumulative_run_time += sum(old_dict['charDurationsMs']) # We now have the start times of every character relative to the entire audio output zipped_start_times = list(zip(full_alignment['chars'], full_alignment['charStartTimesMs'])) # Get the start time of every character that appears after a space and match this to the word words = ''.join(full_alignment['chars']).split(" ") word_start_times = list(zip(words, [0] + [zipped_start_times[i+1][1] for (i, (a,b)) in enumerate(zipped_start_times) if a == ' '])) print(f"total duration:{cumulative_run_time}") print(word_start_times) async def text_to_speech_alignment_example(voice_id, text_to_send): """Send text to ElevenLabs API and stream the returned audio and alignment information.""" uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_turbo_v2_5" async with websockets.connect(uri) as websocket: await websocket.send(json.dumps({ "text": " ", "voice_settings": {"stability": 0.5, "similarity_boost": 0.8, "use_speaker_boost": False}, "generation_config": { "chunk_length_schedule": [120, 160, 250, 290] }, "xi_api_key": ELEVENLABS_API_KEY, })) async def text_iterator(text): """Split text into chunks to mimic streaming from an LLM or similar""" split_text = text.split(" ") words = 0 to_send = "" for chunk in split_text: to_send += chunk + ' ' words += 1 if words >= 10: print(to_send) yield to_send words = 0 to_send 
= "" yield to_send async def listen(): """Listen to the websocket for audio data and write it to a file.""" audio_chunks = [] alignment_info = [] received_final_chunk = False print("Listening for chunks from ElevenLabs...") while not received_final_chunk: try: message = await websocket.recv() data = json.loads(message) if data.get("audio"): audio_chunks.append(base64.b64decode(data["audio"])) if data.get("alignment"): alignment_info.append(data.get("alignment")) if data.get('isFinal'): received_final_chunk = True break except websockets.exceptions.ConnectionClosed: print("Connection closed") break print("Writing audio to file") with open("output_file.mp3", "wb") as f: f.write(b''.join(audio_chunks)) calculate_word_start_times(alignment_info) listen_task = asyncio.create_task(listen()) async for text in text_iterator(text_to_send): await websocket.send(json.dumps({"text": text})) await websocket.send(json.dumps({"text": " ", "flush": True})) await listen_task # Main execution if __name__ == "__main__": text_to_send = "The twilight sun cast its warm golden hues upon the vast rolling fields, saturating the landscape with an ethereal glow." asyncio.run(text_to_speech_alignment_example(VOICE_ID, text_to_send)) ``` # Understanding how our websockets buffer text Our websocket service incorporates a buffer system designed to optimize the Time To First Byte (TTFB) while maintaining high-quality streaming. All text sent to the websocket endpoint is added to this buffer and only when that buffer reaches a certain size is an audio generation attempted. This is because our model provides higher quality audio when the model has longer inputs, and can deduce more context about how the text should be delivered. The buffer ensures smooth audio data delivery and is automatically emptied with a final audio generation either when the stream is closed, or upon sending a `flush` command. We have advanced settings for changing the chunk schedule, which can improve latency at the cost of quality by generating audio more frequently with smaller text inputs. # Delete Existing Invitation delete /v1/workspace/invites Invalidates an existing email invitation. The invitation will still show up in the inbox it has been delivered to, but activating it to join the workspace won't work. This endpoint may only be called by workspace administrators. Workspaces are currently only available for Enterprise customers. To upgrade, [get in touch with our sales team](https://elevenlabs.io/enterprise). # Invite User post /v1/workspace/invites/add Sends an email invitation to join your workspace to the provided email. If the user doesn't have an account they will be prompted to create one. If the user accepts this invite they will be added as a user to your workspace and your subscription using one of your seats. This endpoint may only be called by workspace administrators. Workspaces are currently only available for Enterprise customers. To upgrade, [get in touch with our sales team](https://elevenlabs.io/enterprise). # Update Member post /v1/workspace/members Updates attributes of a workspace member. Apart from the email identifier, all parameters will remain unchanged unless specified. This endpoint may only be called by workspace administrators. Workspaces are currently only available for Enterprise customers. To upgrade, [get in touch with our sales team](https://elevenlabs.io/enterprise). 
# Product Updates

New updates and improvements

## API Updates

* **u-law Audio Formats**: Added u-law audio formats to the Convai API for integrations with Twilio.
* **TTS Websocket Improvements**: Flushes and generation now work more intuitively.
* **TTS Websocket Auto Mode**: A more streamlined mode for using websockets. This setting focuses on reducing latency by disabling the chunk schedule and all buffers. It is only recommended when sending full sentences; sending partial sentences will result in significantly reduced quality.
* **Improvements to latency consistency**: Improvements to latency consistency for all models.

## Website Updates

* **TTS Redesign**: The website TTS redesign is now in alpha!

## API Updates

* **Normalize Text with the API**: Added the option to normalize the input text in the TTS API. The new parameter is called `apply_text_normalization` and works on all non-turbo models.

## Feature Additions

* **Voice Design**: The Voice Design feature is now in beta!

## Model Updates

* **Stability Improvements**: Significant improvements in the audio stability of all models, especially noticeable on `turbo_v2` and `turbo_v2.5`, when using:
  * Websockets
  * Projects
  * Reader app
  * TTS with request stitching
  * ConvAI
* **Latency Improvements**: Time to first byte latency improvements of around 20-30ms for all models.

## API Updates

* **Remove Background Noise Voice Samples**: Added the ability to remove background noise from voice samples using our audio isolation model to improve quality for IVCs and PVCs at no additional cost.
* **Remove Background Noise STS Input**: Added the ability to remove background noise from STS audio input using our audio isolation model to improve quality at no additional cost.

## Feature Additions

* **Conversational AI Beta**: The conversational AI feature is now in beta!

# Delete Agent delete /v1/convai/agents/{agent_id} Delete an agent

# Get Agent get /v1/convai/agents/{agent_id} Retrieve config for an agent

# Get Agents get /v1/convai/agents Returns a page of your agents and their metadata.

# Get Conversations get /v1/convai/conversations Get all conversations of agents that the user owns, with the option to restrict to a specific agent.

# Get Conversation Audio get /v1/convai/conversations/{conversation_id}/audio Get the audio recording of a particular conversation

# Get Conversation Details get /v1/convai/conversations/{conversation_id} Get the details of a particular conversation

# Get Knowledge Base Document get /v1/convai/agents/{agent_id}/knowledge-base/{documentation_id} Get details about a specific documentation item that makes up the agent's knowledge base

# Get Signed URL get /v1/convai/conversation/get_signed_url Get a signed URL to start a conversation with an agent that requires authorization

# Get Widget get /v1/convai/agents/{agent_id}/widget Retrieve the widget configuration for an agent

# Update Agent patch /v1/convai/agents/{agent_id} Patches an agent's settings

# Create Agent post /v1/convai/agents/create Create an agent from a config object

# Create Knowledge Base Document post /v1/convai/agents/{agent_id}/add-to-knowledge-base Uploads a file or references a webpage for the agent to use as part of its knowledge base

# Create Agent Avatar post /v1/convai/agents/{agent_id}/avatar Sets the avatar for an agent displayed in the widget
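As a quick, hedged illustration of calling one of the Conversational AI endpoints listed above over plain HTTP (only the path and the standard `xi-api-key` header come from this reference; query parameters and the response schema are not documented here, and the `requests` client is simply used for illustration):

```python
import os
import requests

# List your Conversational AI agents and their metadata.
resp = requests.get(
    "https://api.elevenlabs.io/v1/convai/agents",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
)
resp.raise_for_status()
print(resp.json())
```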
# WebSocket

Create real-time, interactive voice conversations with AI agents

This documentation is for developers integrating directly with the ElevenLabs WebSocket API. For convenience, consider using [the official SDKs provided by ElevenLabs](/conversational-ai/docs/introduction).

The ElevenLabs [Conversational AI](https://elevenlabs.io/conversational-ai) WebSocket API enables real-time, interactive voice conversations with AI agents. By establishing a WebSocket connection, you can send audio input and receive audio responses in real time, creating life-like conversational experiences.

Endpoint: `wss://api.elevenlabs.io/v1/convai/conversation?agent_id={agent_id}`

## Authentication

### Using Agent ID

For public agents, you can directly use the `agent_id` in the WebSocket URL without additional authentication:

```bash
wss://api.elevenlabs.io/v1/convai/conversation?agent_id=
```

### Using a Signed URL

For private agents or conversations requiring authorization, obtain a signed URL from your server, which securely communicates with the ElevenLabs API using your API key.

### Example using cURL

**Request:**

```bash
curl -X GET "https://api.elevenlabs.io/v1/convai/conversation/get_signed_url?agent_id=" \
     -H "xi-api-key: "
```

**Response:**

```json
{
  "signed_url": "wss://api.elevenlabs.io/v1/convai/conversation?agent_id=&token="
}
```

Never expose your ElevenLabs API key on the client side.

## Communication

### Client-to-Server Messages

#### User Audio Chunk

Send audio data from the user to the server.

**Format:**

```json
{
  "user_audio_chunk": ""
}
```

**Notes:**

* **Audio Format Requirements:**
  * PCM 16-bit mono format
  * Base64 encoded
  * Sample rate of 16,000 Hz
* **Recommended Chunk Duration:**
  * Send audio chunks approximately every **250 milliseconds (0.25 seconds)**
  * This equates to chunks of about **4,000 samples** at a 16,000 Hz sample rate
* **Optimizing Latency and Efficiency:**
  * **Balance Latency and Efficiency:** Sending audio chunks every 250 milliseconds offers a good trade-off between responsiveness and network overhead.
  * **Adjust Based on Needs:**
    * *Lower Latency Requirements:* Decrease the chunk duration to send smaller chunks more frequently.
    * *Higher Efficiency Requirements:* Increase the chunk duration to send larger chunks less frequently.
  * **Network Conditions:** Adapt the chunk size if you experience network constraints or variability.

#### Pong Message

Respond to server `ping` messages by sending a `pong` message, ensuring the `event_id` matches the one received in the `ping` message.

**Format:**

```json
{
  "type": "pong",
  "event_id": 12345
}
```
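Below is a minimal, hedged sketch of a client loop built only from the message shapes documented on this page: it streams already-captured 16 kHz PCM audio as `user_audio_chunk` messages and answers each server `ping` with the matching `pong`. The agent ID, pacing, and playback handling are illustrative assumptions; a production client would typically use the official SDKs mentioned above.

```python
import asyncio
import base64
import json
import websockets

AGENT_ID = "your-agent-id"  # placeholder; private agents should use a signed URL instead

async def run_conversation(pcm_chunks):
    """pcm_chunks: an iterable of ~250 ms chunks of raw 16 kHz, 16-bit mono PCM bytes."""
    uri = f"wss://api.elevenlabs.io/v1/convai/conversation?agent_id={AGENT_ID}"
    async with websockets.connect(uri) as ws:

        async def send_audio():
            for chunk in pcm_chunks:
                await ws.send(json.dumps({"user_audio_chunk": base64.b64encode(chunk).decode()}))
                await asyncio.sleep(0.25)  # roughly the recommended 250 ms pacing

        async def receive():
            async for message in ws:
                data = json.loads(message)
                if data.get("type") == "ping":
                    # Echo the event_id back so the server can measure round-trip latency.
                    await ws.send(json.dumps({"type": "pong", "event_id": data["ping_event"]["event_id"]}))
                elif data.get("type") == "audio":
                    agent_audio = base64.b64decode(data["audio_event"]["audio_base_64"])
                    # Play or buffer the agent's audio here.

        await asyncio.gather(send_audio(), receive())
```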
### Server-to-Client Messages

#### conversation\_initiation\_metadata

Provides initial metadata about the conversation.

**Format:**

```json
{
  "type": "conversation_initiation_metadata",
  "conversation_initiation_metadata_event": {
    "conversation_id": "conv_123456789",
    "agent_output_audio_format": "pcm_16000"
  }
}
```

### Other Server-to-Client Messages

| Type             | Purpose                                              |
| ---------------- | ---------------------------------------------------- |
| user\_transcript | Transcriptions of the user's speech                  |
| agent\_response  | Agent's textual response                             |
| audio            | Chunks of the agent's audio response                 |
| interruption     | Indicates that the agent's response was interrupted  |
| ping             | Server pings to measure latency                      |

##### Message Formats

**user\_transcript:**

```json
{
  "type": "user_transcript",
  "user_transcription_event": {
    "user_transcript": "Hello, how are you today?"
  }
}
```

**agent\_response:**

```json
{
  "type": "agent_response",
  "agent_response_event": {
    "agent_response": "Hello! I'm doing well, thank you for asking. How can I assist you today?"
  }
}
```

**audio:**

```json
{
  "type": "audio",
  "audio_event": {
    "audio_base_64": "SGVsbG8sIHRoaXMgaXMgYSBzYW1wbGUgYXVkaW8gY2h1bms=",
    "event_id": 67890
  }
}
```

**interruption:**

```json
{
  "type": "interruption",
  "interruption_event": {
    "event_id": 54321
  }
}
```

**internal\_tentative\_agent\_response:**

```json
{
  "type": "internal_tentative_agent_response",
  "tentative_agent_response_internal_event": {
    "tentative_agent_response": "I'm thinking about how to respond..."
  }
}
```

**ping:**

```json
{
  "type": "ping",
  "ping_event": {
    "event_id": 13579,
    "ping_ms": 50
  }
}
```

## Latency Management

To ensure smooth conversations, implement these strategies:

* **Adaptive Buffering:** Adjust audio buffering based on network conditions.
* **Jitter Buffer:** Implement a jitter buffer to smooth out variations in packet arrival times.
* **Ping-Pong Monitoring:** Use ping and pong events to measure round-trip time and adjust accordingly.

## Security Best Practices

* Rotate API keys regularly and use environment variables to store them.
* Implement rate limiting to prevent abuse.
* Clearly explain the intention when prompting users for microphone access.
* Tune the audio chunk duration to balance latency and efficiency.

## Additional Resources

* [ElevenLabs Conversational AI Documentation](https://elevenlabs.io/docs/conversational-ai/overview)
* [ElevenLabs Conversational AI SDKs](https://elevenlabs.io/docs/conversational-ai/client-sdk)

# Custom LLM Integration

Guide for using your own LLM or server with ElevenLabs SDK.

## Using Your Own OpenAI Key for LLM

To integrate a custom OpenAI key, create a secret containing your OPENAI\_API\_KEY:

1. Navigate to the "Secrets" page and select "Add Secret".
2. Choose "Custom LLM" from the dropdown menu.
3. Enter the URL, your model, and the secret you created.

## Custom LLM Server

To bring your own LLM server, set up a compatible server endpoint following OpenAI's API style, specifically targeting `create_chat_completion`.
Here's an example server implementation using FastAPI and OpenAI's Python SDK:

```python
import json
import os
import fastapi
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
import uvicorn
import logging
from dotenv import load_dotenv
from pydantic import BaseModel
from typing import List, Optional

# Load environment variables from .env file
load_dotenv()

# Retrieve API key from environment
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not found in environment variables")

app = fastapi.FastAPI()
oai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)

class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    messages: List[Message]
    model: str
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False
    user_id: Optional[str] = None

@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest) -> StreamingResponse:
    oai_request = request.dict(exclude_none=True)
    if "user_id" in oai_request:
        oai_request["user"] = oai_request.pop("user_id")

    chat_completion_coroutine = await oai_client.chat.completions.create(**oai_request)

    async def event_stream():
        try:
            async for chunk in chat_completion_coroutine:
                # Each chunk is a pydantic model, so serialize it explicitly before sending.
                yield f"data: {chunk.model_dump_json()}\n\n"
            yield "data: [DONE]\n\n"
        except Exception as e:
            logging.error("An error occurred: %s", str(e))
            yield f"data: {json.dumps({'error': 'Internal error occurred!'})}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8013)
```

Run this code or your own server code.

### Setting Up a Public URL for Your Server

To make your server accessible, create a public URL using a tunneling tool like ngrok:

```shell
ngrok http --url=.ngrok.app 8013
```

### Configuring ElevenLabs CustomLLM

Now let's make the changes in ElevenLabs:

1. Point your server URL to the ngrok endpoint.
2. Set "Limit token usage" to 5000.

You can now start interacting with Conversational AI using your own LLM server.

# Knowledge Base

Learn how to enhance your conversational agent with custom knowledge

Knowledge bases allow you to provide additional context to your conversational agent beyond its base LLM knowledge. Non-enterprise users can add up to 5 files/links (max 20MB, 300,000 characters total).

## Adding Knowledge Items

There are three options for enhancing your conversational agent's knowledge:

### 1. File Upload

*File upload interface showing supported formats (PDF, TXT, DOCX, HTML, EPUB) with a 21MB size limit*

### 2. URL Import

*URL import interface where users can paste documentation links*

Ensure you have permission to use the content from the URLs you provide.

### 3. Direct Text Input

*Text input interface where users can name and add custom content*

## Best Practices

* Provide clear, well-structured information that's relevant to your agent's purpose.
* Break large documents into smaller, focused pieces for better processing.

## Enterprise Features

Need higher limits? Contact our sales team to discuss enterprise plans with expanded knowledge base capabilities.
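As a hedged sketch of adding a URL-based knowledge item programmatically through the Create Knowledge Base Document endpoint listed earlier (the endpoint path and `xi-api-key` header come from this reference, but the `url` form field name is an assumption for illustration only):

```python
import os
import requests

AGENT_ID = "your-agent-id"  # placeholder

# Hypothetical call; the "url" field name is assumed, not confirmed by this reference.
resp = requests.post(
    f"https://api.elevenlabs.io/v1/convai/agents/{AGENT_ID}/add-to-knowledge-base",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    data={"url": "https://example.com/docs/getting-started"},
)
print(resp.status_code, resp.text)
```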
# Tools

Provide your agent with real-time information and the ability to take action in third-party apps with external function calls.

Tools allow you to make external function calls to third-party apps so you can get real-time information.

You might use tools to:

* Schedule appointments and manage availability on someone's calendar
* Book restaurant reservations and manage dining arrangements
* Create or update customer records in a CRM system
* Get inventory data to make product recommendations

To help you get started with Tools, we'll walk through an "AI receptionist" we created by integrating with the Cal.com API.