WebSockets
The Text-to-Speech WebSockets API generates audio from partial text input while keeping the generated audio consistent across chunks. It is flexible, but it is not a one-size-fits-all solution. It is well suited to scenarios where:
- The input text is being streamed or generated in chunks.
- Word-to-audio alignment information is required.
However, it may not be the best choice when:
- The entire input text is available upfront. Because generation works on partial input, some buffering is involved, which can result in slightly higher latency than a standard HTTP request.
- You want to experiment or prototype quickly. Working with WebSockets is more complex than using a standard HTTP API, which can slow down development and testing.
Handshake
GET /v1/text-to-speech/:voice_id/stream-input
Headers
xi-api-key
Path parameters
voice_id
The unique identifier for the voice to use in the TTS process.
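As a rough illustration of the handshake, the sketch below assembles the WebSocket URL and the `xi-api-key` header from the path and parameter named above. The host in `BASE_URL` is a placeholder (this section does not state one), and `build_handshake` is a hypothetical helper name, not part of any SDK.

```python
from urllib.parse import quote

# Assumption: this section does not give the host, so this is a placeholder.
BASE_URL = "wss://api.example.com"


def build_handshake(voice_id: str, api_key: str) -> tuple[str, dict[str, str]]:
    """Return the WebSocket URL and headers for the stream-input handshake."""
    if not voice_id:
        raise ValueError("voice_id is required")
    # Percent-encode the path parameter so unusual characters cannot alter the path.
    path = f"/v1/text-to-speech/{quote(voice_id, safe='')}/stream-input"
    headers = {"xi-api-key": api_key}
    return BASE_URL + path, headers
```

The returned URL and headers can then be passed to whichever WebSocket client you use to open the connection.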
Send
- Initialize Connection
- Send Text
- Close Connection
Receive
- Audio Output
- Final Output
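The message flow above (initialize, send text, close; then audio output followed by a final output) might be sketched as follows. The payload shapes are assumptions that this section does not spell out: a `text` field where a single space initializes the connection and an empty string closes it, and received frames carrying base64 `audio` or an `isFinal` flag. Check them against the message schemas before relying on them. `outgoing_frames` and `handle_frame` are hypothetical helper names.

```python
import base64
import json
from typing import Iterable, Iterator, Optional


def outgoing_frames(chunks: Iterable[str]) -> Iterator[str]:
    """Yield JSON frames in order: initialize, one per text chunk, close."""
    # Initialize Connection (assumed shape: a single-space text payload).
    yield json.dumps({"text": " "})
    for chunk in chunks:
        # Skip empty or whitespace-only chunks so they are not mistaken
        # for the initialize or close frames.
        if chunk.strip():
            yield json.dumps({"text": chunk})
    # Close Connection (assumed shape: an empty text payload).
    yield json.dumps({"text": ""})


def handle_frame(raw: str) -> Optional[bytes]:
    """Return decoded audio for an Audio Output frame, or None for Final Output."""
    message = json.loads(raw)
    if message.get("isFinal"):  # assumed field name for the final frame
        return None
    audio = message.get("audio")  # assumed to be base64-encoded audio
    return base64.b64decode(audio) if audio else b""
```

A client would send each frame from `outgoing_frames` over the open socket and pass every received frame to `handle_frame`, stopping when it returns `None`.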