WebSockets

The Text-to-Speech WebSockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the WebSockets API isn’t a one-size-fits-all solution. It’s well-suited for scenarios where:

  • The input text is being streamed or generated in chunks.
  • Word-to-audio alignment information is required.

However, it may not be the best choice when:

  • The entire input text is available upfront. Because audio is generated from partial input, the API buffers text before generating, which can result in slightly higher latency than a standard HTTP request.
  • You want to quickly experiment or prototype. WebSockets are more complex to work with than a standard HTTP API, which can slow down rapid development and testing.

Handshake

GET /v1/text-to-speech/:voice_id/stream-input

Headers

xi-api-key (string, optional)

Path parameters

voice_id (string, required)

The unique identifier for the voice to use in the TTS process.
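
As a minimal sketch of the handshake above, the following Python helper builds the connection URL and the optional header. The host `wss://api.elevenlabs.io` and the helper name `build_handshake` are assumptions for illustration; this section only specifies the path and the `xi-api-key` header.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import quote

# Assumption: the host is not stated in this section; substitute the real endpoint.
BASE_URL = "wss://api.elevenlabs.io"


def build_handshake(voice_id: str, api_key: Optional[str] = None) -> Tuple[str, Dict[str, str]]:
    """Return the WebSocket URL and headers for the stream-input handshake."""
    # Percent-encode the voice ID so it is safe to place in the path.
    path = f"/v1/text-to-speech/{quote(voice_id, safe='')}/stream-input"
    # xi-api-key is documented as optional, so only send it when provided.
    headers = {"xi-api-key": api_key} if api_key else {}
    return BASE_URL + path, headers
```

The returned URL and headers can then be passed to whichever WebSocket client you use to open the connection.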

Send

Initialize Connection (object, required)
OR
Send Text (object, required)
OR
Close Connection (object, required)
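
The order of the three send messages can be sketched as follows. The payload shapes are assumptions, since this section does not list the fields of each object: the sketch assumes every message is a JSON object with a `text` field, that a single space initializes the connection, and that an empty string closes it. The function name `outgoing_messages` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def outgoing_messages(chunks: Iterable[str]) -> Iterator[str]:
    """Yield serialized messages: initialize, one per text chunk, then close."""
    # Initialize Connection (assumed payload: a single space).
    yield json.dumps({"text": " "})
    for chunk in chunks:
        # Skip empty chunks so they are not mistaken for the close message.
        if chunk:
            # Send Text (assumed payload: the partial text).
            yield json.dumps({"text": chunk})
    # Close Connection (assumed payload: an empty string).
    yield json.dumps({"text": ""})
```

Each yielded string would be sent as one text frame over the open WebSocket, in order.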

Receive

Audio Output (object, required)
OR
Final Output (object, required)
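
A minimal sketch of handling the two receive messages follows. The field names are assumptions not confirmed by this section: Audio Output is assumed to carry base64-encoded audio in an `audio` field, and Final Output is assumed to set `isFinal` to true. The function name `collect_audio` is hypothetical.

```python
import base64
import json
from typing import Iterable


def collect_audio(messages: Iterable[str]) -> bytes:
    """Concatenate decoded audio from received messages until the final one."""
    audio = bytearray()
    for raw in messages:
        msg = json.loads(raw)
        # Audio Output (assumed field "audio", base64-encoded).
        if msg.get("audio"):
            audio.extend(base64.b64decode(msg["audio"]))
        # Final Output (assumed flag "isFinal"): stop reading.
        if msg.get("isFinal"):
            break
    return bytes(audio)
```

In a real client the `messages` iterable would be the stream of text frames received from the socket; here it is any iterable of JSON strings, which keeps the sketch testable without a network connection.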