Combine Multiple Generations
Learn how to keep your voice stable across multiple generations
What is Request Stitching?
When you convert a large text to audio by sending it in chunks without any further context, the prosody can change abruptly from one chunk to the next.
It is much better to give the model context about what has already been generated and what will be generated next. This is exactly what Request Stitching does.
As the samples below show, the difference between not using Request Stitching and using it is subtle but noticeable:
Without Request Stitching:
With Request Stitching:
Conditioning on text
We will use Pydub to concatenate the generated audio segments. You can install it with pip install pydub.
One of the two ways to give the model context is to provide the text before and/or after the current chunk using the previous_text and next_text parameters.
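Text conditioning can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the official client: the endpoint URL, the xi-api-key header, the default model_id, the MP3 output format and all helper names are assumptions made for illustration. Only previous_text and next_text come from the description above.

```python
import io
import json
import urllib.request

# Assumed endpoint; check the API reference for the actual URL.
API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def build_text_payloads(chunks, model_id="eleven_multilingual_v2"):
    """Return one request body per chunk, with neighbouring text as context."""
    payloads = []
    for i, chunk in enumerate(chunks):
        payload = {"text": chunk, "model_id": model_id}  # model_id is assumed
        if i > 0:
            payload["previous_text"] = " ".join(chunks[:i])
        if i < len(chunks) - 1:
            payload["next_text"] = " ".join(chunks[i + 1:])
        payloads.append(payload)
    return payloads


def synthesize(payload, voice_id, api_key):
    """POST one payload and return the raw audio bytes (network call)."""
    request = urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "xi-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def concatenate_mp3(audio_chunks, out_path):
    """Join MP3 byte strings with Pydub and write the result to out_path."""
    from pydub import AudioSegment  # imported lazily; Pydub needs ffmpeg

    combined = AudioSegment.empty()
    for audio in audio_chunks:
        combined += AudioSegment.from_mp3(io.BytesIO(audio))
    combined.export(out_path, format="mp3")
```

A typical flow would call build_text_payloads on the list of chunks, pass each payload to synthesize, and hand the resulting byte strings to concatenate_mp3.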
Conditioning on past generations
Text conditioning works well when no previous or next chunks have been generated yet. If they have, however, it works much better to provide the actual past generations to the model instead of just the text. This is done with the previous_request_ids and next_request_ids parameters.
Every text-to-speech request has an associated request-id, which is obtained by reading the response header. This request_id can then be used to condition later requests on the previous generations.
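The flow described above, reading the request-id from each response and passing it to later requests, can be sketched like this. It is a sketch under assumptions: the endpoint URL, the xi-api-key header, the exact header name request-id and the helper names are illustrative. The loop takes the sender as a parameter so it can be tried without network access.

```python
import json
import urllib.request

# Assumed endpoint; check the API reference for the actual URL.
API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def send_request(payload, voice_id, api_key):
    """POST a payload and return (audio_bytes, request_id). Network call."""
    request = urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "xi-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        audio = response.read()  # read the whole body before reusing the id
        request_id = response.headers.get("request-id")  # assumed header name
    return audio, request_id


def generate_in_order(chunks, send):
    """Generate chunks one by one, conditioning each on earlier request ids.

    `send` is any callable that maps a payload dict to (audio, request_id).
    """
    audio_parts, request_ids = [], []
    for chunk in chunks:
        payload = {"text": chunk}
        if request_ids:
            # Copy the list so later appends do not alter this payload.
            payload["previous_request_ids"] = list(request_ids)
        audio, request_id = send(payload)
        audio_parts.append(audio)
        if request_id is not None:
            request_ids.append(request_id)
    return audio_parts, request_ids
```

For a real run, `send` could be `lambda p: send_request(p, voice_id, api_key)`, where voice_id and api_key are your own values.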
Note that the order matters here: if a text is split into 5 chunks, chunks 1, 2, 4 and 5 have already been converted, and you now want to convert chunk 3, the previous_request_ids to send would be [request_id_chunk_1, request_id_chunk_2] and the next_request_ids would be [request_id_chunk_4, request_id_chunk_5].
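The ordering rule can be captured in a small helper. The function name and the convention of using None for chunks that have not been generated yet are hypothetical and chosen only for illustration.

```python
def neighbouring_request_ids(request_ids, index):
    """Split known request ids around the chunk at `index`, in chunk order.

    `request_ids` holds one entry per chunk; None marks a chunk that has
    not been generated yet.
    """
    previous_ids = [rid for rid in request_ids[:index] if rid is not None]
    next_ids = [rid for rid in request_ids[index + 1:] if rid is not None]
    return previous_ids, next_ids


# Chunks 1, 2, 4 and 5 are done; chunk 3 (index 2) is still missing.
known = ["request_id_chunk_1", "request_id_chunk_2", None,
         "request_id_chunk_4", "request_id_chunk_5"]
previous_ids, next_ids = neighbouring_request_ids(known, 2)
```

Because the lists are built by slicing in chunk order, the ids before the current chunk always end up in previous_request_ids and the ids after it in next_request_ids.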
Conditioning both on text and past generations
The best possible results are achieved when conditioning on both text and past generations, so let's combine the two by providing previous_text, next_text and previous_request_ids in one request.
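A combined request body might be assembled as in the sketch below. The helper name is hypothetical, and joining the neighbouring chunks with a single space is an assumption about how the text was split. The parameter names previous_text, next_text and previous_request_ids are the ones named above.

```python
def build_combined_payload(chunks, index, previous_request_ids):
    """Build one request body conditioned on both text and past generations."""
    payload = {"text": chunks[index]}
    if index > 0:
        payload["previous_text"] = " ".join(chunks[:index])
    if index < len(chunks) - 1:
        payload["next_text"] = " ".join(chunks[index + 1:])
    if previous_request_ids:
        # Ids of the chunks generated before this one, in chunk order.
        payload["previous_request_ids"] = list(previous_request_ids)
    return payload
```

For the first chunk no request ids exist yet, so only next_text is set. From the second chunk on, the payload carries both kinds of context.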
Things to note
- Providing wrong previous_request_ids and next_request_ids will not result in an error.
- To use the request_id of a request for conditioning, that request must have been processed completely. In the case of streaming, this means the audio has to be read completely from the response body.
- How well Request Stitching works varies greatly depending on the model, voice and voice settings used.
- previous_request_ids and next_request_ids should contain request_ids that are not too old. Request_ids older than two hours diminish the effect of conditioning.
- Enterprises with increased privacy requirements will have Request Stitching disabled.
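Because stale or wrong request_ids do not raise an error, it can help to track when each request_id was obtained and drop the ones past the two-hour mark before sending them. The following is a minimal sketch; the record format (pairs of request_id and creation time) and the helper name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the note above about request_ids older than two hours.
MAX_AGE = timedelta(hours=2)


def fresh_request_ids(records, now=None, max_age=MAX_AGE):
    """Keep only request ids that are at most `max_age` old, in order.

    `records` is a list of (request_id, created_at) tuples, where
    created_at is a timezone-aware datetime.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return [rid for rid, created_at in records if now - created_at <= max_age]
```

The filtered list can then be passed as previous_request_ids or next_request_ids.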