Introducing speech to speech
Say it how you want it and transform your voice into another character with full control over emotions, timing, and delivery
We’ve added Speech to Speech (STS) to Speech Synthesis. STS is a voice conversion tool that makes a recording of one voice sound as if it were spoken by another, with control over emotion, tone, and pronunciation beyond what's possible with TTS prompts alone. Use it to extract more emotion from a particular voice or as a 'say it how you want it' reference.
In other updates, we’re making changes to our premade voices, and we’ve made a number of improvements to Projects, including normalization, a pronunciation dictionary, and more customization options.
Speech to speech
STS takes the content and style of speech contained in your upload or recording and changes the voice. Think of STS as primarily useful for two things.
One is to extract more emotion from a particular premade voice. Upload or record highly expressive speech and STS will replicate its emotions and intonation in another voice. Since not all voices can be made to express strong emotions with TTS prompts alone, you can now make a professional narrator or a children’s book character more expressive using your own voice.
Another use for STS is providing a ‘reference’ for speech delivery. While our TTS usually nails the intonation straight away, you may sometimes wish to fine-tune it. Here, STS lets you demonstrate how a particular phrase should be intoned and then have any voice you choose say it that way. This will become more immediately useful and streamlined once we integrate STS directly into Projects, but our aim is to radically improve your ability to edit output precisely.
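If you prefer to script this workflow, a conversion request might look roughly like the sketch below. The endpoint path, field names, and model ID here are assumptions for illustration, not the definitive interface; check the current API reference before relying on them.

```python
import requests

# Hypothetical endpoint and parameter names -- verify against the API docs.
API_KEY = "your-api-key"
VOICE_ID = "target-voice-id"  # the voice you want the speech rendered in
URL = f"https://api.elevenlabs.io/v1/speech-to-speech/{VOICE_ID}"

# The source recording carries the content, emotion, and timing to preserve.
with open("reference_performance.mp3", "rb") as source:
    response = requests.post(
        URL,
        headers={"xi-api-key": API_KEY},
        files={"audio": source},
        data={"model_id": "eleven_english_sts_v2"},  # assumed model identifier
    )
response.raise_for_status()

# The response body is the same performance spoken in the target voice.
with open("converted.mp3", "wb") as out:
    out.write(response.content)
```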
Watch the video created by one of our community members:
Research
To convert source speech into target speech, we need to express the content of the source speech with the characteristics of the target speech. A good analogy is face-swapping apps, which let you mix your face with somebody else’s to create a picture of both as one.
The way to go about this is to take the image of a face and map its attributes with markers: they define the limits inside which the other face is rendered.
The trick in voice conversion is to render the content of the source speech using the target voice's phonemes. But there's a tradeoff here, much as in face-swapping: the more markers you use to map one face's attributes, the more constraints you impose on the face rendered inside them, and fewer markers means fewer constraints.
The same is true of voice conversion. The more preference we give to the target voice, the more we risk falling out of sync with the source speech; give it too little preference, and we lose much of what makes that voice characteristic. For example, if we were to render a recording of somebody shouting angrily in a whispery voice, we’d be in trouble. Give too much preference to the source emotions and the price we pay is losing the impression that a whispery voice is speaking; put too much emphasis on the whispery speech pattern and we lose the emotional charge of the source speech.
Product & recent updates
Changes to premade voices
We'll be making changes to the default voices available in Speech Synthesis later this week. We'll stop supporting a few voices but will replace them with new ones, and we plan to add over 20 in total in the coming weeks.
We will also start showing in the UI how long each voice is expected to remain available. Finally, throughout December we'll work on revamping our platform’s voice sharing and usage compensation features to further improve voice variety. More details on this soon.
Eleven Turbo v2 & uLaw 8 kHz format
Our Turbo model packs months of research from our tech team. It’s designed for real-time interactions but can be used for anything you want. It also supports the standard μ-law (uLaw) 8 kHz output format used by IVR telephony systems.
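As a rough sketch of how you might request that format: the query-parameter name and format value below are assumptions for illustration, so confirm them against the API reference.

```python
import requests

# Illustrative request for Turbo v2 audio in 8 kHz mu-law; the
# "output_format" parameter and its "ulaw_8000" value are assumptions.
API_KEY = "your-api-key"
VOICE_ID = "your-voice-id"
URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

response = requests.post(
    URL,
    params={"output_format": "ulaw_8000"},  # 8 kHz mu-law for IVR/telephony
    headers={"xi-api-key": API_KEY},
    json={
        "text": "Thank you for calling. How can I help you today?",
        "model_id": "eleven_turbo_v2",  # the Turbo model described above
    },
)
response.raise_for_status()

with open("prompt.ulaw", "wb") as out:
    out.write(response.content)
```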
Normalization & metadata in Projects
You can now follow industry-standard audiobook submission guidelines within Projects, including adjusting gain and applying dynamic range compression. You can also embed metadata (ISBN, author, and title) in your Projects.
Pronunciation dictionary
Adding a Pronunciation Dictionary has been one of our most requested features. Last month we added SSML tags for specifying pronunciation using the IPA and CMU dictionaries for our English models. We've now released pronunciation dictionary support in the Projects UI, allowing you to upload a file that specifies pronunciations using IPA, CMU, or word substitutions. Dictionary files are uploaded using the industry-standard, open .PLS lexicon file format.
For now, IPA and CMU are supported by Turbo v2 English, and word substitutions (aliases) are supported by all models and languages. Full docs can be found here.
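For reference, a minimal lexicon in the W3C Pronunciation Lexicon Specification (.PLS) format might look like the sketch below, here written out from Python; the specific entries are made up for illustration.

```python
# A minimal .PLS lexicon with one IPA pronunciation and one word
# substitution (alias), saved to disk for upload in the Projects UI.
# The entries themselves are illustrative examples.
PLS_LEXICON = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>təˈmɑːtoʊ</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>UN</grapheme>
    <alias>United Nations</alias>
  </lexeme>
</lexicon>
"""

with open("my_lexicon.pls", "w", encoding="utf-8") as f:
    f.write(PLS_LEXICON)
```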
If you have any feedback, don't hesitate to reach out to us on Discord!