How it went
We've just come back from this year's INTERSPEECH conference - the best opportunity we've had so far to present the developments we've been working on these past couple of months and to get feedback on them.
It’s been great to learn from and share ideas with the best in the field and to forge future relations in the process. We met teams from some fantastic startups working in the same field as us, particularly on voice cloning, speech synthesis (TTS) and voice conversion (VC) (Supertone and LOVO to name but two). We were equally excited to talk with some of the most well-established companies out there, like Meta and Google, about the behind-the-scenes work that goes into developing TTS and VC software.
We got straight to business. The sincere enthusiasm for our work couldn't have made us happier - it surpassed all our expectations. Over the next four days we discussed our research and progress across those three speech tech areas - the absolutely crucial first steps on our way to developing our proprietary automatic dubbing tool, version 1.0 of which we plan to release early next year.
The most important thing for us here was to prove that we can faithfully clone voices - that we're able to preserve voice similarity between the source voice data on which we train our algorithm and the way the same voice sounds when generated synthetically. And secondly, it was crucial for us to prove that our TTS tools are on track to becoming part of the most human- and natural-sounding synthetic speech platform out there by providing second-to-none prosody and tonality.
The former is naturally important since we need the newly generated utterances to be readily identifiable as spoken by a particular person - we need to correctly preserve speaker identity. Prosody and tonality matter because tone and pacing convey intent, which is what makes speech sound human in the first place. The holy grail here is for the program not only to pronounce words fluently but also to overlay the utterance with an appropriate emotional charge, so that it sounds as if it understands what it's saying.
You can see one such TTS demo we used during the conference below. The first link is the original video and then our sample containing the same message spoken in a different voice follows. Mind you, this is text-to-speech - not voice conversion. Our only input was writing down the words spoken in the original video to generate the speech you hear. All prosody and intonation are down to the algorithm itself, there's no post-processing involved. See if you recognize whose voice it is!
You'll read more on Eleven TTS technology in our next entry dedicated specifically to generating speech from text input.
If you like our tech and would like to become a beta tester, you can sign up here.
Eleven Labs voice cloning TTS:
Content over form
In the months preceding the conference, our efforts were focused almost exclusively on delivering demonstrable samples of our tech and on showing our proprietary research. After all, INTERSPEECH is a research conference, and we were adamant that content must precede form, especially at a gathering so specifically oriented. Come conference day, though, we started joking that our heightened focus on tech perhaps made our branding efforts seem too minimalist. We were soon quite relieved - if not vindicated! - to find others, including the big players, opting for humbler set-ups as well.
Until next year
Our Korea trip was a great success for Eleven and a big dose of motivation to push ever harder. We're already excited just thinking about the progress we can make over the next year, both in our research and in how we present it. Hopefully by then we'll have our production-quality dubbing tools ready and we'll be using people's voices to let them speak the languages they don't.