OpenAI Voice Assistant

May 13, 2024 • 5 minutes reading time

And its rumoured integration into Apple's iOS 18

OpenAI has been expanding its portfolio with new products, and one of the most talked about is its Voice Assistant technology. It could change how we interact with machines by voice, yet much about its broader deployment remains under wraps.

Reportedly, OpenAI is developing a technology that combines audio, text, and image recognition in a single product. It could, for example, help children with their math homework or give users practical help in their immediate environment, such as language translation or vehicle repair guidance.

What is OpenAI's Voice Assistant?

The rumoured Voice Assistant is designed to interact naturally with users through speech. It builds on advances in Automatic Speech Recognition (ASR), Large Language Models (LLMs), and Text to Speech (TTS) systems. Together, these technologies allow the Voice Assistant to understand spoken input, process it in context, and respond in a natural, human-like voice.

OpenAI is expected to demo a real-time voice assistant tomorrow. What does it take to deliver an immersive, or even magical experience?

Almost all voice AI go through 3 stages:
1. Speech recognition or "ASR": audio -> text1, think Whisper;
2. LLM that plans what to say next:… pic.twitter.com/q41KlGKM42
— Jim Fan (@DrJimFan) May 12, 2024

Almost all voice AI systems follow three steps:

1. Speech Recognition ("ASR"): This converts spoken audio to text. An example technology is Whisper.
2. Language Model Processing: Here, a language model determines the appropriate response, transforming the initial text to a response text.
3. Speech Synthesis ("TTS"): This step converts the response text back into spoken audio, with technologies like ElevenLabs or VALL-E as examples.
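
As a rough illustration, the three steps can be sketched as three functions called one after another. Everything here is a hypothetical stand-in: transcribe, plan_reply, and synthesize are toy placeholders invented for this sketch, not the API of Whisper, any language model, or any TTS product.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One user turn and the assistant's reply, captured at each stage."""
    heard_text: str     # output of step 1 (ASR)
    reply_text: str     # output of step 2 (language model)
    reply_audio: bytes  # output of step 3 (TTS)


def transcribe(audio: bytes) -> str:
    """Step 1 stand-in: pretend the audio bytes are UTF-8 encoded speech."""
    return audio.decode("utf-8")


def plan_reply(text: str) -> str:
    """Step 2 stand-in: a canned rule in place of a real language model."""
    return f"You said: {text}"


def synthesize(text: str) -> bytes:
    """Step 3 stand-in: pretend UTF-8 bytes are synthesized audio."""
    return text.encode("utf-8")


def handle_turn(audio: bytes) -> Turn:
    """Run the three steps strictly in order; each waits for the previous one."""
    heard = transcribe(audio)
    reply = plan_reply(heard)
    return Turn(heard, reply, synthesize(reply))


turn = handle_turn(b"what is two plus two")
print(turn.reply_text)
```

In a real system each placeholder would wrap a model call, but the control flow, audio in, text, text, audio out, would look much the same.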

Running these three stages strictly one after another can lead to significant delays, because each stage must finish before the next can begin. If users have to wait five seconds for each response, the interaction becomes cumbersome and unnatural, diminishing the user experience even if the audio sounds realistic.
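
To make the delay concrete: in a strictly sequential pipeline, the wait the user perceives is the sum of the stage latencies. A minimal sketch follows; the per-stage figures are hypothetical, chosen only so that they add up to the five-second example, and are not measurements of any real system.

```python
from typing import Mapping

# Hypothetical per-stage latencies in seconds (illustrative, not measured).
STAGE_LATENCY_S = {"asr": 1.0, "llm": 2.5, "tts": 1.5}


def sequential_latency(stages: Mapping[str, float]) -> float:
    """Each stage waits for the previous one to finish, so the delays add up."""
    return sum(stages.values())


total = sequential_latency(STAGE_LATENCY_S)
print(f"user waits {total:.1f} s before hearing anything")
```

Under this simple model, shaving time off a single stage helps only linearly; the wait stays the sum of all three unless the stages are allowed to overlap.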

Effective natural dialogue doesn't operate sequentially:

We think, listen, and speak simultaneously.
We naturally interject affirmations like "yes" or "hmm."
We anticipate when someone will finish talking and respond immediately.
We can interrupt or talk over someone in a non-offensive way.
We handle interruptions smoothly.
We can engage in conversations involving multiple people effortlessly.

Enhancing real-time dialogue isn't just about speeding up each neural network process; it requires a fundamental redesign of the entire system. We need to maximize the overlap of these components and learn to make real-time adjustments effectively.
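
One way to picture this overlap is to stream partial results between stages instead of waiting for complete ones. The sketch below uses Python generators and hypothetical stand-ins (llm_stream, tts_stream, play) to show the first sentence being played before the later ones have even been generated. It is a single-threaded illustration of the ordering only, not a description of how OpenAI or any other vendor builds its system.

```python
from typing import Iterator, List


def llm_stream(prompt: str, log: List[str]) -> Iterator[str]:
    """Hypothetical language model that yields its reply sentence by sentence."""
    for sentence in ["Sure.", "Two plus two is four.", "Anything else?"]:
        log.append(f"generate: {sentence}")
        yield sentence


def tts_stream(sentences: Iterator[str], log: List[str]) -> Iterator[bytes]:
    """Hypothetical TTS that synthesizes each sentence as soon as it arrives."""
    for sentence in sentences:
        log.append(f"synthesize: {sentence}")
        yield sentence.encode("utf-8")  # stand-in for real audio


def play(chunks: Iterator[bytes], log: List[str]) -> None:
    """Hypothetical audio output that plays each chunk as it is produced."""
    for chunk in chunks:
        log.append(f"play: {chunk.decode('utf-8')}")


log: List[str] = []
play(tts_stream(llm_stream("what is two plus two", log), log), log)
print("\n".join(log))
```

Because each generator pulls from the one before it on demand, the log interleaves generate, synthesize, and play for each sentence in turn. A production system would go further, for example by running stages concurrently and cancelling playback when the user interrupts.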

OpenAI seems to be working on having phone calls inside of chatGPT. This is probably going to be a small part of the event announced on Monday.
(1/n) pic.twitter.com/KT8Hb54DwA
— Ananay (@ananayarora) May 11, 2024

Applications and potential integration with Apple's iOS

Apparently, the Apple - OpenAI deal just closed! One day before the voice assistant announcement :)

Guess Apple decided that it couldn't make it on its own 🤷

The new Siri will be from OpenAI pic.twitter.com/Yfr6oCJiwQ
— Bindu Reddy (@bindureddy) May 13, 2024

The potential applications of this technology are vast, ranging from personal and business uses to helping community health workers provide better services by interacting in local languages or aiding individuals with speech impairments.

Rumours suggest that this technology could be integrated into systems like Apple's iOS, offering a more seamless and interactive user experience than Siri. However, neither such collaborations nor the full capabilities of the Voice Assistant have been officially confirmed.

ElevenLabs Voice AI

One thing that is certain to feature in any advanced voice assistant is cutting-edge voice AI. ElevenLabs models combine proprietary methods for context awareness and high compression to deliver ultra-realistic, lifelike speech across a range of emotions and languages. Our contextual text to speech model is built to understand word relationships and adjust its delivery based on context. It also has no hardcoded features, meaning it can dynamically predict thousands of voice characteristics while generating speech. Our models are optimised for particular applications, such as long-form and multilingual speech generation or latency-sensitive tasks.