OpenAI voice: use pictures and voice commands in ChatGPT

Sep 1, 2023 • 11 minutes reading time

Converse with ChatGPT using your own voice

A smartphone displaying a holographic microphone with voice command icons and digital sound waves.

Ever found yourself musing over the possibility of conversing with ChatGPT using your own voice or sharing images with it? It appears your visionary dreams are on the brink of reality.

OpenAI's latest advancements usher in a groundbreaking era where voice and imagery coalesce, enabling ChatGPT to respond not just to your keystrokes but also to your spoken words and shared visuals.

Picture yourself meandering past an architectural marvel and diving into an animated conversation about its history, or orchestrating a culinary discussion inspired by a snapshot of your refrigerator's interior.

Thanks to the integration of a state-of-the-art text-to-speech model, engagements with ChatGPT evolve from mere interactions to immersive dialogues. It transcends traditional querying, offering a platform for fluid conversations, be it for a whimsical bedtime story or resolving a culinary quandary.

This is the dawn of an era where voice, vision, and virtual intellect fuse seamlessly.

So, can you talk to ChatGPT?

Yes, you can. Read on to discover how.

Article summary

What is OpenAI voice?
Everything you can do with OpenAI voice
OpenAI voice limitations
Generative voice AI

What is OpenAI voice?

OpenAI Voice is a cutting-edge technology that makes AI-based conversations sound more human-like. A significant component of its success is attributed to the Whisper model.

Whisper is an automatic speech recognition system that's been trained on a vast amount of data — around 680,000 hours of multilingual content from the web.

This extensive training allows it to understand a wide range of accents, adapt to background noises, and grasp technical language. The system is also adept at translating various languages into English.

The way Whisper works is quite straightforward. When it receives audio input, it divides it into 30-second segments. These segments are then transformed into a format called a log-Mel spectrogram.
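As a rough illustration of the segmenting step, here is a minimal Python sketch that cuts a stream of audio samples into fixed 30-second windows and pads the last one with silence. The 16 kHz sample rate and the function name split_into_segments are assumptions made for this example, and the sketch simplifies how a real system would slide its window over the audio.

```python
SAMPLE_RATE = 16_000   # assumed sample rate in Hz (illustrative)
SEGMENT_SECONDS = 30   # segment length described in the text


def split_into_segments(samples, sample_rate=SAMPLE_RATE, seconds=SEGMENT_SECONDS):
    """Split a list of audio samples into fixed-length segments.

    The final segment is zero-padded so every segment has the same length.
    """
    segment_len = sample_rate * seconds
    segments = []
    for start in range(0, len(samples), segment_len):
        chunk = list(samples[start:start + segment_len])
        # Pad the tail of the last chunk with silence.
        chunk.extend([0.0] * (segment_len - len(chunk)))
        segments.append(chunk)
    return segments


# 70 seconds of silent audio becomes three segments, the last one padded.
audio = [0.0] * (70 * SAMPLE_RATE)
segments = split_into_segments(audio)
print(len(segments), len(segments[0]))
```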

Simply put, a log-Mel spectrogram is a visual representation of how the frequencies in a sound signal change over time. The "Mel" refers to the mel scale, which spaces frequencies the way human hearing perceives pitch, and the "log" compresses the wide range of loudness values. Together they give the system a compact representation that is easier to analyze and process.
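To make the "log-Mel" idea more concrete, the sketch below shows the two ingredients in isolation: a mapping from Hz to the mel scale, and a logarithmic compression of power values. It uses a commonly cited mel formula, which is an assumption here and not necessarily the exact variant Whisper uses. The helper names and the choice of 8 bands are also illustrative only, and this is not a full spectrogram implementation.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (a commonly used formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands, f_min, f_max):
    """Return band center frequencies (Hz) spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


def log_compress(power, floor=1e-10):
    """Log-scale a power value, clamping tiny values to avoid log(0)."""
    return math.log10(max(power, floor))


centers = mel_band_centers(n_bands=8, f_min=0.0, f_max=8000.0)
# Gaps between neighbouring bands widen as frequency rises, which mirrors
# how hearing resolves low pitches more finely than high ones.
gaps = [b - a for a, b in zip(centers, centers[1:])]
print([round(c) for c in centers])
```

Evenly spaced points on the mel scale land close together at low frequencies and far apart at high ones, which is why a mel spectrogram devotes more detail to the range where speech carries most of its information.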

After this transformation, an encoder processes the data, and a decoder predicts the corresponding text. This process also includes special indicators or tokens that can identify languages and even translate speech into English.
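The special tokens can be pictured as a short prefix handed to the decoder before it starts predicting text. The sketch below assembles such a prefix using the token names described in the Whisper paper. The helper build_decoder_prompt is hypothetical and exists only to illustrate the idea; it is not part of any library.

```python
def build_decoder_prompt(language, translate=False, timestamps=False):
    """Assemble the control-token prefix that steers the decoder.

    language   -- language code of the audio, e.g. "en" or "fr"
    translate  -- True to request English output, False to keep the language
    timestamps -- False adds the token that disables timestamp prediction
    """
    tokens = ["<|startoftranscript|>", f"<|{language}|>"]
    tokens.append("<|translate|>" if translate else "<|transcribe|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


# French audio, English text out:
print(build_decoder_prompt("fr", translate=True))
# French audio, French text out:
print(build_decoder_prompt("fr"))
```

Because the task is chosen by a token and not by a separate model, the same encoder and decoder can serve both transcription and translation.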

It's worth noting that while many existing models rely on specific, limited datasets, Whisper's strength comes from its broad and diverse training.

Although it might not always outperform models designed for very specific tasks, its wide-ranging training means it's versatile and can handle a broader spectrum of challenges.

For example, it can understand and convert a significant amount of non-English audio content, either retaining the original language or translating it to English.
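For readers who want to try this themselves, here is a minimal sketch using the open-source whisper Python package, assuming it is installed (pip install openai-whisper) along with ffmpeg. The helper names, the "base" model size, and the file name in the usage comment are placeholders chosen for this example.

```python
def decode_options(to_english):
    """Pick the task: keep the source language or translate to English."""
    return {"task": "translate" if to_english else "transcribe"}


def speech_to_text(audio_path, to_english=False, model_name="base"):
    """Transcribe (or translate) an audio file and return (language, text)."""
    # Imported lazily so the module loads even without the package installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, **decode_options(to_english))
    return result["language"], result["text"].strip()


# Example usage (the file name is a placeholder):
#   language, text = speech_to_text("interview_fr.mp3", to_english=True)
#   print(language, text)
```

Setting to_english=False keeps the transcript in the language that was spoken, while to_english=True asks the model for an English translation.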

So, when you ask the ChatGPT voice assistant for a bedtime story or an answer to a question, Whisper transcribes your speech into text, and a separate text-to-speech model reads the reply aloud. This combination ensures interactions that are both natural and informed, bridging the gap between AI and human conversation.

Everything you can do with OpenAI voice

The ChatGPT voice generator is not merely a technological tool; it's a gateway to immersive, multi-sensory experiences that make digital interactions more intuitive and encompassing.

Let's delve into its expansive capabilities:

Speak questions to ChatGPT

Gone are the days when interactions with ChatGPT were limited to typing. Now, striking up a conversation is as simple as:

Opening the ChatGPT app and logging in with your OpenAI Account.
Tapping on 'new question'.
Selecting the headphone icon.
Choosing a preferred voice.
Voicing out your query.
Waiting a moment to receive a vocally articulated response.

Imagine casually asking, "Tell me about the Renaissance period," and having a nuanced, articulate reply spoken back.

This dynamic offers more than just answers. It provides an experience of human-like discourse with an AI.

Screenshots of a voice selection and calling interface on a mobile device, showing options to choose a voice, a calling screen with a large circle, and a call in progress with options to pause or end the call.

Text-to-speech model

OpenAI's new voice technology heralds an era of auditory diversity. From the tranquil tones of a baritone to the vibrant pitches of a soprano, OpenAI Voice encapsulates a spectrum of voices.

Beyond mere replication, this technology crafts synthetic voices that bear an uncanny resemblance to genuine human speech, enhancing authenticity in interactions.

However, it's important to note that while the potential applications are vast, they come with ethical considerations. The precision of voice synthesis, though remarkable, could be misused for deceit or impersonation.

OpenAI acknowledges these challenges and has actively taken measures to mitigate misuse, primarily by focusing on specific, beneficial use cases, like voice chat.

Image input

The ability to "see" and comprehend visual information pushes OpenAI Voice into a new frontier. But interpreting images is more than just understanding content; it's about ensuring safety and privacy and, at the same time, providing the same level of insight as a human being with knowledge on the subject.

OpenAI's work with 'Be My Eyes,' an app designed to assist blind and low-vision individuals, has been instrumental in shaping this vision capability.

For instance, a user might share an image of their TV settings, and OpenAI Voice can assist, even if there's a person in the background.

To ensure individual privacy, OpenAI has implemented measures to limit direct analysis of people within images, emphasizing the importance of both utility and ethical considerations.

Three screenshots of a mobile app displaying text-based answers to questions about a car, a building, and a skyscraper, with images of a Suzuki Jimny, the Palace of Westminster, and the Burj Khalifa.

Images used: Pexels, Pexels, Pexels

Translating podcasts

In collaboration with Spotify, OpenAI Voice is set to redefine the podcasting landscape.

By harnessing OpenAI's voice generation technology, Spotify aims to offer podcast translations that aren't just linguistically accurate but also emotionally congruent. Imagine listening to a podcast originally in English, now available in multiple languages, all while preserving the unique nuances of the original speaker.

This goes way beyond mere translation. It represents a recreation that ensures listeners across the globe can connect deeply with the content.

OpenAI voice limitations

While OpenAI Voice stands as a beacon of innovation in the realm of AI interactions, it's vital to understand that, like all technological marvels, it comes with its own set of limitations:

Image recognition and safety:

Vision, as embedded in ChatGPT, primarily aims to enhance daily life interactions, functioning optimally when interpreting what users visually encounter. Collaborations with platforms like 'Be My Eyes' have enriched OpenAI's perspective on visual capabilities, making it sensitive to the needs of the visually impaired.

For instance, users might share an image of a crowded park to inquire about plant species, even though there are people in the distance enjoying a picnic.

This vision feature is not infallible, however. Because the model's accuracy can vary and individual privacy must be upheld, OpenAI has incorporated measures to limit ChatGPT's scope in making definitive remarks about individuals within images.

As real-world feedback pours in, the emphasis is on refining these protective measures, ensuring a balance between functionality and safety. To dive deeper into the intricacies of image input, this study based on the system card offers invaluable insights.

Specialized topics:

OpenAI Voice, while impressive, is not a substitute for expert advice, especially in specialized sectors like research or medical advice. Users are encouraged to approach such high-risk topics with caution, always seeking verification before relying on the model's output.

Language proficiency:

OpenAI Voice is adept at transcribing English, but its proficiency wanes with certain non-English languages, particularly those written in non-Roman scripts. Consequently, non-English users are advised to exercise caution when using the voice features in such languages.

Voice cloning concerns:

The capability to generate near-perfect synthetic voices, while groundbreaking, comes with the shadow of potential misuse. Impersonation and fraudulent activities are concerns that users must be aware of, underscoring the importance of ethical and informed usage.

While OpenAI Voice offers a plethora of opportunities to enhance digital interactions, recognizing its boundaries is crucial to harnessing its potential responsibly.

Generative voice AI

In a world inundated with digital voices, true innovation lies not just in mimicking speech but in crafting personalized auditory experiences.

The true pioneers in this space are those who look beyond mere language barriers to bridge emotional and cultural divides.

ElevenLabs, with its cutting-edge approach to voice synthesis, emerges as a true game-changer in this domain.

Bridging global narratives with ElevenLabs

Voice synthesis, at its core, is about communication. But for ElevenLabs, it's a commitment to global resonance. Their advanced multilingual AI technology ensures content doesn't merely reach audiences but truly connects with them, regardless of geographical boundaries.

With capabilities to offer text to speech in 70+ languages, ElevenLabs' AI goes beyond generic text-to-speech solutions. It harnesses deep learning to produce speech that's clear, emotionally charged, and culturally in tune.

ElevenLabs ensures the narrative remains authentic, encapsulating linguistic subtleties and regional nuances.

The true marvel, however, lies in the seamless integration of Professional Voice Cloning with the Multilingual TTS model. Once you've forged a digital replica of a voice with ElevenLabs, it can articulate content in any of the supported languages.

The best part is that your unique voice characteristics remain intact.

Imagine articulating in languages unfamiliar to you yet retaining your authentic vocal signature. It's the promise of global communication without losing individuality.

Navigating the ethical landscape of voice cloning

Voice cloning, the digital imitation of an individual's voice, is a double-edged sword. While it holds immense potential, ethical considerations are paramount.

With ElevenLabs, voice cloning is transformed into a safe, transparent process. By uploading a recorded voice, users can craft its digital counterpart, paving the way for new speech generation. However, safety protocols are rigorous.

Voice cloning is safest when it's personal: using one's voice and content. If leveraging someone else’s voice, permission is paramount.

Without consent, only a narrow set of non-commercial uses is permissible, such as private study, satire, or artistic expression, and even then the emphasis lies on ensuring privacy and respecting individual rights.

However, cloning voices for malicious intents, be it fraud or hate speech, is a strict no-go. Such actions aren't just against ElevenLabs' principles but might also attract legal consequences.

To delve deeper into the best practices and the nuances of voice cloning, ElevenLabs provides insights on how to safely use voice cloning.

While the horizons of voice AI continue to expand, companies like ElevenLabs set the gold standard by marrying innovation with responsibility.

ElevenLabs is building a world where voices are not just heard but genuinely understood across borders and beyond barriers.

FAQ

OpenAI Voice is a groundbreaking voice synthesis technology developed by OpenAI. It enables more human-like conversations with AI, allowing users to vocally interact with ChatGPT and receive auditory responses. The system is backed by Whisper, an automatic speech recognition system, ensuring robustness and versatility in understanding and replicating human speech.

OpenAI Voice goes beyond just answering queries. By leveraging the vast training data and the Whisper model, it can understand intricate nuances in voice, from accents to emotional undertones. Its integration with image recognition means it's not just listening but also "seeing" and comprehending visual information, making it a multi-sensory AI companion.

Yes, OpenAI acknowledges potential risks, especially with image recognition in high-risk domains and the misuse of voice cloning. Measures have been put in place to limit the system's scope in making definitive remarks about people within images. Users are also encouraged to be cautious with voice cloning, given the potential for impersonation and deceit.

ElevenLabs is a pioneer in the realm of global speech synthesis. Their advanced multilingual AI technology ensures content doesn't just reach global audiences but truly resonates with them. With capabilities like "text to speech in 70+ languages", they break language barriers while preserving emotional and cultural authenticity. Furthermore, ElevenLabs integrates Professional Voice Cloning with their Multilingual TTS model, enabling a unique voice to articulate in multiple languages and offering a blend of global reach with a personal touch.
