Stream builds multimodal AI agents with ElevenLabs

Integrating ElevenLabs Text to Speech cut setup time by 10x for developers building with voice


Stream has introduced Vision Agents - an open-source framework that enables developers to build low-latency, multimodal AI experiences combining real-time video, audio, and conversation. The framework integrates ElevenLabs Text to Speech to power expressive, responsive voices that enable seamless interaction between users and AI systems.


Enabling real-time, multimodal agents

Vision Agents gives AI the ability to see, hear, and respond in real time. Built on Stream’s video and audio SDKs, the framework provides a low-latency foundation for developers to prototype and deploy multimodal agent experiences.

When evaluating Text to Speech providers, Stream selected ElevenLabs for its market-leading quality and ease of integration - ElevenLabs now serves as the primary voice option for Stream’s users.

“ElevenLabs made it easy for us to quickly bring powerful text-to-speech capabilities to our SDK, allowing Agents to respond in real time with expressive voices to user questions or as feedback to what it’s seeing.” - Neevash Ramdial, Director of Marketing, Stream

Fast, reliable, and developer-friendly integration

Stream integrated ElevenLabs across its codebase in just a few days, enabling developers to add lifelike voice output to their vision agents with minimal configuration. The integration now delivers:

  • 10x faster setup - Pre-integration with ElevenLabs reduces voice setup time from 400 lines of code to just 40.
  • Low-latency performance - ElevenLabs’ fast voice generation, combined with Stream’s global edge network, ensures responsiveness that feels natural and human.
  • Scalable developer experience - Stream’s SDKs simplify the process of creating, testing, and deploying multimodal agents.
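To make the "pre-integrated voice" idea concrete, here is a minimal, illustrative sketch of what wiring a TTS provider into an agent can look like. The names (`Agent`, `TTSProvider`, `MockElevenLabsTTS`) are hypothetical stand-ins, not Stream's actual Vision Agents API; a real integration would call the ElevenLabs API and stream audio frames over Stream's edge network.

```python
# Illustrative sketch only: Agent, TTSProvider, and MockElevenLabsTTS are
# hypothetical names, not Stream's actual Vision Agents API.
from dataclasses import dataclass, field
from typing import Protocol


class TTSProvider(Protocol):
    """Minimal interface an agent needs from any text-to-speech backend."""

    def synthesize(self, text: str) -> bytes: ...


@dataclass
class MockElevenLabsTTS:
    """Stand-in for a pre-integrated ElevenLabs voice.

    A real provider would call the ElevenLabs API and return audio frames;
    here we return placeholder bytes so the sketch runs offline.
    """

    voice_id: str = "example-voice"

    def synthesize(self, text: str) -> bytes:
        return f"[{self.voice_id}] {text}".encode()


@dataclass
class Agent:
    """Routes an LLM's text reply through whichever TTS provider is configured."""

    tts: TTSProvider
    transcript: list[str] = field(default_factory=list)

    def respond(self, reply_text: str) -> bytes:
        self.transcript.append(reply_text)
        return self.tts.synthesize(reply_text)


# Swapping voice backends is a one-line change to the constructor argument.
agent = Agent(tts=MockElevenLabsTTS())
audio = agent.respond("I can see a red bicycle in the frame.")
```

The point of the sketch is the shape, not the specifics: because the provider sits behind a small shared interface, choosing ElevenLabs as the voice backend is a configuration choice rather than hundreds of lines of bespoke plumbing.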

Building the future of multimodal AI

Stream’s Vision Agents demonstrate how ElevenLabs models are expanding what’s possible in multimodal AI. By combining visual understanding with Text to Speech, developers can create agents that not only see, but also speak and listen with near-human fluency.

Looking to build with Text to Speech? Get in touch here.
