Interaction models for natural human AI communication

How we build AI systems that communicate in real time - covering the technical decisions behind turn-taking, latency, and expressive delivery, and the models we have shipped.

Interaction models are AI systems built to communicate with people in real time across audio, video, and text without the turn-based delays that define many chatbots today. 

We have been building toward this category for years. This post lays out what we have shipped, and the research and product decisions behind it.

Our flagship product - ElevenAgents with v3 Conversational

Amanda, an ElevenAgents voice agent, handles an inbound emergency loan call - navigating a distressed customer from panic to resolution in under two minutes.

What it takes to make an interaction model work

Three things have to hold together for an interaction system to feel natural and engaging:

  • Sub-second response. ElevenAgents are optimized for sub-100ms turnaround on internal benchmarks, with sub-200ms targeted for telephony integrations. Flash v2.5, our fastest Text to Speech model, runs at ~75ms inference.*
  • Turn-taking that handles interruption. To prevent the agent from interrupting too eagerly, the turn-taking system has to take into account not only silences but also what's being said.
  • Expressive, natural delivery. The model has to respond with the right tone, pacing, and emotion for the moment.

*Refers to model inference time only. Actual end-to-end latency will vary with factors such as your location and endpoint type used.
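The turn-taking requirement above can be sketched as a simple decision rule that weighs silence duration against a semantic signal. Everything here - the classifier score, the thresholds, the function name - is an illustrative assumption, not how ElevenAgents actually implements turn-taking.

```python
# Illustrative sketch: combine silence duration with a hypothetical
# end-of-turn classifier score. Thresholds are made up for clarity.

def should_take_turn(silence_ms: float, end_of_turn_prob: float) -> bool:
    """Decide whether the agent should start speaking.

    silence_ms: how long the user has been silent.
    end_of_turn_prob: confidence (0-1) that the utterance is
        semantically complete, e.g. from a model over the transcript.
    """
    if end_of_turn_prob >= 0.9 and silence_ms >= 150:
        return True   # clearly finished: respond fast
    if end_of_turn_prob >= 0.5 and silence_ms >= 500:
        return True   # probably finished: allow a longer pause
    if silence_ms >= 1500:
        return True   # very long pause: respond regardless
    return False      # likely a mid-sentence pause: keep listening
```

The point of the two-signal design is that a short pause after a clearly complete sentence should trigger a fast response, while the same pause mid-sentence should not.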

Some of what we have shipped

Eleven v3 - our most expressive Text to Speech model. Designed to shift register, laugh, and sigh with natural delivery.

Eleven v3 Conversational. Our conversational variant of v3, launched inside ElevenAgents in February 2026 with built-in turn-taking. The turn-taking model is enabled by default when v3 Conversational is selected as the TTS model.

Speculative turn-taking. A separate feature in v3 Conversational that pre-triggers LLM response generation during periods of user silence, reducing perceived latency.
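The speculative pattern - start drafting a reply during silence, throw it away if the user resumes - can be sketched with plain asyncio. The `generate_reply` stand-in and the overall structure are assumptions for illustration, not the production system.

```python
import asyncio

async def generate_reply(transcript: str) -> str:
    """Stand-in for a real LLM call; simulated latency only."""
    await asyncio.sleep(0.3)
    return f"reply to: {transcript}"

async def speculative_turn(transcript, user_resumed: asyncio.Event):
    """Pre-trigger reply generation during user silence.

    Returns the drafted reply if the silence held, or None if the
    user resumed speaking before the draft finished.
    """
    draft = asyncio.create_task(generate_reply(transcript))
    interrupt = asyncio.create_task(user_resumed.wait())
    done, pending = await asyncio.wait(
        {draft, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # whichever lost the race is no longer needed
    if draft in done:
        return draft.result()  # silence held: reply is ready immediately
    return None                # user kept talking: discard the draft
```

When the draft wins the race, the reply is already generated by the time the turn-taking system commits to speaking, which is where the perceived-latency reduction comes from.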

Flash v2.5. Our fastest Text to Speech model, designed for low-latency real-time use, at ~75ms inference.*

Scribe v2. Our Speech-to-Text model with industry-leading accuracy.

ElevenAgents Expressive Mode. Lets agents use expressive tags such as [laughs], [whispers], [sighs], and [slow] to control delivery in context.
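A pipeline feeding text into Expressive Mode might want to strip bracketed tags it doesn't recognize before synthesis. This is a minimal sketch of that idea; the allowed tag set and the stripping behavior are assumptions, not the ElevenAgents implementation.

```python
import re

# Hypothetical allow-list based on the tags mentioned above.
ALLOWED_TAGS = {"laughs", "whispers", "sighs", "slow"}

def sanitize_expressive_text(text: str) -> str:
    """Keep supported [tags] intact and drop any unrecognized ones."""
    def keep_or_drop(match: re.Match) -> str:
        tag = match.group(1).lower()
        return match.group(0) if tag in ALLOWED_TAGS else ""
    return re.sub(r"\[(\w+)\]", keep_or_drop, text)
```

For example, `sanitize_expressive_text("[whispers] stay calm [shouts] now")` keeps the `[whispers]` tag but removes `[shouts]`, since only known tags should reach the delivery model.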
