
How we engineered RAG to be 50% faster

Lessons from a latency-sensitive RAG system in production


When building conversational agents, every millisecond counts. Users expect instant, natural responses - but Retrieval-Augmented Generation (RAG), while essential for accuracy with large knowledge bases, often introduces latency.

We recently deployed an optimization that cut our RAG query generation latency by 50%, reducing 50th percentile (p50) response time from 326ms to 155ms.

The challenge: context-aware query generation

RAG systems need to transform conversation history into precise search queries that capture the user’s intent. Consider this customer support example:

User: "What are the API rate limits for the Professional plan?" Agent:エージェント:「プロフェッショナルプランには、テキスト読み上げで毎分10,000リクエストと、リアルタイムストリーミング用に1,000の同時WebSocket接続が含まれています。」 "And what about the Enterprise tier?" ユーザー:「エンタープライズプランについてはどうですか?」User:エージェント:

The last question references "those limits," requiring context from the entire conversation. Before searching the knowledge base, the system must rewrite it into a self-contained query such as:

"Can Enterprise plan API rate limits be customized for specific traffic patterns?"

Previously, this required a synchronous call to a single LLM, creating a hard dependency on that model’s latency and availability.
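In pseudocode terms, that rewrite step looks roughly like the sketch below. The prompt wording and the `llm_complete` helper are illustrative assumptions, not the exact production code:

```python
# Illustrative sketch of context-aware query rewriting.
# The prompt text and the `llm_complete` callable are assumptions, not production code.

REWRITE_INSTRUCTION = (
    "Rewrite the user's last message as a single, self-contained search query. "
    "Resolve pronouns and references using the conversation history. "
    "Return only the query."
)

def build_search_query(history: list[dict[str, str]], llm_complete) -> str:
    """history: list of {"role": "user" | "agent", "content": ...} turns.
    llm_complete: hypothetical callable that sends a prompt to one LLM and returns its text."""
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in history)
    prompt = f"{REWRITE_INSTRUCTION}\n\nConversation:\n{transcript}"
    return llm_complete(prompt).strip()
```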

The solution: parallel LLM racing with graceful fallbacks

Instead of relying on one LLM, we designed a system that sends multiple requests in parallel and uses the first successful response. We treat LLM query generation as a race where the fastest model wins.
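Conceptually, the race can be expressed with asyncio: start one query-generation task per model, return the first usable result, and cancel the rest. The sketch below is a simplified illustration of that pattern, not the production implementation; the `generators` callables stand in for the individual model clients.

```python
import asyncio

async def race_query_generation(generators, history, timeout: float = 1.0):
    """Run every query generator in parallel and return the first usable result.

    generators: async callables, one per model (placeholders for real clients).
    Returns None if no model produces a query within the timeout."""
    tasks = [asyncio.create_task(gen(history)) for gen in generators]
    try:
        for finished in asyncio.as_completed(tasks, timeout=timeout):
            try:
                query = await finished
                if query:            # first non-empty response wins the race
                    return query
            except Exception:
                continue             # a failing model simply drops out of the race
    except asyncio.TimeoutError:
        pass                         # no model answered in time; caller decides the fallback
    finally:
        for task in tasks:
            task.cancel()            # stop the slower models once a winner is chosen
    return None
```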

Heterogeneous model mix

The strength of this approach comes from mixing models with complementary characteristics. Google’s Gemini models (2.0-flash-lite and 2.5-flash-lite) excel at speed, often responding in under 200ms during off-peak hours. Our self-hosted Qwen models (3-4B and 3-30B-A3B) run on our own infrastructure, giving us full control over costs and avoiding external rate limits.
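A pool like this might be declared along the following lines; only the model names come from the list above, and the structure itself is a hypothetical configuration format:

```python
# Hypothetical pool declaration; model names are from the post, the format is not.
QUERY_GENERATION_POOL = [
    {"model": "gemini-2.0-flash-lite", "hosting": "external"},
    {"model": "gemini-2.5-flash-lite", "hosting": "external"},
    {"model": "qwen3-4b",              "hosting": "self-hosted"},
    {"model": "qwen3-30b-a3b",         "hosting": "self-hosted"},
]
```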

Each model has different latency patterns throughout the day - Gemini may slow during peak hours while our self-hosted models remain steady. By racing all four simultaneously, we turn unpredictable individual performance into predictable system-wide behavior.

Smart timeout handling

Sometimes none of the models respond within our 1-second timeout. To keep conversations flowing, we use a fallback strategy: defaulting to the most recent user message as the query. While less precise than an LLM rewrite, this still works effectively for retrieval and prevents stalled responses.
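Combined with the racing helper sketched earlier, the fallback is only a few lines; the function name and the shape of `history` are illustrative:

```python
async def generate_search_query(generators, history, timeout: float = 1.0):
    """Race the models; if none answers in time, fall back to the raw text of the
    most recent user message so retrieval can still proceed."""
    query = await race_query_generation(generators, history, timeout=timeout)
    if query:
        return query
    # Fallback: last user turn verbatim -- less precise, but keeps the conversation moving.
    return next((t["content"] for t in reversed(history) if t["role"] == "user"), "")
```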

This reflects a core principle: maintaining conversation flow is more important than perfect query optimization.

The results

The performance gains were significant across all percentiles:

  • Median latency dropped from 326ms to 155ms
  • 75th percentile improved from 436ms to 250ms
  • p95 latency improved from 629ms to 426ms

Beyond the speedups, the architecture improved reliability. When Gemini experienced an outage last month, our system continued operating seamlessly, with self-hosted models taking over. Since we already run this infrastructure for other services, the additional compute cost is negligible.

Most importantly, the system automatically adapts in real time, routing queries to whichever model is performing best without manual tuning.

Building voice AI with sub-200ms RAG

We see this architecture as a step toward AI assistants that understand context in real time. If you want to build voice agents with sub-200ms RAG, try ElevenLabs Agents.
