
Latency makes the difference between a good conversational AI application and a great one.
In most applications, latency is a minor concern. For conversational AI, however, latency is precisely what separates good applications from great ones.
The goal of conversational AI is also quite ambitious: not to surpass humans in intelligence, but to deliver the same feel, texture, and voice as a human conversation. To achieve this, an application must converse without long silences or gaps; otherwise, the realism collapses.
The latency challenge is compounded by conversational AI's piecemeal nature. Conversational AI is a series of interlocking processes, each considered state-of-the-art in its own field, and each of these processes adds a little latency.
As a generative voice company, we have spent a long time studying how to minimize conversational AI latency. Today we want to share what we have learned, in the hope that it is useful to anyone interested in building conversational AI applications.
Every conversational AI application involves at least four steps: speech-to-text, turn-taking (deciding when to take over the conversational turn), text processing (e.g. an LLM), and text-to-speech (TTS). While these steps are executed in parallel, each step still contributes some latency.
Notably, conversational AI's latency equation is unique. Many process-latency problems reduce to a single bottleneck: when a website makes a database request, for example, the web's network latency dominates the total, and the backend's VPC latency contributes only trivially. Conversational AI's latency components, by contrast, do not vary that dramatically. They are uneven, but each component's contribution is within the same order as the others'. Latency is therefore determined by the sum of the parts.
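As a rough illustration of this additive property, we can sketch a per-component latency budget. The round figures below are illustrative assumptions, not measurements:

```python
# Hypothetical per-component latency budget, in milliseconds.
# These round figures are illustrative assumptions, not measurements.
LATENCY_BUDGET_MS = {
    "speech_to_text": 100,   # gap between end of speech and final transcript
    "turn_taking": 300,      # silence threshold before declaring the turn over
    "text_processing": 350,  # LLM time to first token
    "text_to_speech": 135,   # TTS time to first audio byte
    "network": 100,          # speaker <-> system round trip
}

def total_latency_ms(budget: dict) -> int:
    """No single component dominates, so total latency is the sum of the parts."""
    return sum(budget.values())

print(total_latency_ms(LATENCY_BUDGET_MS))  # 985
```

Under these assumptions the pipeline lands just under the one-second mark, which is why every component's contribution matters.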
The system's "ears"
Automatic speech recognition (ASR), also called speech-to-text (STT), is the process of converting speech into text.
ASR's latency is not the time it takes to generate the text: the speech-to-text process runs in the background while the user is speaking. Rather, the latency is the gap between the end of speech and the end of text generation.
A short utterance and a long one can therefore incur similar ASR latency. Speech-to-text models are also highly optimized, so the latency, including the network round trip, is typically under 100 ms. (When the model is embedded in the browser, such as in Chrome/Chromium, there may be no network latency at all.)
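A minimal sketch of this measurement, using hypothetical timestamps in milliseconds:

```python
def asr_latency_ms(speech_end_ms: int, transcript_done_ms: int) -> int:
    """ASR latency is not transcription time: recognition runs in the
    background while the user speaks. Only the gap between the end of the
    utterance and the final transcript sits on the critical path."""
    return transcript_done_ms - speech_end_ms

# A long and a short utterance can show near-identical latency, because
# only the tail end of recognition is on the critical path.
long_utterance = asr_latency_ms(5000, 5080)   # 5 s of speech -> 80 ms
short_utterance = asr_latency_ms(1000, 1090)  # 1 s of speech -> 90 ms
print(long_utterance, short_utterance)  # 80 90
```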
The system's "instincts"
Turn-taking/interruption (TTI) is the intermediate process of determining when the user has finished speaking. The underlying model is known as voice activity detection (VAD).
Turn-taking involves a complex set of rules. The agent should not take its turn on every short utterance ("yeah", for instance); otherwise the conversation feels far too staccato. Instead, it must judge when the user (the speaker) is actually trying to get the model's attention, and whether the user has finished conveying their thought.
A good VAD does not signal the start of a new turn merely because it detects silence. There are silences between words (and phrases), so the model must be confident the user has actually stopped speaking. Achieving that reliably means looking for a silence threshold (more precisely, an absence of speech). This process introduces latency, which the user experiences as waiting.
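The silence-threshold idea can be sketched as a toy end-of-turn detector. The 20 ms frame size and 600 ms threshold below are illustrative assumptions (a production VAD is far more sophisticated), but they show how the threshold itself becomes latency the user waits through:

```python
class TurnDetector:
    """Toy end-of-turn heuristic, assuming a VAD that emits one
    speech/silence flag per audio frame. The threshold trades latency
    against the risk of interrupting a mid-sentence pause."""

    def __init__(self, frame_ms: int = 20, silence_threshold_ms: int = 600):
        self.needed = silence_threshold_ms // frame_ms  # silent frames required
        self.silent_frames = 0

    def feed(self, frame_is_speech: bool) -> bool:
        """Feed one VAD frame; return True once the turn is judged over."""
        if frame_is_speech:
            self.silent_frames = 0  # a pause between words resets the count
            return False
        self.silent_frames += 1
        return self.silent_frames >= self.needed

det = TurnDetector()
frames = [True] * 50 + [False] * 30  # 1 s of speech, then silence
results = [det.feed(f) for f in frames]
# The turn ends only on the 30th silent frame (600 ms of silence); that
# wait is exactly the TTI latency the user experiences.
print(results.index(True))  # 79
```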
Technically speaking, if every other conversational AI component had zero latency, some TTI latency would even be a good thing. Humans take a beat before responding to speech, and a machine that pauses similarly lends the interaction realism. In practice, though, the other components of conversational AI already add latency, so it is ideal to keep TTI latency to a minimum.
The system's "brain"
Next, the system needs to generate a response. Today this is typically done with a large language model (LLM) such as GPT-4 or Gemini Flash 1.5.
The choice of language model makes a big difference. Models like Gemini Flash 1.5 are extremely fast, producing output in under 350 ms. More robust models that can handle more complex queries, such as the GPT-4 family or Claude, may take 700 to 1,000 ms. Selecting the right model is usually the easiest lever when optimizing the conversational AI pipeline for latency.
But the LLM's latency is the time until token generation begins: those tokens can be streamed immediately into the subsequent text-to-speech process. Because text-to-speech is slowed by the natural pace of a human voice, the LLM reliably outpaces it. What matters is only the first-token latency (i.e., the time to first byte).
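Why only the first token matters can be sketched numerically. Assuming illustrative rates of 20 ms per generated token and 300 ms per spoken token (both figures are assumptions for the sketch, not benchmarks):

```python
def tts_never_starves(n_tokens: int, ttft_ms: int,
                      gen_ms_per_token: int, speech_ms_per_token: int) -> bool:
    """Token i is available at ttft + i*gen_rate and is needed by the voice
    at ttft + i*speech_rate; streaming works iff generation stays ahead."""
    return all(
        ttft_ms + i * gen_ms_per_token <= ttft_ms + i * speech_ms_per_token
        for i in range(n_tokens)
    )

# Generation (20 ms/token) far outpaces speech (300 ms/token), so after the
# first token arrives the voice never waits on the LLM again.
print(tts_never_starves(200, 350, 20, 300))  # True
```

If generation were slower than speech, the check would fail and the LLM's full generation time, not just its time to first token, would land on the critical path.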
Factors beyond model choice also affect LLM latency, including prompt length and knowledge-base size. The larger either one is, the longer the wait. It comes down to a simple principle: the more the LLM has to consider, the slower it is. Companies therefore need to strike a balance, giving the model the right amount of context without overloading it.
The system's "mouth"
The final component of conversational AI is text-to-speech (TTS). Text-to-speech's net latency is the time it takes to begin speaking after receiving input tokens from text processing. That's it: because additional tokens are made available at a rate faster than human speech, text-to-speech's latency is strictly the time to first byte.
Previously, text-to-speech was particularly slow, taking as long as 2-3 s to generate speech. However, state-of-the-art models like our Turbo engine are able to generate speech with just 300 ms of latency, and the new Flash TTS engine is even faster. Flash has a model time of 75 ms and can achieve an end-to-end time to first byte of audio of 135 ms, the best score in the field (we have to brag a little!).
Beyond the four components, several other factors affect conversational AI's net latency.
Latency always accrues when data moves from one place to another. In some conversational AI applications, the ASR, TTI, LLM, and TTS processes should ideally be co-located, so that the only source of non-trivial network latency is the path between the speaker and the system as a whole. This gives us an advantage on latency: because we have our own TTS and an internal transcription solution, we can save two server calls.
Many conversational AI applications invoke functions (that is, they interface with tools and services). For example, you might verbally ask the AI to look up the weather. This requires additional API calls invoked at the text-processing layer, which can incur significantly more latency depending on the need.
For example, ordering a pizza by voice may require multiple API calls, some with excessive lag (e.g. processing a credit card).
However, conversational AI systems can cope with function-call latency by prompting the LLM to respond to the user before the call completes (e.g., "Let me check the weather for you"). This mirrors real conversation, which never simply goes dead while the user waits.
These asynchronous patterns are usually implemented by leveraging webhooks to avoid long-running requests.
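A minimal sketch of this pattern using Python's asyncio. Here `fetch_weather` is a hypothetical stand-in for a slow external tool call, and the spoken lines are collected in a list for illustration:

```python
import asyncio

async def fetch_weather(city: str) -> str:
    """Hypothetical slow tool call (stands in for a real external API)."""
    await asyncio.sleep(0.05)  # simulated network latency
    return f"Sunny in {city}"

async def respond_with_filler(city: str) -> list:
    """Kick off the slow tool call, but speak a filler line immediately so
    the user never hears dead air while the call is in flight."""
    spoken = []
    task = asyncio.create_task(fetch_weather(city))
    spoken.append("Let me check the weather for you...")  # spoken right away
    spoken.append(await task)  # final answer once the tool returns
    return spoken

print(asyncio.run(respond_with_filler("Tokyo")))
# ['Let me check the weather for you...', 'Sunny in Tokyo']
```

In a real system the filler line would already be streaming through TTS while the function call runs, hiding most of the tool's latency from the user.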
Another common feature of conversational AI platforms is letting users dial in by phone (or, in some cases, placing calls on the user's behalf). Telephony adds extra latency, and that latency can depend heavily on geography.
As a baseline, if calls are confined to the same region, telephony adds roughly 200 ms of latency. For international calls (e.g. Asia to the U.S.), the latency can reach about 500 ms, and transit times can be considerably longer. The same pattern can appear when a user carries a phone number from outside the region they are in, which forces a hop through the home country's phone network.
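Using the round figures above (which vary heavily with carrier and geography, so treat them as illustrative), the telephony overhead simply stacks on top of the core pipeline:

```python
# Illustrative telephony overhead from the figures above; real values vary
# heavily with geography and carrier routing.
TELEPHONY_OVERHEAD_MS = {
    "same_region": 200,
    "international": 500,  # e.g. Asia -> U.S.
}

def end_to_end_ms(pipeline_ms: int, call_type: str) -> int:
    """Telephony is additive: it sits on top of the core pipeline latency."""
    return pipeline_ms + TELEPHONY_OVERHEAD_MS[call_type]

print(end_to_end_ms(800, "same_region"))    # 1000
print(end_to_end_ms(800, "international"))  # 1300
```

Note how an international call can push an otherwise sub-second pipeline well past the one-second target.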
We hope this tour of the conversational AI round trip was interesting. To recap: applications should target sub-second latency. That is usually achievable by selecting an LLM suited to the task, and when heavier processes run in the background, the application should keep engaging the user to prevent long stalls.
Ultimately, the goal is realism: users should feel the ease of talking to a human while enjoying the advantages of a computer program. Strengthening each of the sub-processes is what makes this possible.
At ElevenLabs, we optimize the latency-sensitive parts of conversational AI systems using state-of-the-art STT and TTS models. By working on each part of the process, we can achieve seamless conversation flows. This top-down view of orchestration allows us to shave off a little latency, even 1 ms, at every juncture.