Meet KUBI the Conversational Robot Barista

KUBI is a conversational robot barista powered by ElevenLabs' Conversational AI. Here's how it works.

KUBI is a conversational robot barista and receptionist at Second Space, a next-gen 24/7 co-working space in Kaohsiung, Taiwan. Since the workspace runs fully automated, it's very important for KUBI, as members' first point of contact, to add a unique, friendly touch. That's why Second Space chose ElevenLabs' Conversational AI to create fun and memorable interactions with members. Let's see KUBI in action.

How KUBI works

KUBI employs a sophisticated multi-sensory architecture to simulate human-like interaction. The system hinges on a microservices architecture, where specialized services operate concurrently and communicate via a real-time event stream. These services manage various tasks, including facial and object recognition using real-time AI inference, cup detection and sanity checks via cameras, receipt printing, secure facial recognition for access control, and precise control of milk and bean dispensers.

These are some of the services that are running concurrently:

  • Environment Camera Service: Uses real-time AI inference (PyTorch in Python) to spot faces and objects.
  • Tablet Camera Service: Very similar, but detects cups and foreign objects on the table and runs sanity checks, such as verifying that the KUBI robot is actually holding a cup.
  • Receipt Printing Service: Simple and reliable with Node + TypeScript. Talks to an RS232 thermal printer.
  • Payment Service: Built with Kotlin JVM for solid concurrency and type safety. Handles government receipt reporting and communication with a credit card terminal, crypto payment gateway, or online payment providers.
  • Milk & Bean Dispensers: Separate precision services — Arduino. Time-sensitive, low latency.
  • Facial Recognition: Secure and strongly typed Kotlin service, used for access control.
  • Water Jet Service: Automatically cleans milk steaming jugs after use — Arduino.
  • And various other services, e.g. the mobile app API, menu display, etc.

Why all these microservices? Easy — we manage them independently, scale easily, and use the best tools for each task.
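To give a flavor of how services could talk over that stream, here's a minimal Kotlin sketch. The event types and the EventBus interface are hypothetical stand-ins; the post doesn't specify the actual transport or schema.

// Hypothetical shared event vocabulary for the real-time stream.
sealed interface KubiEvent {
    data class FaceDetected(val personId: String?) : KubiEvent
    data class CupDetected(val onTable: Boolean, val inGripper: Boolean) : KubiEvent
    data class PaymentCompleted(val orderId: String, val amountTwd: Int) : KubiEvent
}

// Stand-in for whatever transport the stream actually uses.
interface EventBus {
    fun publish(event: KubiEvent)
    fun subscribe(handler: (KubiEvent) -> Unit)
}

// e.g. the Tablet Camera Service publishing a sanity-check result:
fun onFrameAnalyzed(bus: EventBus, cupInGripper: Boolean) {
    bus.publish(KubiEvent.CupDetected(onTable = false, inGripper = cupInGripper))
}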

A central event-driven core to tie it all together

Coordinating all these microservices is a central service, humorously called "BigBoy". It's essentially a giant, non-blocking event processor.

Here's how BigBoy works:

  1. Listens to incoming events from all services.
  2. Checks scenarios for eligible triggers.
  3. Selects the best scenario.
  4. Schedules actions for playback.
For example, here's a simplified scenario that has KUBI comment on the weather when idle:

internal object WeatherIdleScenario : SingleTaskScenario(scenario) {

    importance = Importance.Medium
    compilationTimeout = Time.ThreeSeconds
    interruptable = false
    executionExpiration = Time.TenSeconds

    override fun isEligible(event: Event, environment: Environment): Maybe<Boolean> = withEnvironment(environment) {
        just {
            (event is IdleEvent
                && !triggeredInLast(40.minutes)
                && (personPresent() || hasActiveSessions)
                && environment.weatherService.lastReportWithin(10.minutes))
        }
    }
}

private val scenario = ScenarioRecipe { event, env, session ->

    invokeOneOf(

        phrase {
            sayWith {
                "Rainy day today, isn't it? That's why I have my little umbrella! Look!".asEnglish
            }.withAutoGif().withAutoMotion()
        }.given { Weather.isRaining() },

        phrase {
            sayWith {
                "Friend, it's so cold outside! So sad for you... because you're a human. I don't really mind!".asEnglish
            }.withAutoMotion()

            sayWith {
                "Wait, that sounded a bit rude.".asEnglish
            }.withAutoMotion()

        }.given { Weather.isCold() },

    )
}
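Tying the four steps together, BigBoy's core loop might look roughly like this. It's a minimal sketch: the Event, Scenario, and scheduler types are simplified stand-ins for the real, unpublished APIs.

import kotlinx.coroutines.channels.Channel

// Simplified stand-ins for BigBoy's actual types.
interface Event
interface Action
interface Scenario {
    val importance: Int
    fun isEligible(event: Event): Boolean      // the real version also receives an Environment
    fun compile(event: Event): List<Action>    // "compiles" the scenario into action events
}
fun interface ActionScheduler { fun schedule(actions: List<Action>) }

class BigBoy(
    private val events: Channel<Event>,        // incoming events from all services
    private val scenarios: List<Scenario>,
    private val scheduler: ActionScheduler,
) {
    suspend fun run() {
        for (event in events) {                                       // 1. listen
            val eligible = scenarios.filter { it.isEligible(event) }  // 2. check eligible triggers
            val best = eligible.maxByOrNull { it.importance }         // 3. select the best scenario
                ?: continue
            scheduler.schedule(best.compile(event))                   // 4. schedule actions for playback
        }
    }
}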

What are scenarios?

Think of scenarios as non-blocking compilers for robot action events. An action event is usually the most downstream event, that is, the last step in a chain, resulting in a physical effect such as motion or speech. For instance, a greeting scenario might trigger:

SayEvent("Hello! Welcome!", wave.gif)
MotionEvent(HelloMotion)

Event Generation with LLM: Some action events are automatically generated by an LLM. For example, withAutoMotion picks the best motion from a pre-defined list based on the given context, while withAutoGif uses an LLM to generate the most suitable tag for the given phrase. The tag is used to fetch a GIF from Giphy, which is later displayed on KUBI's face together with the phrase.
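Here's a rough sketch of what the GIF half could look like. The generateTag callback stands in for an LLM call, GIPHY_KEY is a placeholder, and the JSON handling is deliberately naive; only Giphy's public search endpoint is real.

import java.net.URI
import java.net.URLEncoder
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

const val GIPHY_KEY = "<your-giphy-api-key>"   // placeholder

// Hypothetical pipeline behind withAutoGif: phrase -> LLM tag -> Giphy URL.
fun autoGif(phrase: String, generateTag: (String) -> String): String {
    val tag = generateTag("Pick one short Giphy search tag for: \"$phrase\"")
    val query = URLEncoder.encode(tag, Charsets.UTF_8)
    val request = HttpRequest.newBuilder(
        URI("https://api.giphy.com/v1/gifs/search?api_key=$GIPHY_KEY&q=$query&limit=1")
    ).GET().build()
    val body = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
        .body()
    // A real service would parse the JSON properly; this only illustrates the flow.
    return body.substringAfter("\"url\":").substringAfter('"').substringBefore('"')
}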

Synchronization of action events: These events then flow through a scheduler that keeps speech, facial expressions, and motions synchronized, so KUBI's speech matches its gestures perfectly.
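One way to picture the scheduler with Kotlin coroutines: actions scheduled as a group start together, and the next group only runs once the whole group has finished. The ActionEvent types and the speak/move stubs are simplified stand-ins.

import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch

sealed interface ActionEvent
data class Say(val text: String) : ActionEvent
data class Motion(val name: String) : ActionEvent

suspend fun speak(text: String) { /* play TTS audio (stand-in) */ }
suspend fun move(name: String) { /* drive the robot motion (stand-in) */ }

// Actions in a group start at the same time; coroutineScope returns only
// after all of them complete, keeping speech and gestures aligned.
suspend fun playSynchronized(group: List<ActionEvent>) = coroutineScope {
    group.forEach { action ->
        launch {
            when (action) {
                is Say -> speak(action.text)
                is Motion -> move(action.name)
            }
        }
    }
}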

Flexible and Extendable

The cool thing is that scenarios can even listen to action events and trigger new action events dynamically. For example:

  • If BigBoy detects SayEvent("Merry Christmas"), it can automatically trigger festive lights and special effects in the room (see the sketch after this list).
  • Another cool example: if the user orders through our Mobile App, all user interactions (clicking on a product, making a payment, etc.) are converted into events, and BigBoy can react in real time. For instance, if the user scrolls past “Oatmilk Latte”, KUBI might say “Are you sure you don’t want that Oatmilk Latte? It’s really good!”
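A reactive scenario in the style of the weather example might look like this. The trigger helper and the LightsEvent/EffectsEvent types are invented for illustration; the real DSL isn't published.

// Hypothetical reactive scenario: listens for an action event (speech)
// and emits new downstream action events with a physical effect.
internal object ChristmasEffectsScenario : SingleTaskScenario(scenario) {
    override fun isEligible(event: Event, environment: Environment): Maybe<Boolean> =
        withEnvironment(environment) {
            just { event is SayEvent && event.text.contains("Merry Christmas") }
        }
}

private val scenario = ScenarioRecipe { event, env, session ->
    trigger(LightsEvent(preset = "festive"))   // invented action event
    trigger(EffectsEvent("snow-machine"))      // invented action event
}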

BigBoy literally sees and knows everything going on. Pretty cool, huh?

DevOps and Observability

Most of the services are hosted locally and wrapped in Docker containers, where their lifecycle is managed by the Supervisor process control system. Error logs are collected in Sentry and fed into a custom admin app that monitors exceptions, the real-time status of services and sensors, and latency reports. The cool thing is that this Flutter admin app was 90% generated by AI.
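For a concrete flavor, a Supervisor program entry for one of the services might look like the snippet below; the service name, command, and log paths are made up for illustration.

; Hypothetical supervisord entry for the receipt-printing service
[program:receipt-printer]
command=node /srv/receipt-printer/dist/index.js
autostart=true
autorestart=true
startretries=5
stdout_logfile=/var/log/kubi/receipt-printer.out.log
stderr_logfile=/var/log/kubi/receipt-printer.err.log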

Using ElevenLabs to create memorable interactions

Second Space had a very specific personality in mind for KUBI: a mixture of Deadpool, Wheatley from the Portal games, and a bit of Pathfinder from Apex Legends. They managed to design the voice in 15 minutes, complete with emotions and pauses that make it even more human.

ElevenLabs powers KUBI’s speech capabilities through two core APIs:

Text-To-Speech (TTS)

  • Handles ~90% of our interactions.
  • Uses pre-designed scenarios for the perfect vibe.
  • LLM-generated messages can be personalized, with high-quality audio and the best pronunciation, since these requests aren't time-critical.
  • Offers incredibly natural multilingual speech in English, Chinese, Spanish, Japanese, and even Latvian (Latvian Deadpool, anyone?).
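A minimal TTS request against the ElevenLabs API looks roughly like this in Kotlin; the voice ID and API key are placeholders, and the JSON escaping is simplified for the sketch.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.nio.file.Path

// Sketch of a single TTS call; a real implementation would JSON-escape the text.
fun synthesize(text: String, apiKey: String, voiceId: String): Path {
    val json = """{"text": "$text", "model_id": "eleven_multilingual_v2"}"""
    val request = HttpRequest.newBuilder(
        URI("https://api.elevenlabs.io/v1/text-to-speech/$voiceId")
    )
        .header("xi-api-key", apiKey)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build()
    // The endpoint returns audio bytes (MP3 by default); save them to a file.
    return HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofFile(Path.of("speech.mp3")))
        .body()
}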

Conversational Mode (Real-Time)

Activated when a customer says "Hey KUBI!", ElevenLabs' Conversational AI responds in around 200ms, making the interaction feel truly human-like.

  • Priority: Low latency.
  • Trades some audio quality for responsiveness.
  • Uses ElevenLabs' new real-time language_detection tool to switch between languages on the fly.
  • A Conversational AI session starts on demand when a member enters the facility or says “Hey, KUBI!”

Custom Conversational Tools

Using ElevenLabs’ Conversational AI via a WebSocket connection, KUBI can leverage function calling, for example (see the sketch after this list):

  • make_order: Recognizes orders, sends events directly into BigBoy.
  • make_payment: Immediately notifies our PaymentService to trigger the credit card machine for payments.
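Here's a sketch of how those tool calls might be dispatched on the client side. The message shape follows ElevenLabs' client-tool-call WebSocket events, but treat the exact field names as an assumption and check the current docs; BigBoyBus and PaymentService are stand-ins for internal services.

import kotlinx.serialization.json.Json
import kotlinx.serialization.json.jsonObject
import kotlinx.serialization.json.jsonPrimitive

// Stand-ins for the internal services mentioned in the post.
fun interface BigBoyBus { fun publish(name: String, params: Map<String, String>) }
fun interface PaymentService { fun charge(params: Map<String, String>) }

fun handleMessage(text: String, bigBoy: BigBoyBus, payments: PaymentService) {
    val msg = Json.parseToJsonElement(text).jsonObject
    if (msg["type"]?.jsonPrimitive?.content != "client_tool_call") return

    val call = msg["client_tool_call"]!!.jsonObject
    val params = call["parameters"]!!.jsonObject
        .mapValues { (_, value) -> value.jsonPrimitive.content }
    when (call["tool_name"]!!.jsonPrimitive.content) {
        "make_order" -> bigBoy.publish("order", params)   // event flows into BigBoy
        "make_payment" -> payments.charge(params)         // trigger the card terminal
    }
}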

Switching between LLM models through ElevenLabs' admin panel helps Second Space optimize understanding and accuracy, since they noticed that some models recognize the tool intents better than others. They currently use Gemini 2.0 Flash as the core model for Conversational AI and GPT-4o for static speech generation.

Expanding KUBI to additional markets

Second Space’s first GitHub commits referencing ElevenLabs date back to January 2023, even before the multilingual model was released. They recognized ElevenLabs’ dedication to quality early on and confidently built out an architecture anticipating future multilingual support. Now, entering markets like Japan and South Korea is as simple as flipping a switch, with no extra dev work required!

Conclusion

Microservices, real-time events, and ElevenLabs' powerful voice technology make KUBI feel truly alive and ready to conquer and delight the world, one coffee and witty interaction at a time. 
