Talk to a Statue: Building a Multimodal App with ElevenAgents

By: Joe Reeve
Published: February 18, 2026
Last updated: June 29, 2026

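Photograph a statue. The app identifies the figures it depicts, and each one can then hold a real-time voice conversation with you in a distinct, period-appropriate voice.

All of this can be built with ElevenLabs' Voice Design and Agents APIs. This post walks through the architecture of a mobile web app that combines computer vision with voice generation to turn public monuments into interactive experiences. Everything here can be reproduced with the APIs and code samples below.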

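Skip the tutorial - build it with a single prompt

The entire app below was built from a single prompt, tested as a one-shot in Cursor with Claude Opus 4.5 (high) on an empty NextJS project. If you want to jump straight in, paste the following into your editor: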

We need to make an app that:
- is optimised for mobile
- allows the user to take a picture (of a statue, picture, monument, etc) that includes one or more people
- uses an OpenAI LLM api call to identify the statue/monument/picture, characters within it, the location, and name
- allows the user to check it's correct, and then do either a deep research or a standard search to get information about the characters and the statue's history, and it's current location
- then create an ElevenLabs agent (allowing multiple voices), that the user can then talk to as though they're talking to the characters in the statue. Each character should use voice designer api to create a matching voice.
The purpose is to be fun and educational.

https://elevenlabs.io/docs/eleven-api/guides/cookbooks/voices/voice-design
https://elevenlabs.io/docs/eleven-agents/quickstart 
https://elevenlabs.io/docs/api-reference/agents/create

You can also use the ElevenLabs Agent Skills instead of linking to the docs. They are built from the docs and can produce better results.

The rest of this post explains, step by step, what that prompt produces.

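How it works

The pipeline has five stages:

Capture an image
Identify the artwork and its characters (OpenAI)
Research the historical background (OpenAI)
Generate a unique voice for each character (ElevenAPI)
Start a real-time voice conversation over WebRTC (ElevenAgents)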
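
The five stages above can be sketched as a single orchestration function. This is a minimal sketch, not the app's actual code: the `Stages` interface, `runPipeline`, and every field name are hypothetical, and each stage is injected so the real OpenAI and ElevenLabs calls shown later in this post can be plugged in.

```typescript
// Hypothetical types and stage signatures; none of these names come from an SDK.
interface Character {
  name: string;
  voiceDescription: string;
  voiceId?: string;
}

interface Identification {
  statueName: string;
  characters: Character[];
}

interface Stages {
  identify(imageBase64: string): Promise<Identification>;
  research(identification: Identification): Promise<string>;
  createVoice(character: Character): Promise<string>; // resolves to a voice ID
  createAgent(identification: Identification, research: string): Promise<string>; // resolves to an agent ID
}

// Runs stages 2-4 in order and returns the agent ID needed for stage 5.
// Stage 1 (image capture) supplies imageBase64.
async function runPipeline(imageBase64: string, stages: Stages): Promise<string> {
  const identification = await stages.identify(imageBase64);
  const research = await stages.research(identification);

  // Voices do not depend on each other, so they are created in parallel.
  const characters = await Promise.all(
    identification.characters.map(async (character) => ({
      ...character,
      voiceId: await stages.createVoice(character),
    }))
  );

  return stages.createAgent({ ...identification, characters }, research);
}
```

Injecting the stages keeps the control flow testable with stubs, and the returned agent ID is what the client needs to open the WebRTC session.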

Identifying the statue with vision

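When the user photographs a statue, the image is sent to an OpenAI model together with a system prompt. The response is expected as JSON in the following shape: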

{
  "statueName": "string - name of the statue, monument, or artwork",
  "location": "string - where it is located (city, country)",
  "artist": "string - the creator of the artwork",
  "year": "string - year completed or unveiled",
  "description": "string - brief description of the artwork and its historical significance",
  "characters": [
    {
      "name": "string - character name",
      "description": "string - who this person was and their historical significance",
      "era": "string - time period they lived in",
      "voiceDescription": "string - detailed voice description for Voice Design API (include audio quality marker, age, gender, vocal qualities, accent, pacing, and personality)"
    }
  ]
}

const response = await openai.chat.completions.create({
  model: "gpt-5.2",
  response_format: { type: "json_object" },
  messages: [
    { role: "system", content: SYSTEM_PROMPT },
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Identify this statue/monument/artwork and all characters depicted.",
        },
        {
          type: "image_url",
          image_url: {
            url: `data:image/jpeg;base64,${base64Data}`,
            detail: "high",
          },
        },
      ],
    },
  ],
  max_completion_tokens: 2500,
});
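
The request above interpolates `base64Data` into a data URL, so the value must be the raw base64 payload without any prefix. A minimal sketch, assuming the photo arrives as a data URL (for example from a canvas or `FileReader`, which this post does not specify); `toBase64Data` is a hypothetical helper name.

```typescript
// Hypothetical helper: strips the "data:<mime>;base64," prefix from a data URL
// so the prefix is not duplicated when the request rebuilds the URL.
function toBase64Data(dataUrl: string): string {
  const match = /^data:[^;,]+;base64,(.+)$/.exec(dataUrl);
  if (!match) {
    throw new Error("Expected a base64-encoded data URL");
  }
  return match[1];
}
```

Failing loudly on anything that is not a base64 data URL avoids sending a malformed image to the model.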

Given a photo of the Boudica statue at Westminster Bridge in London, the response looks like this:

{
  "statueName": "Boudica and Her Daughters",
  "location": "Westminster Bridge, London, UK",
  "artist": "Thomas Thornycroft",
  "year": "1902",
  "description": "Bronze statue depicting Queen Boudica riding a war chariot with her two daughters, commemorating her uprising against Roman occupation of Britain.",
  "characters": [
    {
      "name": "Boudica",
      "description": "Queen of the Iceni tribe who led an uprising against Roman occupation",
      "era": "Ancient Britain, 60-61 AD",
      "voiceDescription": "Perfect audio quality. A powerful woman in her 30s with a deep, resonant voice and a thick Celtic British accent. Her tone is commanding and fierce, with a booming quality that projects authority. She speaks at a measured, deliberate pace with passionate intensity."
    },
    // Other characters in the statue
  ]
}
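
Because `json_object` mode guarantees valid JSON but not a particular shape, it can help to validate the response before using it. The following is a minimal sketch under that assumption; the interface names and `parseIdentification` are hypothetical, and the fields mirror the schema shown earlier.

```typescript
interface StatueCharacter {
  name: string;
  description: string;
  era: string;
  voiceDescription: string;
}

interface StatueIdentification {
  statueName: string;
  location: string;
  artist: string;
  year: string;
  description: string;
  characters: StatueCharacter[];
}

const isString = (value: unknown): value is string => typeof value === "string";

function isCharacter(value: unknown): value is StatueCharacter {
  if (typeof value !== "object" || value === null) return false;
  const c = value as Record<string, unknown>;
  return [c.name, c.description, c.era, c.voiceDescription].every(isString);
}

// Hypothetical helper: parses the model output and rejects anything that
// does not match the expected schema.
function parseIdentification(raw: string): StatueIdentification {
  const data = JSON.parse(raw) as Record<string, unknown> | null;
  if (typeof data !== "object" || data === null) {
    throw new Error("Response is not a JSON object");
  }
  const topLevel = [data.statueName, data.location, data.artist, data.year, data.description];
  if (!topLevel.every(isString)) {
    throw new Error("Response is missing a top-level string field");
  }
  const characters = data.characters;
  if (!Array.isArray(characters) || characters.length === 0 || !characters.every(isCharacter)) {
    throw new Error("Response has no valid characters");
  }
  return data as unknown as StatueIdentification;
}
```

With the chat completion from the request above, this would be called as `parseIdentification(response.choices[0].message.content ?? "")`.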

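Writing effective voice descriptions

The quality of the voice description determines the quality of the generated voice. The Voice Design prompt guide covers this in detail, but the key elements to include are: an audio quality marker ("Perfect audio quality."), age and gender, tone and timbre (deep, resonant, gravelly, and so on), a precise accent (be specific, as in "thick Celtic British accent"), and pacing. More specific prompts produce more accurate results. For example, "a woman in her 60s from New York with a dry sense of humor" works far better than "an older female voice".

A few notes from the guide: use "thick" rather than "strong" to describe the intensity of an accent, and avoid vague terms like "foreign". For fictional or historical figures, you can suggest a real-world accent as a reference (for example, "an ancient Celtic queen with a thick British accent, regal and commanding").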
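
The checklist above can be sketched as a small builder that assembles a description from its parts. This is illustrative only: `VoiceTraits` and `buildVoiceDescription` are hypothetical names, and in the app itself the description is written by the vision model rather than by code.

```typescript
// Hypothetical shape covering the elements listed above.
interface VoiceTraits {
  gender: string; // e.g. "woman"
  age: string;    // e.g. "in her 30s"
  tone: string;   // e.g. "deep, resonant"
  accent: string; // e.g. "thick Celtic British"
  pacing: string; // e.g. "measured, deliberate"
}

function buildVoiceDescription(traits: VoiceTraits): string {
  // The guide advises against vague accent terms such as "foreign".
  if (/\bforeign\b/i.test(traits.accent)) {
    throw new Error('Accent is too vague: name a specific accent instead of "foreign"');
  }
  return [
    "Perfect audio quality.", // audio quality marker goes first
    `A ${traits.gender} ${traits.age} with a ${traits.tone} voice and a ${traits.accent} accent.`,
    `Speaks at a ${traits.pacing} pace.`,
  ].join(" ");
}
```

A builder like this is mainly useful as a fallback or for tests, since it guarantees every element is present in a predictable order.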

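Creating the character voices

The Voice Design API generates new synthetic voices from a text description alone. No voice samples or cloning are needed, which makes it a good fit for historical figures for whom no recordings exist.

The process has two steps.

Generate previews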

const { previews } = await elevenlabs.textToVoice.design({
  modelId: "eleven_multilingual_ttv_v2",
  voiceDescription: character.voiceDescription,
  text: sampleText,
});

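The text parameter matters. Longer sample text that suits the character (50 words or more) produces more stable results. Write it as dialogue the character would speak rather than a generic greeting. The Voice Design prompt guide covers this in more detail.

Save the voice

Once the previews are generated, pick the one you want and save it permanently: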
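
The 50-word guideline above is easy to enforce before calling the API. A minimal sketch; `countWords` and `isSampleTextLongEnough` are hypothetical helper names, and the threshold simply restates the guideline rather than any limit enforced by the API.

```typescript
// Counts whitespace-separated words, ignoring leading and trailing whitespace.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// True when the sample text meets the recommended minimum length.
function isSampleTextLongEnough(text: string, minWords: number = 50): boolean {
  return countWords(text) >= minWords;
}
```

If the check fails, one option is to ask the LLM to extend the character's monologue before requesting previews.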

const voice = await elevenlabs.textToVoice.create({
  voiceName: `StatueScanner - ${character.name}`,
  voiceDescription: character.voiceDescription,
  generatedVoiceId: previews[0].generatedVoiceId,
});

For statues with several characters, the voices can be generated in parallel. Five voices take roughly the same time as one:

const results = await Promise.all(
  characters.map((character) => createVoiceForCharacter(character))
);
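
The post does not show the body of `createVoiceForCharacter`, so the following is one possible sketch that chains the two calls above and keeps one failed voice from rejecting the whole batch. The `VoiceDesignClient` interface is an assumption that mirrors only the calls and fields used in this post, not the SDK's own types; the extra `client` and sample-text parameters and the `createVoices` wrapper are also hypothetical.

```typescript
interface VoiceCharacter {
  name: string;
  voiceDescription: string;
}

// Assumed minimal shape of the two calls used above; not the SDK's own types.
interface VoiceDesignClient {
  textToVoice: {
    design(request: { modelId: string; voiceDescription: string; text: string }): Promise<{
      previews: { generatedVoiceId: string }[];
    }>;
    create(request: { voiceName: string; voiceDescription: string; generatedVoiceId: string }): Promise<{
      voiceId: string;
    }>;
  };
}

async function createVoiceForCharacter(
  client: VoiceDesignClient,
  character: VoiceCharacter,
  sampleText: string
): Promise<string> {
  const { previews } = await client.textToVoice.design({
    modelId: "eleven_multilingual_ttv_v2",
    voiceDescription: character.voiceDescription,
    text: sampleText,
  });
  const firstPreview = previews[0];
  if (!firstPreview) {
    throw new Error(`No preview returned for ${character.name}`);
  }
  const voice = await client.textToVoice.create({
    voiceName: `StatueScanner - ${character.name}`,
    voiceDescription: character.voiceDescription,
    generatedVoiceId: firstPreview.generatedVoiceId,
  });
  return voice.voiceId;
}

// Creates all voices in parallel; a failure yields voiceId: null for that
// character instead of rejecting the whole batch.
async function createVoices(
  client: VoiceDesignClient,
  characters: VoiceCharacter[],
  sampleTextFor: (character: VoiceCharacter) => string
): Promise<{ name: string; voiceId: string | null }[]> {
  return Promise.all(
    characters.map(async (character) => {
      try {
        const voiceId = await createVoiceForCharacter(client, character, sampleTextFor(character));
        return { name: character.name, voiceId };
      } catch (error) {
        return { name: character.name, voiceId: null };
      }
    })
  );
}
```

Catching per character is a design choice: a plain `Promise.all` rejects as soon as any one voice fails, which would discard voices that were created successfully.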

Building a multi-voice ElevenLabs Agent

With the voices created, the next step is to configure an ElevenLabs Agent that switches between character voices in real time.

const agent = await elevenlabs.conversationalAi.agents.create({
  name: `Statue Scanner - ${statueName}`,
  tags: ["statue-scanner"],
  conversationConfig: {
    agent: {
      firstMessage,
      language: "en",
      prompt: {
        prompt: systemPrompt,
        temperature: 0.7,
      },
    },
    tts: {
      voiceId: primaryCharacter.voiceId,
      modelId: "eleven_v3",
      supportedVoices: otherCharacters.map((c) => ({
        voiceId: c.voiceId,
        label: c.name,
        description: c.voiceDescription,
      })),
    },
    turn: {
      turnTimeout: 10,
    },
    conversation: {
      maxDurationSeconds: 600,
    },
  },
});

Multi-voice switching

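The supportedVoices array tells the agent which voices are available. The Agents platform handles voice switching automatically: when the LLM's response indicates that a different character is speaking, the platform switches to that character's voice.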
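
To show a transcript grouped by speaker, the client needs to split each response by character. The sketch below assumes that non-primary speakers are marked with XML-style tags named after each voice `label` (for example `<Daughter>...</Daughter>`); this post does not state the exact markup, so treat the tag format, `Segment`, and `splitBySpeaker` as assumptions to check against the Agents documentation.

```typescript
interface Segment {
  speaker: string;
  text: string;
}

// Splits a response into per-speaker segments. Untagged text is attributed to
// the primary character; tagged text to the label used as the tag name.
function splitBySpeaker(response: string, primarySpeaker: string, labels: string[]): Segment[] {
  const segments: Segment[] = [];
  const push = (speaker: string, text: string): void => {
    const trimmed = text.trim();
    if (trimmed !== "") segments.push({ speaker, text: trimmed });
  };

  if (labels.length === 0) {
    push(primarySpeaker, response);
    return segments;
  }

  // Escape regex metacharacters so labels are matched literally.
  const escaped = labels.map((label) => label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  const pattern = new RegExp(`<(${escaped.join("|")})>([\\s\\S]*?)</\\1>`, "g");

  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(response)) !== null) {
    push(primarySpeaker, response.slice(cursor, match.index));
    push(match[1], match[2]);
    cursor = match.index + match[0].length;
  }
  push(primarySpeaker, response.slice(cursor));
  return segments;
}
```

This only affects how the transcript is displayed; the voice switching itself happens on the platform side.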

Prompt design for group conversations

Making several characters feel like a real group, rather than a simple Q&A, takes careful prompt design:

const multiCharacterRules = `
MULTI-CHARACTER DYNAMICS:
You are playing ALL ${characters.length} characters simultaneously.
Make this feel like a group conversation, not an interview.

- Characters should interrupt each other:
  "Actually, if I may -" / "Wait, I must say -"

- React to what others say:
  "Well said." / "I disagree with that..." / "Always so modest..."

- Have side conversations:
  "Do you remember when -" / "Tell them about the time you -"

The goal is for users to feel like they are witnessing a real exchange
between people who happen to include them.
`;
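
The post does not show how `systemPrompt` is assembled, so here is one possible sketch that combines a character roster with the rules above. `buildSystemPrompt`, `PromptCharacter`, and the roster wording are hypothetical; the rules block is only appended when there is more than one character.

```typescript
interface PromptCharacter {
  name: string;
  era: string;
  description: string;
}

// Hypothetical helper: builds the agent's system prompt from the identified
// characters, adding the group-conversation rules only when they apply.
function buildSystemPrompt(
  statueName: string,
  characters: PromptCharacter[],
  multiCharacterRules: string
): string {
  const roster = characters
    .map((c) => `- ${c.name} (${c.era}): ${c.description}`)
    .join("\n");
  const base = `You are voicing the figures of "${statueName}".\n\nCHARACTERS:\n${roster}`;
  return characters.length > 1 ? `${base}\n${multiCharacterRules}` : base;
}
```

Keeping the rules separate from the roster makes it straightforward to skip them for single-figure statues, where interruptions and side conversations would make no sense.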

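Real-time voice conversation over WebRTC

The last step is the client connection. ElevenLabs Agents supports WebRTC for voice conversations with very low latency. It is considerably faster than a WebSocket-based connection, which matters for a natural conversational flow.

Server: issue a conversation token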

const { token } = await client.conversationalAi.conversations.getWebrtcToken({
  agentId,
});

Client: start the session

import { useConversation } from "@elevenlabs/react";

const conversation = useConversation({
  onConnect: () => setIsSessionActive(true),
  onDisconnect: () => setIsSessionActive(false),
  onMessage: (message) => {
    if (message.source === "ai") {
      setMessages((prev) => [...prev, { role: "agent", text: message.message }]);
    }
  },
});

await conversation.startSession({
  agentId,
  conversationToken: token,
  connectionType: "webrtc",
});

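The useConversation hook handles audio capture, streaming, voice activity detection, and playback.

Adding depth with web search

For users who want more historical context before the conversation starts, you can add a deep research mode using OpenAI's web search tool: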

const response = await openai.responses.create({
  model: "gpt-5.2",
  instructions: RESEARCH_SYSTEM_PROMPT,
  tools: [{ type: "web_search_preview" }],
  input: `Research ${identification.statueName}. Search for current information
including location, visiting hours, and recent news about the artwork.`,
});

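What we learned

This project shows that combining AI modalities such as text, research, vision, and audio can create new experiences that bridge the digital and physical worlds. Multimodal agents have a lot of untapped potential in education, work, and entertainment, and we hope more people will explore it.

Get started today

The APIs used in this project (Voice Design, ElevenAgents, and OpenAI) are all available today.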