
Talking to statues: building a multimodal ElevenAgents-powered app


Photograph a statue, identify who it depicts, then talk to them by voice in real time, with each character speaking in a distinct, period-appropriate voice.

All of this is possible with ElevenLabs' Voice Design and Agents APIs. This article walks through the architecture of a mobile web app that combines computer vision with voice generation to turn public statues into interactive experiences. The API calls and code samples below are all reproducible.

Skip the tutorial: generate it from a single prompt

The entire app below was built from a single prompt, generated in Cursor with Claude Opus 4.5 (high) from a blank Next.js project. If you want to jump straight in, paste the following into your editor:

We need to make an app that:
- is optimised for mobile
- allows the user to take a picture (of a statue, picture, monument, etc) that includes one or more people
- uses an OpenAI LLM api call to identify the statue/monument/picture, characters within it, the location, and name
- allows the user to check it's correct, and then do either a deep research or a standard search to get information about the characters and the statue's history, and its current location
- then create an ElevenLabs agent (allowing multiple voices), that the user can then talk to as though they're talking to the characters in the statue. Each character should use voice designer api to create a matching voice.
The purpose is to be fun and educational.

https://elevenlabs.io/docs/eleven-api/guides/cookbooks/voices/voice-design
https://elevenlabs.io/docs/eleven-agents/quickstart 
https://elevenlabs.io/docs/api-reference/agents/create


You can also use the ElevenLabs Agents skills instead of pointing at the docs. They're built from the documentation and work even better.

The rest of this article breaks down what that prompt produces.

How it works

The flow has five steps:

  1. Take a photo
  2. Identify the artwork and its characters (OpenAI)
  3. Research the historical context (OpenAI)
  4. Design a unique voice for each character (ElevenAPI)
  5. Talk by voice in real time over WebRTC (ElevenAgents)
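The five server-side steps can be sketched as a single pipeline. Everything in this sketch is illustrative: the type names and step signatures are assumptions for clarity, not the project's actual code.

```typescript
// Hypothetical shapes for the pipeline's intermediate data.
interface Character {
  name: string;
  voiceDescription: string;
  voiceId?: string; // filled in after voice creation
}

interface Identification {
  statueName: string;
  characters: Character[];
}

// Each step is injected as a function, so the orchestration is
// independent of the specific OpenAI / ElevenLabs calls shown below.
interface PipelineSteps {
  identify: (photoBase64: string) => Promise<Identification>;
  research: (id: Identification) => Promise<string>;
  createVoices: (characters: Character[]) => Promise<Character[]>;
  createAgent: (
    id: Identification,
    history: string,
    characters: Character[],
  ) => Promise<string>; // returns an agent id
}

// Runs steps 2-4 in order; step 5 (the WebRTC session) happens client-side.
async function runPipeline(
  photoBase64: string,
  steps: PipelineSteps,
): Promise<string> {
  const identification = await steps.identify(photoBase64);
  const history = await steps.research(identification);
  const voiced = await steps.createVoices(identification.characters);
  return steps.createAgent(identification, history, voiced);
}
```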

Identifying the statue with vision

When a user takes a photo of a statue, the image is sent to a vision-capable OpenAI model. A structured system prompt extracts the artwork's name, location, artist, and era, plus a detailed voice description for each character. The system prompt includes the expected JSON output format:

{
  "statueName": "string - name of the statue, monument, or artwork",
  "location": "string - where it is located (city, country)",
  "artist": "string - the creator of the artwork",
  "year": "string - year completed or unveiled",
  "description": "string - brief description of the artwork and its historical significance",
  "characters": [
    {
      "name": "string - character name",
      "description": "string - who this person was and their historical significance",
      "era": "string - time period they lived in",
      "voiceDescription": "string - detailed voice description for Voice Design API (include audio quality marker, age, gender, vocal qualities, accent, pacing, and personality)"
    }
  ]
}

const response = await openai.chat.completions.create({
  model: "gpt-5.2",
  response_format: { type: "json_object" },
  messages: [
    { role: "system", content: SYSTEM_PROMPT },
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Identify this statue/monument/artwork and all characters depicted.",
        },
        {
          type: "image_url",
          image_url: {
            url: `data:image/jpeg;base64,${base64Data}`,
            detail: "high",
          },
        },
      ],
    },
  ],
  max_completion_tokens: 2500,
});

Pointing it at the statue of Boudica on Westminster Bridge in London returns:

{
  "statueName": "Boudica and Her Daughters",
  "location": "Westminster Bridge, London, UK",
  "artist": "Thomas Thornycroft",
  "year": "1902",
  "description": "Bronze statue depicting Queen Boudica riding a war chariot with her two daughters, commemorating her uprising against Roman occupation of Britain.",
  "characters": [
    {
      "name": "Boudica",
      "description": "Queen of the Iceni tribe who led an uprising against Roman occupation",
      "era": "Ancient Britain, 60-61 AD",
      "voiceDescription": "Perfect audio quality. A powerful woman in her 30s with a deep, resonant voice and a thick Celtic British accent. Her tone is commanding and fierce, with a booming quality that projects authority. She speaks at a measured, deliberate pace with passionate intensity."
    },
    // Other characters in the statue
  ]
}

How to write effective voice descriptions

The quality of the voice description directly determines the quality of the generated voice. The Voice Design prompting guide covers this in detail; the key elements are: an audio quality marker ("Perfect audio quality."), age and gender, tone/timbre (deep, resonant, gravelly), a specific accent (e.g. "thick Celtic British accent", not just "British accent"), and pacing. The more specific the description, the better the result: "a New York woman in her 60s with a sharp sense of humor" works far better than "older female voice".

The guide also notes a few points: use "thick" rather than "strong" when describing accents, avoid vague words like "foreign", and anchor fictional or historical characters in real-world accents (e.g. "an ancient Celtic queen with a thick British accent, dignified and commanding").
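As a sketch, these elements can be assembled programmatically. The helper below is hypothetical (not part of the generated app); it simply stacks the pieces the guide recommends in a fixed order, starting with the audio quality marker.

```typescript
// Illustrative structure for the elements of a good voice description.
interface VoiceTraits {
  age: string; // e.g. "A powerful woman in her 30s"
  tone: string; // e.g. "deep, resonant"
  accent: string; // be specific: "thick Celtic British accent"
  personality: string; // e.g. "Her tone is commanding and fierce."
  pacing: string; // e.g. "measured, deliberate"
}

// Assembles a Voice Design description: quality marker first,
// then age, tone, accent, personality, and pacing.
function buildVoiceDescription(t: VoiceTraits): string {
  return (
    `Perfect audio quality. ${t.age} with a ${t.tone} voice ` +
    `and a ${t.accent}. ${t.personality} ` +
    `Speaks at a ${t.pacing} pace.`
  );
}

const boudicaDescription = buildVoiceDescription({
  age: "A powerful woman in her 30s",
  tone: "deep, resonant",
  accent: "thick Celtic British accent",
  personality: "Her tone is commanding and fierce.",
  pacing: "measured, deliberate",
});
```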

Creating character voices with Voice Design

The Voice Design API generates entirely new synthetic voices from a text description, with no audio samples and no cloning. That makes it ideal for historical figures with no recorded audio.

It's a two-step process.

Generate previews

const { previews } = await elevenlabs.textToVoice.design({
  modelId: "eleven_multilingual_ttv_v2",
  voiceDescription: character.voiceDescription,
  text: sampleText,
});

The text matters. Longer, in-character sample text (50+ characters) produces more consistent voices, so use lines the character would actually say rather than generic greetings. The Voice Design prompting guide covers this in detail.
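For the Boudica example, an in-character sample might look like the following. The text itself is illustrative, written for this article rather than taken from the app.

```typescript
// In-character sample text for the Voice Design preview: longer and
// character-specific, rather than a generic greeting like "Hello there".
const sampleText =
  "I am Boudica, queen of the Iceni. When Rome broke its word and " +
  "scourged my daughters, I raised the tribes of Britain against the " +
  "legions. Ask me of that uprising, and I will tell you how an empire " +
  "learned to fear a woman's wrath.";
```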

Save the voice

Once the previews are generated, pick one and create it as a permanent voice:

const voice = await elevenlabs.textToVoice.create({
  voiceName: `StatueScanner - ${character.name}`,
  voiceDescription: character.voiceDescription,
  generatedVoiceId: previews[0].generatedVoiceId,
});

For statues with multiple characters, voices can be generated in parallel. Creating five voices takes roughly the same time as creating one:

const results = await Promise.all(
  characters.map((character) => createVoiceForCharacter(character))
);

Building a multi-voice ElevenLabs agent

With the voices created, the next step is configuring an ElevenLabs agent that can switch between character voices in real time.

const agent = await elevenlabs.conversationalAi.agents.create({
  name: `Statue Scanner - ${statueName}`,
  tags: ["statue-scanner"],
  conversationConfig: {
    agent: {
      firstMessage,
      language: "en",
      prompt: {
        prompt: systemPrompt,
        temperature: 0.7,
      },
    },
    tts: {
      voiceId: primaryCharacter.voiceId,
      modelId: "eleven_v3",
      supportedVoices: otherCharacters.map((c) => ({
        voiceId: c.voiceId,
        label: c.name,
        description: c.voiceDescription,
      })),
    },
    turn: {
      turnTimeout: 10,
    },
    conversation: {
      maxDurationSeconds: 600,
    },
  },
});

Multi-voice switching

The supportedVoices array tells the agent which voices are available. The Agents platform handles the switching automatically: when different characters appear in the LLM's response, the TTS engine routes each segment to the right voice.

Prompting for group conversation

Getting multiple characters to interact like a real group, rather than taking turns answering questions, takes deliberate prompt design:

const multiCharacterRules = `
MULTI-CHARACTER DYNAMICS:
You are playing ALL ${characters.length} characters simultaneously.
Make this feel like a group conversation, not an interview.

- Characters should interrupt each other:
  "Actually, if I may -" / "Wait, I must say -"

- React to what others say:
  "Well said." / "I disagree with that..." / "Always so modest..."

- Have side conversations:
  "Do you remember when -" / "Tell them about the time you -"

The goal is for users to feel like they are witnessing a real exchange
between people who happen to include them.
`;
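These rules then need to be combined with per-character personas into the agent's systemPrompt. A sketch of one way to do that is below; the function name, field names, and wording are assumptions for illustration, not the generated app's code.

```typescript
interface PromptCharacter {
  name: string;
  era: string;
  description: string;
}

// Illustrative: one persona line per character, then the shared
// group-dynamics rules appended at the end.
function buildSystemPrompt(
  characters: PromptCharacter[],
  multiCharacterRules: string,
): string {
  const personas = characters
    .map((c) => `- ${c.name} (${c.era}): ${c.description}`)
    .join("\n");
  return [
    "You are voicing every character depicted in this statue.",
    "CHARACTERS:",
    personas,
    multiCharacterRules,
  ].join("\n\n");
}
```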

Real-time voice over WebRTC

The final piece is the client connection. ElevenLabs Agents supports WebRTC for low-latency voice conversation, which is faster than WebSocket-based connections and helps turn-taking feel natural.

Server side: get a conversation token

const { token } = await client.conversationalAi.conversations.getWebrtcToken({
    agentId,
});

Client side: start the session

import { useConversation } from "@elevenlabs/react";

const conversation = useConversation({
  onConnect: () => setIsSessionActive(true),
  onDisconnect: () => setIsSessionActive(false),
  onMessage: (message) => {
    if (message.source === "ai") {
      setMessages((prev) => [...prev, { role: "agent", text: message.message }]);
    }
  },
});

await conversation.startSession({
  agentId,
  conversationToken: token,
  connectionType: "webrtc",
});

The useConversation hook handles microphone capture, audio streaming, voice activity detection, and playback.

Enriching the history with web search

For deeper historical context before the conversation starts, an enhanced research mode can be added using OpenAI's web search tool:

const response = await openai.responses.create({
  model: "gpt-5.2",
  instructions: RESEARCH_SYSTEM_PROMPT,
  tools: [{ type: "web_search_preview" }],
  input: `Research ${identification.statueName}. Search for current information
including location, visiting hours, and recent news about the artwork.`,
});

Project takeaways

This project shows that combining multimodal AI across text, research, vision, and audio can create interactive experiences that bridge the digital and physical worlds. There is still plenty of unexplored potential in multimodal agents, and we'd love to see what you build for education, work, and entertainment.

Start building

The APIs used in this project (Voice Design, ElevenAgents, and OpenAI) are available to use today.
