Vercel AI SDK

Use the ElevenLabs Provider in the Vercel AI SDK to transcribe speech from audio and video files.
How-to guide · Assumes you have completed the Speech to Text quickstart and have a Vercel project set up.

ElevenLabs Provider

The ElevenLabs provider adds support for the ElevenLabs transcription API to the Vercel AI SDK.

Setup

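The ElevenLabs provider is available in the @ai-sdk/elevenlabs module. You can install it with npm, together with the ai package, which provides the experimental_transcribe function used below:

```bash
npm install ai @ai-sdk/elevenlabs
```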

Provider Instance

You can import the default provider instance elevenlabs from @ai-sdk/elevenlabs:

```typescript
import { elevenlabs } from "@ai-sdk/elevenlabs";
```

If you need a customized setup, you can import createElevenLabs from @ai-sdk/elevenlabs and create a provider instance with your settings:

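```typescript
import { createElevenLabs } from "@ai-sdk/elevenlabs";

// Example custom fetch that logs each request before delegating to the global fetch.
const customFetch = async (input: RequestInfo | URL, init?: RequestInit) => {
  console.log("ElevenLabs request:", init?.method ?? "GET");
  return fetch(input, init);
};

const elevenlabs = createElevenLabs({
  // custom settings, e.g. a custom fetch implementation
  fetch: customFetch,
});
```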

You can use the following optional settings to customize the ElevenLabs provider instance:

  • apiKey string

    API key sent using the Authorization header. It defaults to the ELEVENLABS_API_KEY environment variable.

  • headers Record<string,string>

    Custom headers to include in the requests.

  • fetch (input: RequestInfo, init?: RequestInit) => Promise<Response>

    Custom fetch implementation. Defaults to the global fetch function. You can use it as middleware to intercept requests, or to provide a custom fetch implementation, e.g. for testing.
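Because the fetch setting accepts any fetch-compatible function, it can act as middleware. The sketch below wraps a fetch function so that each outgoing request's method and URL are recorded before the request is forwarded. `withRequestLog`, `RequestLogEntry`, and `FetchLike` are hypothetical names used only for illustration; they are not part of @ai-sdk/elevenlabs.

```typescript
// One recorded request (hypothetical shape, for illustration only).
interface RequestLogEntry {
  method: string;
  url: string;
}

type FetchLike = (input: RequestInfo | URL, init?: RequestInit) => Promise<Response>;

// Wraps a fetch-compatible function so every request is recorded before it is sent.
function withRequestLog(baseFetch: FetchLike, log: RequestLogEntry[]): FetchLike {
  return async (input, init) => {
    // Resolve the URL whether the caller passed a string, a URL, or a Request.
    const url =
      typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
    const method = init?.method ?? (input instanceof Request ? input.method : "GET");
    log.push({ method, url });
    return baseFetch(input, init);
  };
}

// Usage sketch: wrap the global fetch. The result could be passed to
// createElevenLabs({ fetch: loggingFetch }).
const requestLog: RequestLogEntry[] = [];
const loggingFetch = withRequestLog(fetch, requestLog);
```

The same wrapper works in tests: pass a stub in place of the global fetch, and no network request is made.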

Transcription Models

You can create models that call the ElevenLabs transcription API using the .transcription() factory method.

The first argument is the model ID, e.g. scribe_v2.

```typescript
const model = elevenlabs.transcription("scribe_v2");
```

You can also pass additional provider-specific options using the providerOptions argument. For example, if you know the language of the audio beforehand, supplying it as an ISO-639-1 code (e.g. en) can sometimes improve transcription performance.

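```typescript
import { elevenlabs } from "@ai-sdk/elevenlabs";
import { experimental_transcribe as transcribe } from "ai";
import { readFile } from "node:fs/promises";

const result = await transcribe({
  model: elevenlabs.transcription("scribe_v2"),
  // Raw audio bytes; "audio.mp3" is a placeholder path for your own file.
  audio: await readFile("audio.mp3"),
  providerOptions: { elevenlabs: { languageCode: "en" } },
});

console.log(result.text);
```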
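When option values come from user input or configuration, it can help to check them against the constraints listed below before calling transcribe. The following is a minimal sketch that assumes a hand-written `ElevenLabsTranscriptionOptions` type mirroring the documented option names. Both that type and `buildProviderOptions` are hypothetical helpers, not exports of @ai-sdk/elevenlabs, and the lowercase language code and minimum of 1 speaker are assumptions.

```typescript
// Hand-written mirror of the documented provider options (hypothetical type).
type ElevenLabsTranscriptionOptions = {
  languageCode?: string;
  tagAudioEvents?: boolean;
  numSpeakers?: number;
  timestampsGranularity?: "none" | "word" | "character";
  diarize?: boolean;
  fileFormat?: "pcm_s16le_16" | "other";
};

// Checks a few documented constraints, drops unset fields so the API defaults apply,
// and nests the result under the `elevenlabs` key used by `providerOptions`.
function buildProviderOptions(options: ElevenLabsTranscriptionOptions) {
  const { languageCode, numSpeakers } = options;
  // ISO-639-1 codes have two letters and ISO-639-3 codes have three (lowercase assumed).
  if (languageCode !== undefined && !/^[a-z]{2,3}$/.test(languageCode)) {
    throw new Error(`languageCode must be an ISO-639-1 or ISO-639-3 code, got "${languageCode}"`);
  }
  // At most 32 speakers can be predicted; the lower bound of 1 is an assumption.
  if (
    numSpeakers !== undefined &&
    (!Number.isInteger(numSpeakers) || numSpeakers < 1 || numSpeakers > 32)
  ) {
    throw new Error(`numSpeakers must be an integer from 1 to 32, got ${numSpeakers}`);
  }
  const elevenlabs: Record<string, string | number | boolean> = {};
  for (const [key, value] of Object.entries(options)) {
    if (value !== undefined) elevenlabs[key] = value;
  }
  return { elevenlabs };
}

// Usage sketch: pass the returned object as `providerOptions` to transcribe().
const providerOptions = buildProviderOptions({ languageCode: "en", numSpeakers: 2 });
```

Validating locally surfaces configuration mistakes before any audio is uploaded.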

The following provider options are available:

  • languageCode string

    An ISO-639-1 or ISO-639-3 language code corresponding to the language of the audio file. Can sometimes improve transcription performance if known beforehand. Defaults to null, in which case the language is predicted automatically.

  • tagAudioEvents boolean

    Whether to tag audio events like (laughter), (footsteps), etc. in the transcription. Defaults to true.

  • numSpeakers integer

    The maximum number of speakers talking in the uploaded file. Can help with predicting who speaks when. The model can predict at most 32 speakers. Defaults to null, in which case the number of speakers is set to the maximum value the model supports.

  • timestampsGranularity enum

    The granularity of the timestamps in the transcription. Defaults to 'word'. Allowed values: 'none', 'word', 'character'.

  • diarize boolean

    Whether to annotate which speaker is currently talking in the uploaded file. Defaults to true.

  • fileFormat enum

    The format of the input audio. Defaults to 'other'. Allowed values: 'pcm_s16le_16', 'other'. For 'pcm_s16le_16', the input audio must be 16-bit PCM at a 16 kHz sample rate, single channel (mono), and little-endian byte order. Latency is lower than when passing an encoded waveform.
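To use 'pcm_s16le_16', the audio bytes must already be raw 16-bit little-endian PCM. The sketch below assumes you have mono Float32 samples in the range [-1, 1] that are already sampled at 16 kHz (for example, from the Web Audio API). It only converts the sample format; it does not resample or downmix. `floatTo16BitPcm` is a hypothetical helper name.

```typescript
// Converts mono Float32 samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
// Assumes the samples are already at a 16 kHz sample rate; no resampling is done here.
function floatTo16BitPcm(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so out-of-range values do not wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale asymmetrically: -1 maps to -32768 and 1 maps to 32767.
    const value = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
    view.setInt16(i * 2, value, true); // true = little-endian byte order
  }
  return bytes;
}
```

The resulting Uint8Array could then be passed as `audio` to transcribe() together with `providerOptions: { elevenlabs: { fileFormat: "pcm_s16le_16" } }`.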

Next steps

Server-side streaming

Transcribe audio in real time using the WebSocket-based streaming API.

API reference

Full Speech to Text API reference and parameters.