Create transcript | ElevenLabs Documentation

Transcribe an audio or video file. If webhook is set to true, the request will be processed asynchronously and results sent to configured webhooks. When use_multi_channel is true and the provided audio has multiple channels, a ‘transcripts’ object with separate transcripts for each channel is returned. Otherwise, returns a single transcript.

Headers

xi-api-keystringRequired

Query parameters

enable_loggingbooleanOptionalDefaults to true

When enable_logging is set to false zero retention mode will be used for the request. This will mean history features are unavailable for this request, including request stitching. Zero retention mode may only be used by enterprise customers.

Request

This endpoint expects a multipart form containing an optional file.

model_idstringRequired

The ID of the model to use for transcription, currently only ‘scribe_v1’ and ‘scribe_v1_experimental’ are available.

filefileOptional

The file to transcribe. All major audio and video formats are supported. Exactly one of the file or cloud_storage_url parameters must be provided. The file size must be less than 1GB.

language_codestring or nullOptional

An ISO-639-1 or ISO-639-3 language_code corresponding to the language of the audio file. Can sometimes improve transcription performance if known beforehand. Defaults to null, in this case the language is predicted automatically.

tag_audio_eventsbooleanOptionalDefaults to true

Whether to tag audio events like (laughter), (footsteps), etc. in the transcription.

num_speakersinteger or nullOptional>=1<=32

The maximum amount of speakers talking in the uploaded file. Can help with predicting who speaks when. The maximum amount of speakers that can be predicted is 32. Defaults to null, in this case the amount of speakers is set to the maximum value the model supports.

timestamps_granularityenumOptionalDefaults to word

The granularity of the timestamps in the transcription. ‘word’ provides word-level timestamps and ‘character’ provides character-level timestamps per word.

Allowed values:

diarizebooleanOptionalDefaults to false

Whether to annotate which speaker is currently talking in the uploaded file.

diarization_thresholddouble or nullOptional>=0.1<=0.4

Diarization threshold to apply during speaker diarization. A higher value means there will be a lower chance of one speaker being diarized as two different speakers but also a higher chance of two different speakers being diarized as one speaker (less total speakers predicted). A low value means there will be a higher chance of one speaker being diarized as two different speakers but also a lower chance of two different speakers being diarized as one speaker (more total speakers predicted). Can only be set when diarize=True and num_speakers=None. Defaults to None, in which case we will choose a threshold based on the model_id (0.22 usually).

additional_formatslist of objectsOptional

A list of additional formats to export the transcript to.

file_formatenumOptionalDefaults to other

The format of input audio. Options are ‘pcm_s16le_16’ or ‘other’ For pcm_s16le_16, the input audio must be 16-bit PCM at a 16kHz sample rate, single channel (mono), and little-endian byte order. Latency will be lower than with passing an encoded waveform.

Allowed values:

cloud_storage_urlstring or nullOptional

The valid AWS S3, Cloudflare R2 or Google Cloud Storage URL of the file to transcribe. Exactly one of the file or cloud_storage_url parameters must be provided. The file must be a valid publicly accessible cloud storage URL. The file size must be less than 2GB. URL can be pre-signed.

webhookbooleanOptionalDefaults to false

Whether to send the transcription result to configured speech-to-text webhooks. If set the request will return early without the transcription, which will be delivered later via webhook.

webhook_idstring or nullOptional

Optional specific webhook ID to send the transcription result to. Only valid when webhook is set to true. If not provided, transcription will be sent to all configured speech-to-text webhooks.

temperaturedouble or nullOptional>=0<=2

Controls the randomness of the transcription output. Accepts values between 0.0 and 2.0, where higher values result in more diverse and less deterministic results. If omitted, we will use a temperature based on the model you selected which is usually 0.

seedinteger or nullOptional>=0<=2147483647

If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed. Must be an integer between 0 and 2147483647.

use_multi_channelbooleanOptionalDefaults to false

Whether the audio file contains multiple channels where each channel contains a single speaker. When enabled, each channel will be transcribed independently and the results will be combined. Each word in the response will include a ‘channel_index’ field indicating which channel it was spoken on. A maximum of 5 channels is supported.

Response

Synchronous transcription result

SpeechToTextChunkResponseModelobject

MultichannelSpeechToTextResponseModelobject

SpeechToTextWebhookResponseModelobject

1	from elevenlabs import ElevenLabs
2
3	client = ElevenLabs(
4	api_key="YOUR_API_KEY",
5	)
6	client.speech_to_text.convert(
7	model_id="model_id",
8	)

1	{
2	"language_code": "en",
3	"language_probability": 0.98,
4	"text": "Hello world!",
5	"words": [
6	{
7	"text": "Hello",
8	"start": 0,
9	"end": 0.5,
10	"type": "word",
11	"speaker_id": "speaker_1",
12	"logprob": -0.124
13	}
14	]
15	}

Headers

Query parameters

Request

Response

Errors