Forced Alignment

Learn how to turn spoken audio and text into a time-aligned transcript with ElevenLabs.

Overview

The ElevenLabs Forced Alignment API turns spoken audio and text into a time-aligned transcript. This is useful when you have an audio recording and a matching transcript but need exact timestamps for each word or phrase in the transcript. Typical uses include:

  • Matching subtitles to a video recording
  • Generating timings for an audiobook recording of an ebook
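As a rough illustration of the subtitle use case, the sketch below groups word-level timings into SubRip (SRT) cues. It assumes the aligned words are available as dictionaries with `text`, `start`, and `end` keys, with times in seconds. That shape and the helper names `format_timestamp` and `words_to_srt` are assumptions made for this example, not something this page specifies.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group aligned words into SRT cues of at most max_words words each."""
    # Skip whitespace-only entries in case the word list contains spacing tokens.
    words = [w for w in words if w["text"].strip()]
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_timestamp(chunk[0]["start"])
        end = format_timestamp(chunk[-1]["end"])
        text = " ".join(w["text"].strip() for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

Splitting on a fixed word count keeps the sketch short. A real subtitle pipeline would more likely break cues at punctuation or at pauses between words.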

Usage

You can use Forced Alignment by calling the ElevenLabs API directly.
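A direct call might look like the following minimal sketch, which uses only the Python standard library. The endpoint path, the `file` and `text` form field names, and the `xi-api-key` header are assumptions here and should be checked against the API reference. `build_multipart` and `align` are hypothetical helper names.

```python
import json
import urllib.request
import uuid

# Assumed endpoint; confirm against the API reference before relying on it.
API_URL = "https://api.elevenlabs.io/v1/forced-alignment"


def build_multipart(text: str, audio: bytes, filename: str, boundary: str) -> bytes:
    """Encode a text field and a file field as a multipart/form-data body."""
    sep = f"--{boundary}\r\n".encode()
    parts = [
        sep,
        b'Content-Disposition: form-data; name="text"\r\n\r\n',
        text.encode("utf-8"),
        b"\r\n",
        sep,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio,
        b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    return b"".join(parts)


def align(api_key: str, audio_path: str, text: str) -> dict:
    """POST the audio and transcript, then return the parsed JSON response."""
    boundary = uuid.uuid4().hex
    with open(audio_path, "rb") as f:
        body = build_multipart(text, f.read(), "audio", boundary)
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "xi-api-key": api_key,  # assumed authentication header name
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

This sketch reads the whole file into memory, which is fine for short clips but not for recordings near the size limit. For large files, a streaming upload is the better choice.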

Supported languages

Our v2 models support 29 languages:

English (USA, UK, Australia, Canada), Japanese, Chinese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (Saudi Arabia, UAE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian, and Russian.

FAQ

Forced alignment is a technique for aligning spoken audio with its text. It is useful when you have an audio recording and a transcript but need exact timestamps for each word or phrase in the transcript.

The input text should be a plain string with no special formatting or wrapping, such as JSON.

Example of good input text:

"Hello, how are you?"

Example of bad input text:

{
"text": "Hello, how are you?"
}
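If transcripts sometimes arrive wrapped in JSON like the bad example above, a small guard can unwrap them before they are sent. This is only a sketch: `plain_transcript` is a hypothetical helper, and the `"text"` key simply mirrors the example above.

```python
import json


def plain_transcript(raw: str) -> str:
    """Return the bare transcript, unwrapping a JSON object with a "text" key."""
    stripped = raw.strip()
    if stripped.startswith("{"):
        try:
            parsed = json.loads(stripped)
        except json.JSONDecodeError:
            return stripped  # not valid JSON, so treat it as ordinary text
        if isinstance(parsed, dict) and isinstance(parsed.get("text"), str):
            return parsed["text"]
    return stripped
```

Text that merely begins with a brace but is not valid JSON passes through unchanged, so ordinary transcripts are not altered.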

Forced Alignment costs the same as the Speech to Text API.

Forced Alignment does not support diarization. If you provide diarized text, the API will likely return unwanted results.
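One possible workaround is to strip speaker labels from a diarized transcript before submitting it. The sketch below assumes labels of the form `Speaker 1:` or `speaker_0:` at the start of a line. That format is an assumption, so adjust the pattern to match your own transcripts. `strip_speaker_labels` is a hypothetical helper name.

```python
import re

# Assumed label format: "Speaker 1:", "speaker_0:", and similar, at line start.
SPEAKER_LABEL = re.compile(
    r"^[ \t]*speaker[ _]?\d+[ \t]*:[ \t]*",
    re.IGNORECASE | re.MULTILINE,
)


def strip_speaker_labels(text: str) -> str:
    """Remove leading speaker labels and join the remaining lines with spaces."""
    without_labels = SPEAKER_LABEL.sub("", text)
    lines = (line.strip() for line in without_labels.splitlines())
    return " ".join(line for line in lines if line)
```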

The maximum file size for Forced Alignment is 1 GB.

For audio and video files, the maximum duration is 4.5 hours.

For the text input, the maximum length is 100,000,000 (one hundred million) characters.
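These limits can be checked locally before uploading. The sketch below is one way to do that. Treating 1 GB as 10**9 bytes is an assumption, and `check_limits` is a hypothetical helper that expects the caller to supply the file size and duration.

```python
# Limits quoted above. Interpreting 1 GB as 10**9 bytes is an assumption.
MAX_FILE_BYTES = 1_000_000_000
MAX_DURATION_SECONDS = 4.5 * 60 * 60
MAX_TEXT_CHARS = 100_000_000


def check_limits(file_bytes: int, duration_seconds: float, text: str) -> list[str]:
    """Return human-readable limit violations; the list is empty if all pass."""
    problems = []
    if file_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {file_bytes} bytes; limit is {MAX_FILE_BYTES}")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append(
            f"duration is {duration_seconds:.0f} s; limit is {MAX_DURATION_SECONDS:.0f} s"
        )
    if len(text) > MAX_TEXT_CHARS:
        problems.append(f"text is {len(text)} characters; limit is {MAX_TEXT_CHARS}")
    return problems
```

The file size can come from `os.path.getsize(path)`. Measuring the duration requires an audio library or a tool such as ffprobe, which this sketch leaves out.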