Forced Alignment
Learn how to turn spoken audio and text into a time-aligned transcript with ElevenLabs.
Overview
The ElevenLabs Forced Alignment API turns spoken audio and text into a time-aligned transcript. It is useful when you have an audio recording and a matching transcript but need exact timestamps for each word or phrase in the transcript. Typical uses include:
- Matching subtitles to a video recording
- Generating timings for an audiobook recording of an ebook
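As a rough sketch of the subtitle use case, the snippet below groups aligned words into SRT cues. It assumes the alignment result has already been parsed into a list of words, each with `text`, `start`, and `end` fields (times in seconds); that shape, and the helper names `format_timestamp` and `words_to_srt`, are illustrative assumptions rather than a documented contract.

```python
from typing import Dict, List, Union

# Assumed word shape: {"text": str, "start": float, "end": float}, times in seconds.
Word = Dict[str, Union[str, float]]


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group aligned words into fixed-size cues and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_timestamp(float(chunk[0]["start"]))
        end = format_timestamp(float(chunk[-1]["end"]))
        text = " ".join(str(w["text"]) for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Fixed-size grouping keeps the sketch short; a production subtitle pipeline would more likely break cues on punctuation or pauses.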
Usage
You can use Forced Alignment by calling the ElevenLabs API directly.
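A direct call might look like the minimal sketch below, which uses only the Python standard library. The endpoint path, the `xi-api-key` header, and the `file` and `text` form field names are assumptions here and should be confirmed against the API reference; `build_multipart` and `align` are hypothetical helper names.

```python
import json
import urllib.request
import uuid

# Assumed endpoint; confirm against the API reference before use.
API_URL = "https://api.elevenlabs.io/v1/forced-alignment"


def build_multipart(text: str, audio: bytes, filename: str, boundary: str) -> bytes:
    """Encode the transcript and audio as a multipart/form-data body."""
    text_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="text"\r\n'
        "\r\n"
        f"{text}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return text_part + file_header + audio + closing


def align(api_key: str, audio_path: str, text: str) -> dict:
    """Send audio and transcript to the (assumed) endpoint and return parsed JSON."""
    boundary = uuid.uuid4().hex
    with open(audio_path, "rb") as f:
        body = build_multipart(text, f.read(), "audio", boundary)
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "xi-api-key": api_key,  # assumed authentication header
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Reading the whole file into memory is acceptable for short clips; for recordings near the size limit, a streaming upload would be a better fit.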
Supported languages
Our v2 models support 29 languages:
English (USA, UK, Australia, Canada), Japanese, Chinese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (Saudi Arabia, UAE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian, and Russian.
FAQ
What is forced alignment?
Forced alignment is a technique used to align spoken audio with text. It’s useful when you have an audio recording and a transcript but need exact timestamps for each word or phrase in the transcript.
What text input formats are supported?
The input text should be a plain string with no special formatting, such as JSON.
Example of good input text:
Example of bad input text:
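To illustrate both cases, the sketch below contrasts a plain string with a JSON-wrapped transcript and flattens the latter before submission. The sample strings, the `segments` and `text` field names, and the `flatten_transcript` helper are all hypothetical and only stand in for whatever structure your transcript uses.

```python
import json

# Plain text: suitable as input.
good_text = "Hello, how are you? I am fine, thank you."

# JSON-wrapped text: should be flattened first (hypothetical structure).
bad_text = json.dumps(
    {"segments": [{"text": "Hello, how are you?"}, {"text": "I am fine, thank you."}]}
)


def flatten_transcript(raw: str) -> str:
    """Return plain text from a JSON transcript; pass plain text through unchanged."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # not JSON, assume it is already plain text
    if isinstance(data, dict) and isinstance(data.get("segments"), list):
        return " ".join(str(seg["text"]).strip() for seg in data["segments"])
    return raw
```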
How much does Forced Alignment cost?
Forced Alignment costs the same as the Speech to Text API.
Does Forced Alignment support diarization?
Forced Alignment does not support diarization. If you provide diarized text (for example, a transcript with speaker labels), the API will likely return unwanted results.
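If your transcript is diarized, one option is to strip the speaker labels before submitting it. The sketch below assumes labels of the form `Speaker 1:` or `[speaker_0]` at the start of a line; real diarized transcripts vary, so the pattern is an assumption to adapt rather than a general solution.

```python
import re

# Matches an assumed label style at the start of a line:
# "Speaker 1:" or "[speaker_0]" (case-insensitive).
SPEAKER_LABEL = re.compile(
    r"^\s*(?:\[speaker[ _]?\w+\]|speaker[ _]?\w+:)\s*",
    re.IGNORECASE,
)


def strip_speaker_labels(diarized: str) -> str:
    """Remove leading speaker labels from each line and join lines into one string."""
    lines = [SPEAKER_LABEL.sub("", line).strip() for line in diarized.splitlines()]
    return " ".join(line for line in lines if line)
```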
What is the maximum audio or video file size for Forced Alignment?
The maximum file size for Forced Alignment is 1GB.
What is the maximum duration for a Forced Alignment input file?
For audio and video files, the maximum duration is 4.5 hours.
For the text input, the maximum length is 100,000,000 (one hundred million) characters.
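The limits above can be checked on the client before uploading. The sketch below is a hypothetical helper, not part of any SDK: it assumes 1GB means 10^9 bytes and expects the caller to supply the media duration (for example, from a media probing tool), since the standard library cannot read it for arbitrary formats.

```python
from typing import List

MAX_FILE_BYTES = 1_000_000_000    # 1GB, assuming decimal gigabytes
MAX_DURATION_SECONDS = 4.5 * 3600  # 4.5 hours
MAX_TEXT_CHARS = 100_000_000       # text limit as stated above


def check_limits(file_size_bytes: int, duration_seconds: float, text: str) -> List[str]:
    """Return a list of limit violations; an empty list means the input fits."""
    problems = []
    if file_size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {file_size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append(f"duration is {duration_seconds:.0f}s, limit is {MAX_DURATION_SECONDS:.0f}s")
    if len(text) > MAX_TEXT_CHARS:
        problems.append(f"text is {len(text)} characters, limit is {MAX_TEXT_CHARS}")
    return problems
```

The file size can be obtained with `os.path.getsize(path)` before calling `check_limits`.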