Forced Alignment

Learn how to turn spoken audio and text into a time-aligned transcript with ElevenLabs.

Overview

The ElevenLabs Forced Alignment API turns spoken audio and text into a time-aligned transcript. It is useful when you have an audio recording and its transcript but need exact timestamps for each word or phrase in the transcript. Typical uses include:

  • Matching subtitles to a video recording
  • Generating timings for an audiobook recording of an ebook
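
As an illustration of the subtitle use case, the sketch below groups word-level timings into SRT cues. It assumes the aligned words are already available as a list of dictionaries with "text", "start", and "end" keys (times in seconds). That shape, the function names, and the fixed words-per-cue grouping are assumptions made for this example, not part of the documented API.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group aligned words into cues of up to max_words and render them as SRT.

    Each word is assumed to look like {"text": str, "start": float, "end": float}.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_srt_time(chunk[0]["start"])   # cue starts with its first word
        end = format_srt_time(chunk[-1]["end"])      # and ends with its last word
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Grouping by a fixed word count keeps the example short; a real subtitle pipeline would more likely break cues on punctuation or pauses between words.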

Usage

To use Forced Alignment, call the ElevenLabs API directly.
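
A minimal sketch of a direct call, using only the Python standard library, might look like the following. The endpoint path, the "xi-api-key" header, and the "file" and "text" form field names are assumptions here and should be confirmed against the API reference; build_alignment_request is a hypothetical helper that only constructs the request and does not send it.

```python
import urllib.request
import uuid

# Assumed endpoint; confirm against the API reference before use.
API_URL = "https://api.elevenlabs.io/v1/forced-alignment"


def build_alignment_request(api_key: str, audio: bytes, filename: str,
                            text: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST with audio and text."""
    boundary = uuid.uuid4().hex
    # The transcript goes in as a plain string form field (assumed name: "text").
    text_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="text"\r\n\r\n'
        f"{text}\r\n"
    ).encode("utf-8")
    # The audio goes in as a file upload (assumed name: "file").
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = text_part + file_header + audio + closing
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "xi-api-key": api_key,  # assumed authentication header
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send the request; the response is expected to be JSON, and its exact fields should be taken from the API reference rather than from this sketch.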

Supported languages

Our multilingual v2 models support 29 languages:

English (USA, UK, Australia, Canada), Japanese, Chinese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (Saudi Arabia, UAE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian, and Russian.

Key facts

  • Input text format: Plain string only — do not wrap input text in JSON or any other structure
  • Diarization: Not supported; providing diarized text will produce unexpected results
  • Pricing: Same rate as the Speech to Text API
  • Maximum file size: 3 GB
  • Maximum audio duration: 10 hours
  • Maximum text length: 675,000 characters
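
The limits above can be checked locally before uploading. The following is a rough sketch: the function name is hypothetical, "3 GB" is read conservatively as decimal gigabytes, and characters are counted with Python's len (Unicode code points), which may differ from how the service counts them.

```python
# Limits taken from the key facts above.
MAX_FILE_BYTES = 3 * 1000**3          # assumption: "3 GB" read as decimal gigabytes
MAX_DURATION_SECONDS = 10 * 60 * 60   # 10 hours
MAX_TEXT_CHARS = 675_000


def check_alignment_inputs(file_size_bytes: int, duration_seconds: float,
                           text: str) -> list[str]:
    """Return a list of problems; an empty list means the inputs fit the limits."""
    problems = []
    if file_size_bytes > MAX_FILE_BYTES:
        problems.append(
            f"file is {file_size_bytes} bytes; limit is {MAX_FILE_BYTES}"
        )
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append(
            f"audio is {duration_seconds:.0f} s; limit is {MAX_DURATION_SECONDS} s"
        )
    if len(text) > MAX_TEXT_CHARS:
        problems.append(
            f"text is {len(text)} characters; limit is {MAX_TEXT_CHARS}"
        )
    return problems
```

For example, check_alignment_inputs(os.path.getsize(path), duration, transcript) could run before the upload (with os imported and the duration obtained from whatever audio tooling is already in use), so that oversized inputs are caught without a network round trip.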