Image & Video

Complete guide to creating and editing images and videos in ElevenLabs.


Overview

Image & Video enables you to create high-quality visual content from simple text descriptions or reference images. Generate static images or dynamic videos in any style, then refine them iteratively with additional prompts, upscale for high-resolution output, and even add lip-sync with audio. Export finished assets as standalone files or import them directly into Studio projects.

This feature is currently in beta.

Guide

Follow these steps to create your first visual asset:

1. Select your mode

Use the toggle in the upper right corner of the prompt box to choose between Image and Video generation.

2. Provide a prompt or reference

Describe your desired output using natural language in the prompt box. For more control, drag existing images or videos from the Explore or History tabs into the reference slots, or upload your own reference images in formats such as JPG, PNG, and WEBP.

3. Choose a model and settings

Select the generative model that best fits your goal (e.g., OpenAI Sora 2 Pro, Google Veo 3.1, Kling 2.5, Flux 1 Kontext Pro). See the Models section for detailed information on each model. Adjust settings such as aspect ratio, resolution, duration (for video), and the number of variations to generate.

4. Generate your asset

Click the Generate button. Your assets will be created and displayed in the History tab for review.

5. Enhance and refine

Use enhancement tools to perfect your media. Upscale the resolution, apply realistic LipSync with audio, or click Recreate to generate a new variation with the same settings.

6. Share with others

Click the Share button to generate a unique link for your creation. Send it to teammates and collaborators to collect feedback.

7. Export your creation

Download the asset as a standalone file or import it directly into a Studio project.

Workflow

The creation process moves you from inspiration to finished asset in four stages:

Explore

Discover community creations to find inspiration, study effective prompts, or pull references directly into your own work.

Generate

Use the prompt box to describe what you want to create, select a model, fine-tune your settings, and bring your idea to life.

History

Review your generations in the History tab to iterate and enhance. Recreate variations, reuse prompts, and apply enhancements like upscaling and lip-syncing.

Export

Download finished assets in various formats or send them directly to Studio to use in your projects.

Explore

The Explore tab displays a gallery of community creations for discovering inspiration and finding visuals to use as references.

Search: Use the search bar to find images and videos based on keywords.

Sort: Toggle between Trending and Newest to see what’s popular or recently added.

Drag-and-drop: Pull any result from the grid directly into the prompt box to use as a start frame, end frame, or style reference.

Preview details: Click any tile to see the full prompt and settings used to create it.

Generate


The prompt box is anchored at the bottom of the page and provides all controls for creating visual content.

Set mode and prompt

Select mode: Use the toggle in the upper right corner to switch between Image and Video generation.

Write your prompt: In the main field, describe what you want to generate using natural language. Be clear and descriptive for best results.

Choose models and settings


Select model: Open the model menu to browse available options like OpenAI Sora 2 Pro, Google Veo 3.1, Kling 2.5, or Flux 1 Kontext Pro. Each model has unique strengths and capabilities listed for easy comparison. See the Models section for detailed information.

Adjust settings: Fine-tune your generation with settings that appear below the prompt. These vary by model but often include:

  • Aspect Ratio: Control the dimensions of your output
  • Resolution: Set the quality level
  • Duration: Specify video length (for video mode)
  • Number of Generations: Create up to 4 variations at once

Use controls: On supported models, enable Audio, add a Negative Prompt to exclude unwanted elements, or adjust Sound Control.
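To make the settings above concrete, here is a minimal sketch of how they could be represented and sanity-checked. `GenerationSettings` and `validate` are hypothetical names used only for illustration and are not part of any ElevenLabs API. The one hard limit taken from this guide is the maximum of 4 generations per request.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_GENERATIONS = 4  # from the guide: "Create up to 4 variations at once"


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical record of the settings shown below the prompt box."""
    mode: str                          # "image" or "video"
    aspect_ratio: str                  # e.g. "16:9"
    resolution: Optional[str] = None   # e.g. "1080p"; not every model exposes it
    duration_s: Optional[int] = None   # video mode only
    num_generations: int = 1


def validate(settings: GenerationSettings) -> List[str]:
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []
    if settings.mode not in ("image", "video"):
        problems.append(f"unknown mode: {settings.mode!r}")
    if not 1 <= settings.num_generations <= MAX_GENERATIONS:
        problems.append(f"num_generations must be between 1 and {MAX_GENERATIONS}")
    if settings.mode == "image" and settings.duration_s is not None:
        problems.append("duration applies to video mode only")
    if settings.mode == "video" and settings.duration_s is None:
        problems.append("video mode needs a duration")
    return problems
```

Which aspect ratios, resolutions, and durations are valid depends on the selected model, so this sketch deliberately checks only the rules that hold across models.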

Add references


For greater control over output, add visual references to guide generation. Availability depends on the selected model. Supported image formats include JPG, PNG, and WEBP, among others.

Start Frame (Video): Sets the opening image of your video.

End Frame (Video): Sets the final image, influencing the transition.

Image Refs (Image or Video): Provide one or more images to guide overall style and look.

Drag and drop items directly from the Explore or History tabs into reference slots for a faster workflow.
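The slot rules above can be summarized as a small lookup. This is an illustrative sketch only: `REFERENCE_SLOTS` and `slots_for` are hypothetical names, not part of any ElevenLabs API, and actual availability also depends on the selected model.

```python
# Which generation modes each reference slot applies to, per the list above.
REFERENCE_SLOTS = {
    "start_frame": {"video"},
    "end_frame": {"video"},
    "image_refs": {"image", "video"},
}


def slots_for(mode: str) -> list:
    """Return the reference slots that apply to a mode, in alphabetical order."""
    return sorted(name for name, modes in REFERENCE_SLOTS.items() if mode in modes)
```

Under these rules, Image mode exposes only image references, while Video mode can also take a start frame and an end frame.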

Generate

Before generating, a cost indicator shows the total cost for the number of assets you’ve chosen to create. When ready, click Generate. Your new creations will appear in the History tab.
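As a rough way to reason about the indicator, the sketch below assumes the total is the per-asset cost multiplied by the number of assets requested. That linear rule and the function name `estimate_total_credits` are assumptions made for illustration. The indicator in the interface remains the authoritative figure.

```python
def estimate_total_credits(credits_per_asset: int, num_generations: int) -> int:
    """Assumed total: per-asset cost times the number of assets (1 to 4)."""
    if credits_per_asset < 0:
        raise ValueError("credits_per_asset must not be negative")
    if not 1 <= num_generations <= 4:
        raise ValueError("num_generations must be between 1 and 4")
    return credits_per_asset * num_generations
```

For example, three assets at a hypothetical 4,000 credits each would come to 12,000 credits under this assumption.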

History


The History tab provides a chronological log of everything you’ve generated and serves as a workspace for refining previous work.

Browse: View all past images and videos.

Inspect: Click any asset to see the original prompt, model, and settings used to create it.

Reuse: Drag items from History back into the prompt box to use as references for new generations.

Iterate: Click Recreate to run the same prompt and settings again for a new variation, or adjust settings to guide generation in a new direction.

Share: Click Share to generate a unique link for your asset. Send it to teammates and collaborators for feedback.

Export: Download your asset as a standalone file or click Edit in Studio to import it directly into Studio.

Export

Once you have a generation you’re satisfied with, use built-in enhancement tools before exporting.

Enhancing your creations

Upscale: Use Topaz Upscale to increase resolution by up to 4x while preserving sharp details.

LipSync: Apply realistic lip-syncing to your visuals:

  • Omnihuman 1.5: Animate a static image with an audio track
  • Veed LipSync: Dub an existing video with new audio

Exporting your assets


Export finished assets by downloading them locally or sending them directly to Studio.

Edit in Studio: Import the asset directly into a Studio project.

Download: Save the asset to your local machine.

Supported download formats

Video:

  • MP4: Codecs H.264, H.265. Quality up to 4K (with upscaling)

Image:

  • PNG: High-resolution, lossless output

Models

Image & Video provides access to specialized models optimized for different use cases. Each model offers unique capabilities, from rapid iteration to production-ready quality.

Post-processing models operate on existing media: either a previously generated output or an image or video file that you upload.

The most advanced, highest-fidelity video model available, built for cinematic results.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Highest-fidelity, professional-grade output with synced audio
  • Precise multi-shot control
  • Excels at complex motion and prompt adherence
  • Fixed durations: 4s, 8s, and 12s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Cinematic, professional-grade video content

Cost: Starts at 12,000 credits per generation

End frames are not currently supported, and image references cannot be provided. Sound is enabled by default.
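The limits listed for this model can be expressed as data and checked before you generate. The sketch below is illustrative: `VideoModelSpec`, `FIRST_MODEL`, and `check_request` are hypothetical names rather than an ElevenLabs API, and only the values themselves (durations, resolutions, aspect ratios, unsupported inputs) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoModelSpec:
    """Hypothetical container for the limits listed for one video model."""
    durations_s: frozenset
    resolutions: frozenset
    aspect_ratios: frozenset
    supports_end_frame: bool
    supports_image_refs: bool


# Values copied from the model description above; the variable name is illustrative.
FIRST_MODEL = VideoModelSpec(
    durations_s=frozenset({4, 8, 12}),
    resolutions=frozenset({"720p", "1080p"}),
    aspect_ratios=frozenset({"16:9", "9:16"}),
    supports_end_frame=False,
    supports_image_refs=False,
)


def check_request(spec, duration_s, resolution, aspect_ratio,
                  end_frame=False, image_refs=0):
    """Return readable problems; an empty list means the request fits the spec."""
    problems = []
    if duration_s not in spec.durations_s:
        problems.append(f"duration {duration_s}s is not one of {sorted(spec.durations_s)}")
    if resolution not in spec.resolutions:
        problems.append(f"resolution {resolution} is not offered")
    if aspect_ratio not in spec.aspect_ratios:
        problems.append(f"aspect ratio {aspect_ratio} is not offered")
    if end_frame and not spec.supports_end_frame:
        problems.append("end frame is not supported")
    if image_refs and not spec.supports_image_refs:
        problems.append("image references are not supported")
    return problems
```

The same structure could hold the limits of the other video models described in this section by swapping in their listed values.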

The standard, high-speed version of OpenAI’s advanced video model, tuned for everyday content creation.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Realistic, physics-aware videos with synced audio
  • Fine scene control
  • Fixed durations: 4s, 8s, and 12s
  • Batch creation with up to 4 generations at a time
  • Strong narrative and character consistency

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Everyday content creation with realistic physics

Cost: Starts at 4,000 credits for default settings

End frames are not currently supported, and image references cannot be provided. Sound is enabled by default.

A professional-grade model for high-quality, cinematic video generation.

Generation inputs:

  • Text-to-Video
  • Start Frame
  • End Frame
  • Image References

Features:

  • Excellent quality and creative control with negative prompts
  • Fully integrated and synchronized audio
  • Realistic dialogue, lip-sync, and sound effects
  • Fixed durations: 4s, 6s, and 8s
  • Batch creation with up to 4 generations at a time
  • Dedicated sound control

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • High-quality, cinematic video generation with full creative control

Cost: Starts at 8,000 credits for default settings

Enabling or disabling sound changes the credit cost of a generation.

A balanced and versatile model for high-quality, full-HD video generation.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Excels at simulating complex motion and realistic physics
  • Accurately models fluid dynamics and expressions
  • Fixed durations: 5s and 10s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 1080p
  • Aspect ratios: 16:9, 1:1, 9:16

Ideal for:

  • Realistic physics simulations and complex motion

Cost: Starts at 3,500 credits for default settings

End frames are not currently supported, and image references cannot be provided. Sound control is not available.

A high-speed model optimized for rapid previews and generations, delivering sharper visuals with lower latency.

Generation inputs:

  • Text-to-Video
  • Start Frame
  • End Frame

Features:

  • Advanced creative control with negative prompts and dedicated sound control
  • Fixed durations: 4s, 6s, and 8s
  • Batch creation with up to 4 generations at a time
  • Accurately models real-world physics for realistic motion and interactions

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Quick iteration and A/B testing visuals
  • Fast-paced social media content creation

Cost: Starts at 4,000 credits for default settings

Production-ready model delivering exceptional quality, strong physics realism, and coherent narrative audio.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Advanced integrated “narrative audio” generation that matches video tone and story
  • Granular creative control with negative prompts and dedicated sound control
  • Fixed durations: 4s, 6s, and 8s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Final renders and professional marketing content
  • Short-form storytelling

Cost: Starts at 8,000 credits for default settings

A high-speed, cost-efficient model for generating audio-backed video from text or a starting image.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Granular creative control with negative prompts and dedicated sound control
  • Fixed durations: 4s, 6s, and 8s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Rapid iteration and previews
  • Cost-effective content creation

Cost: Starts at 4,000 credits for default settings

A specialized model for creating dynamic, multi-shot sequences with large movement and action.

Generation inputs:

  • Text-to-Video
  • Start Frame
  • End Frame

Features:

  • Highly stable physics and seamless transitions between shots
  • Fixed durations: any whole number of seconds from 3s to 12s
  • Batch creation with up to 4 generations at a time
  • Maximum creative flexibility with numerous aspect ratio options

Output options:

  • Resolutions: 480p, 720p, 1080p
  • Aspect ratios: 21:9, 16:9, 4:3, 1:1, 3:4, 9:16

Ideal for:

  • Storytelling and action scenes requiring stable physics

Cost: Starts at 4,800 credits for default settings

Aspect ratio and resolution do not affect generation credits, but duration does.

A versatile model that delivers cinematic motion and high prompt fidelity from text or a starting image.

Generation inputs:

  • Text-to-Video
  • Start Frame (Image-to-Video)

Features:

  • Granular creative control with negative prompts and dedicated sound control
  • Fixed durations: 5s and 10s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 480p, 720p, 1080p
  • Aspect ratios: 16:9, 1:1, 9:16

Ideal for:

  • Cinematic content with strong prompt adherence

Cost: Starts at 2,500 credits for default settings

Generation cost varies based on selected settings.

A high-speed model for quick, high-quality image generation and editing directly from text prompts.

Features:

  • Supports multiple image references to guide generation
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: 21:9, 16:9, 5:4, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16

Ideal for:

  • Rapid image creation and iteration

Cost: Starts at 2,000 credits for default settings; varies based on number of generations

A specialized image model for generating multi-shot sequences or scenes with large movement and action.

Features:

  • Excels at creating images with stable physics and coherent transitions
  • Supports multiple image references to guide generation
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: auto, 16:9, 4:3, 1:1, 3:4, 9:16

Ideal for:

  • Action scenes and dynamic compositions

Cost: Starts at 1,200 credits for default settings; varies based on number of generations

A professional model for advanced image generation and editing, offering strong scene coherence and style control.

Features:

  • Image-based style control requiring a reference image to guide visual aesthetic
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: 21:9, 16:9, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16, 9:21

Ideal for:

  • Professional content with precise style requirements

Cost: Starts at 1,600 credits; varies based on settings and number of generations

An image model with strong prompt fidelity and motion awareness, ideal for capturing dynamic action in a still frame.

Features:

  • Granular control with negative prompts
  • Supports multiple image references to guide generation
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: 16:9, 4:3, 1:1, 3:4, 9:16

Ideal for:

  • Dynamic still images with motion awareness

Cost: Starts at 2,000 credits; varies based on settings

A versatile model for precise, high-quality image creation and detailed editing guided by natural language prompts.

Features:

  • Supports multiple image references to guide generation
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: 3:2, 1:1, 2:3
  • Quality options: low, medium, high

Ideal for:

  • Creating and editing images with precise, text-based control

Cost: Starts at 2,400 credits for default settings; varies based on settings and number of generations

A dedicated utility model for generating exceptionally realistic, humanlike lip-sync.

Inputs:

  • Static source image
  • Speech audio file

Features:

  • Animates the mouth on the source image to match provided audio
  • Creates high-fidelity “talking” video from still images
  • Lip-sync specific tool, not a full video generation model

Ideal for:

  • Creating talking avatars
  • Adding dialogue to still images
  • Professional dubbing workflows

Cost: Depends on generation input

For best results, the image should contain a detectable figure.

A fast, affordable, and precise utility model for applying realistic lip-sync to videos.

Inputs:

  • Source video
  • New speech audio file

Features:

  • Re-animates mouth movements in source video to match new audio
  • Video-to-video lip-sync tool, not a full video generator

Ideal for:

  • High-volume, cost-effective dubbing
  • Translating content
  • Correcting audio in video clips with realistic results

Cost: Depends on generation input

For best results, the video should contain a detectable figure.

A dedicated utility model for image and video upscaling, designed to enhance resolution and detail up to 4x.

Features:

  • Enhancement tool that processes existing media
  • Increases media size while preserving natural textures and minimizing artifacts
  • Highly granular upscale factors: 1x, 1.25x, 1.5x, 1.75x, 2x, 3x, 4x
  • Video-specific: Flexible frame rate control (keep source or convert to 24, 25, 30, 48, 50, or 60 fps)
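To see what a given upscale factor means for output size, the sketch below multiplies both dimensions by the factor. It assumes the factor applies linearly to width and height and that results are rounded to the nearest pixel. Both are assumptions for illustration, and the names `UPSCALE_FACTORS`, `TARGET_FPS`, and `upscaled_size` are hypothetical.

```python
# Factors and frame rates as listed above.
UPSCALE_FACTORS = (1.0, 1.25, 1.5, 1.75, 2.0, 3.0, 4.0)
TARGET_FPS = (24, 25, 30, 48, 50, 60)


def upscaled_size(width: int, height: int, factor: float) -> tuple:
    """Return (width, height) after scaling both dimensions by a listed factor."""
    if factor not in UPSCALE_FACTORS:
        raise ValueError(f"factor must be one of {UPSCALE_FACTORS}")
    return round(width * factor), round(height * factor)
```

Under these assumptions, a 1920x1080 source upscaled by 2x yields 3840x2160, and a 1280x720 source upscaled by 1.5x yields 1920x1080.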

Ideal for:

  • Improving quality of generated media
  • Restoring legacy footage or photos
  • Preparing assets for high-resolution displays

Cost: Depends on generation input