Image & Video | ElevenLabs Documentation

Overview

Image & Video enables you to create high-quality visual content from simple text descriptions or reference images. Generate static images or dynamic videos in any style, then refine them iteratively with additional prompts, upscale for high-resolution output, and even add lip-sync with audio. Export finished assets as standalone files or import them directly into Studio projects.

This feature is currently in beta.

Free plan users can only generate images. Video generation requires a paid plan.

Guide

Follow these steps to create your first visual asset:

Select your mode

Use the toggle in the upper right corner of the prompt box to choose between Image and Video generation.

Provide a prompt or reference

Describe your desired output using natural language in the prompt box. For more control, drag existing images or videos from the Explore or History tabs into the reference slots, or upload your own reference images in a wide range of file formats including JPG, PNG, WEBP, and more.

Choose a model and settings

Select the ideal generative model for your goal (e.g., OpenAI Sora 2 Pro, Google Veo 3.1, Kling 2.5, Flux 1 Kontext Pro). See the Models section for detailed information on each model. Adjust settings like aspect ratio, resolution, duration (for video), and the number of variations to generate.

Generate your asset

Click the Generate button. Your assets will be created and displayed in the History tab for review.

Enhance and refine

Use enhancement tools to perfect your media. Upscale the resolution, apply realistic LipSync with audio, or click Recreate to generate a new variation with the same settings.

Export your creation

Download the asset as a standalone file or import it directly into a Studio project.

Workflow

The creation process moves you from inspiration to finished asset in four stages:

Explore

Discover community creations to find inspiration, study effective prompts, or pull references directly into your own work.

Generate

Use the prompt box to describe what you want to create, select a model, fine-tune your settings, and bring your idea to life.

History

Review your generations in the History tab to iterate and enhance. Recreate variations, reuse prompts, and apply enhancements like upscaling and lip-syncing.

Export

Download finished assets in various formats or send them directly to Studio to use in your projects.

Explore

The Explore tab displays a gallery of community creations for discovering inspiration and finding visuals to use as references.

Search: Use the search bar to find images and videos based on keywords.

Sort: Toggle between Trending and Newest to see what’s popular or recently added.

Drag-and-drop: Pull any result from the grid directly into the prompt box to use as a start frame, end frame, or style reference.

Preview details: Click any tile to see the full prompt and settings used to create it.

Generate

The prompt box is anchored at the bottom of the page and provides all controls for creating visual content.

Set mode and prompt

Select mode: Use the toggle in the upper right corner to switch between Image and Video generation.

Write your prompt: In the main field, describe what you want to generate using natural language. Be clear and descriptive for best results.

Choose models and settings

Select model: Open the model menu to browse available options like OpenAI Sora 2 Pro, Google Veo 3.1, Kling 2.5, or Flux 1 Kontext Pro. Each model has unique strengths and capabilities listed for easy comparison. See the Models section for detailed information.

Adjust settings: Fine-tune your generation with settings that appear below the prompt. These vary by model but often include:

Aspect Ratio: Control the dimensions of your output
Resolution: Set the quality level
Duration: Specify video length (for video mode)
Number of Generations: Create up to 4 variations at once

Use controls: On supported models, enable Audio, add a Negative Prompt to exclude unwanted elements, or adjust Sound Control.
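To make the Aspect Ratio and Resolution settings concrete, here is a minimal Python sketch that turns a pair such as 16:9 and 1080p into pixel dimensions. It assumes the resolution label names the length of the shorter side, which this page does not state, and `frame_size` is a hypothetical helper for illustration only. Actual output sizes may differ by model.

```python
# Hypothetical helper, not part of any ElevenLabs API.
# Assumption: a label like "1080p" gives the shorter side in pixels.


def frame_size(aspect_ratio: str, resolution: str) -> tuple[int, int]:
    """Return (width, height) in pixels, e.g. for ("16:9", "1080p")."""
    w_part, h_part = (int(part) for part in aspect_ratio.split(":"))
    short_side = int(resolution.rstrip("p"))
    if w_part >= h_part:
        # Landscape or square: the height is the shorter side.
        return round(short_side * w_part / h_part), short_side
    # Portrait: the width is the shorter side.
    return short_side, round(short_side * h_part / w_part)


print(frame_size("16:9", "1080p"))  # (1920, 1080)
print(frame_size("9:16", "720p"))   # (720, 1280)
```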

Add references

For greater control over output, add visual references to guide generation. Availability depends on the selected model. We support a wide range of image file formats including JPG, PNG, WEBP, and more.

Start Frame (Video): Sets the opening image of your video.

End Frame (Video): Sets the final image, influencing the transition.

Image Refs (Image or Video): Provide one or more images to guide overall style and look.

Drag and drop items directly from the Explore or History tabs into reference slots for a faster workflow.

Generate

Before you generate, a cost indicator shows the total cost for the number of assets you’ve chosen to create. When you’re ready, click Generate. Your new creations appear in the History tab.
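As a rough illustration of what the cost indicator reports, the sketch below multiplies a per-generation credit cost by the number of assets requested. It assumes cost scales linearly with the number of generations, which this page does not state. `estimate_total_credits` is a hypothetical helper, not part of any ElevenLabs API, and the example cost is the Kling 2.5 starting cost from the Models section. The indicator in the prompt box remains the authoritative figure.

```python
# Hypothetical helper: approximates the batch cost shown by the cost indicator.
# Assumption: total cost = per-generation cost x number of generations.

MAX_GENERATIONS = 4  # the prompt box allows up to 4 variations at once


def estimate_total_credits(credits_per_generation: int, generations: int) -> int:
    """Return the estimated credits for a batch of generations."""
    if not 1 <= generations <= MAX_GENERATIONS:
        raise ValueError(f"generations must be between 1 and {MAX_GENERATIONS}")
    if credits_per_generation < 0:
        raise ValueError("credits_per_generation must not be negative")
    return credits_per_generation * generations


# Kling 2.5 starts at 3,500 credits for default settings.
print(estimate_total_credits(3_500, 4))  # 14000
```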

History

The History tab provides a chronological log of everything you’ve generated and serves as a workspace for refining previous work.

Browse: View all past images and videos.

Inspect: Click any asset to see the original prompt, model, and settings used to create it.

Reuse: Drag items from History back into the prompt box to use as references for new generations.

Iterate: Click Recreate to run the same prompt and settings again for a new variation, or adjust settings to guide generation in a new direction.

Share: Click Share to generate a unique link for your asset. Send it to teammates and collaborators for feedback.

Export: Download your asset as a standalone file or click Edit in Studio to import it directly into Studio.

Export

Once you have a generation you’re satisfied with, you can apply the built-in enhancement tools before exporting.

Enhancing your creations

Upscale: Use Topaz Upscale to increase resolution by up to 4x while preserving sharp details.

LipSync: Apply realistic lip-syncing to your visuals:

Omnihuman 1.5: Animate a static image with an audio track
Veed LipSync: Dub an existing video with new audio

Exporting your assets

Export finished assets by downloading them locally or sending them directly to Studio.

Edit in Studio: Import the asset directly into a Studio project.

Download: Save the asset to your local machine.

Supported download formats

Video:

MP4: Codecs H.264, H.265. Quality up to 4K (with upscaling)

Image:

PNG: High-resolution, lossless output

Models

Image & Video provides access to specialized models optimized for different use cases. Each model offers unique capabilities, from rapid iteration to production-ready quality.

Post-processing models work on existing media: either a generated output or an image or video file that you upload.

Video generative models

OpenAI Sora 2 Pro

The most advanced, high-fidelity video model available for cinematic results.

Generation inputs:

Text-to-Video
Start Frame

Features:

Highest-fidelity, professional-grade output with synced audio
Precise multi-shot control
Excels at complex motion and prompt adherence
Fixed durations: 4s, 8s, and 12s
Batch creation with up to 4 generations at a time

Output options:

Resolutions: 720p, 1080p
Aspect ratios: 16:9, 9:16

Ideal for:

Cinematic, professional-grade video content

Cost: Starts at 12,000 credits for a generation

End frames and image references are not currently supported. Sound is enabled by default.

OpenAI Sora 2

The standard, high-speed version of OpenAI’s advanced video model, tuned for everyday content creation.

Generation inputs:

Text-to-Video
Start Frame

Features:

Realistic, physics-aware videos with synced audio
Fine scene control
Fixed durations: 4s, 8s, and 12s
Batch creation with up to 4 generations at a time
Strong narrative and character consistency

Output options:

Resolutions: 720p, 1080p
Aspect ratios: 16:9, 9:16

Ideal for:

Everyday content creation with realistic physics

Cost: Starts at 4,000 credits for default settings

End frames and image references are not currently supported. Sound is enabled by default.

Google Veo 3.1

A professional-grade model for high-quality, cinematic video generation.

Generation inputs:

Text-to-Video
Start Frame
End Frame
Image References

Features:

Excellent quality and creative control with negative prompts
Fully integrated and synchronized audio
Realistic dialogue, lip-sync, and sound effects
Fixed durations: 4s, 6s, and 8s
Batch creation with up to 4 generations at a time
Dedicated sound control

Output options:

Resolutions: 720p, 1080p
Aspect ratios: 16:9, 9:16

Ideal for:

High-quality, cinematic video generation with full creative control

Cost: Starts at 8,000 credits for default settings

Enabling or disabling sound changes the credit cost of a generation.

Kling 2.5

A balanced and versatile model for high-quality, full-HD video generation.

Generation inputs:

Text-to-Video
Start Frame

Features:

Excels at simulating complex motion and realistic physics
Accurately models fluid dynamics and expressions
Fixed durations: 5s and 10s
Batch creation with up to 4 generations at a time

Output options:

Resolutions: 1080p
Aspect ratios: 16:9, 1:1, 9:16

Ideal for:

Realistic physics simulations and complex motion

Cost: Starts at 3,500 credits for default settings

End frames and image references are not currently supported. Sound control is not available.

Google Veo 3.1 Fast

A high-speed model optimized for rapid previews and generations, delivering sharper visuals with lower latency.

Generation inputs:

Text-to-Video
Start Frame
End Frame

Features:

Advanced creative control with negative prompts and dedicated sound control
Fixed durations: 4s, 6s, and 8s
Batch creation with up to 4 generations at a time
Accurately models real-world physics for realistic motion and interactions

Output options:

Resolutions: 720p, 1080p
Aspect ratios: 16:9, 9:16

Ideal for:

Quick iteration and A/B testing visuals
Fast-paced social media content creation

Cost: Starts at 4,000 credits for default settings

Google Veo 3

Production-ready model delivering exceptional quality, strong physics realism, and coherent narrative audio.

Generation inputs:

Text-to-Video
Start Frame

Features:

Advanced integrated “narrative audio” generation that matches video tone and story
Granular creative control with negative prompts and dedicated sound control
Fixed durations: 4s, 6s, and 8s
Batch creation with up to 4 generations at a time

Output options:

Resolutions: 720p, 1080p
Aspect ratios: 16:9, 9:16

Ideal for:

Final renders and professional marketing content
Short-form storytelling

Cost: Starts at 8,000 credits for default settings

Google Veo 3 Fast

A high-speed, cost-efficient model for generating audio-backed video from text or a starting image.

Generation inputs:

Text-to-Video
Start Frame

Features:

Granular creative control with negative prompts and dedicated sound control
Fixed durations: 4s, 6s, and 8s
Batch creation with up to 4 generations at a time

Output options:

Resolutions: 720p, 1080p
Aspect ratios: 16:9, 9:16

Ideal for:

Rapid iteration and previews
Cost-effective content creation

Cost: Starts at 4,000 credits for default settings

Seedance 1 Pro

A specialized model for creating dynamic, multi-shot sequences with large movement and action.

Generation inputs:

Text-to-Video
Start Frame
End Frame

Features:

Highly stable physics and seamless transitions between shots
Fixed durations: any whole number of seconds from 3s to 12s
Batch creation with up to 4 generations at a time
Maximum creative flexibility with numerous aspect ratio options

Output options:

Resolutions: 480p, 720p, 1080p
Aspect ratios: 21:9, 16:9, 4:3, 1:1, 3:4, 9:16

Ideal for:

Storytelling and action scenes requiring stable physics

Cost: Starts at 4,800 credits for default settings

Aspect ratio and resolution do not affect generation credits, but duration does.

Wan 2.5

A versatile model that delivers cinematic motion and high prompt fidelity from text or a starting image.

Generation inputs:

Text-to-Video
Start Frame (Image-to-Video)

Features:

Granular creative control with negative prompts and dedicated sound control
Fixed durations: 5s and 10s
Batch creation with up to 4 generations at a time

Output options:

Resolutions: 480p, 720p, 1080p
Aspect ratios: 16:9, 1:1, 9:16

Ideal for:

Cinematic content with strong prompt adherence

Cost: Starts at 2,500 credits for default settings

Generation cost varies based on selected settings.
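The generation inputs and fixed durations listed for the video models above can be collected into a small lookup table, which makes it easier to see which models fit a given shot. The Python sketch below is illustrative only: the `VideoModel` class, the input labels, and `find_models` are hypothetical names, not an ElevenLabs API, and the values are transcribed from the entries above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoModel:
    name: str
    inputs: frozenset[str]      # reference inputs beyond text-to-video
    durations: tuple[int, ...]  # fixed durations in seconds


START, END, REFS = "start_frame", "end_frame", "image_refs"

# Values transcribed from the video model entries above.
VIDEO_MODELS = [
    VideoModel("OpenAI Sora 2 Pro", frozenset({START}), (4, 8, 12)),
    VideoModel("OpenAI Sora 2", frozenset({START}), (4, 8, 12)),
    VideoModel("Google Veo 3.1", frozenset({START, END, REFS}), (4, 6, 8)),
    VideoModel("Kling 2.5", frozenset({START}), (5, 10)),
    VideoModel("Google Veo 3.1 Fast", frozenset({START, END}), (4, 6, 8)),
    VideoModel("Google Veo 3", frozenset({START}), (4, 6, 8)),
    VideoModel("Google Veo 3 Fast", frozenset({START}), (4, 6, 8)),
    VideoModel("Seedance 1 Pro", frozenset({START, END}), tuple(range(3, 13))),
    VideoModel("Wan 2.5", frozenset({START}), (5, 10)),
]


def find_models(required_inputs: set[str], duration: int) -> list[str]:
    """Return names of models that accept every required input at this duration."""
    return [
        model.name
        for model in VIDEO_MODELS
        if required_inputs <= model.inputs and duration in model.durations
    ]


# Which models can interpolate between a start and end frame over 8 seconds?
print(find_models({START, END}, 8))
```

With the table above, the final call lists Google Veo 3.1, Google Veo 3.1 Fast, and Seedance 1 Pro, the three models that document End Frame support.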

Image generative models

Google Nano Banana

A high-speed model for quick, high-quality image generation and editing directly from text prompts.

Features:

Supports multiple image references to guide generation
Generates up to 4 images at a time

Output options:

Aspect ratios: 21:9, 16:9, 5:4, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16

Ideal for:

Rapid image creation and iteration

Cost: Starts at 2,000 credits for default settings; varies based on number of generations

Seedream 4

A specialized image model for generating multi-shot sequences or scenes with large movement and action.

Features:

Excels at creating images with stable physics and coherent transitions
Supports multiple image references to guide generation
Generates up to 4 images at a time

Output options:

Aspect ratios: auto, 16:9, 4:3, 1:1, 3:4, 9:16

Ideal for:

Action scenes and dynamic compositions

Cost: Starts at 1,200 credits for default settings; varies based on number of generations

Flux 1 Kontext Pro

A professional model for advanced image generation and editing, offering strong scene coherence and style control.

Features:

Image-based style control requiring a reference image to guide visual aesthetic
Generates up to 4 images at a time

Output options:

Aspect ratios: 21:9, 16:9, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16, 9:21

Ideal for:

Professional content with precise style requirements

Cost: Starts at 1,600 credits; varies based on settings and number of generations

Wan 2.5

An image model with strong prompt fidelity and motion awareness, ideal for capturing dynamic action in a still frame.

Features:

Granular control with negative prompts
Supports multiple image references to guide generation
Generates up to 4 images at a time

Output options:

Aspect ratios: 16:9, 4:3, 1:1, 3:4, 9:16

Ideal for:

Dynamic still images with motion awareness

Cost: Starts at 2,000 credits; varies based on settings

OpenAI GPT Image 1

A versatile model for precise, high-quality image creation and detailed editing guided by natural language prompts.

Features:

Supports multiple image references to guide generation
Generates up to 4 images at a time

Output options:

Aspect ratios: 3:2, 1:1, 2:3
Quality options: low, medium, high

Ideal for:

Creating and editing images with precise, text-based control

Cost: Starts at 2,400 credits for default settings; varies based on settings and number of generations
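Because the supported aspect ratios differ considerably between image models, it can help to look them up by ratio rather than by model. The following Python sketch does that with a plain dictionary. The `IMAGE_MODEL_ASPECT_RATIOS` table and `models_supporting` function are hypothetical names for illustration, and the values are copied from the image model entries above.

```python
# Aspect ratios transcribed from the image model entries above.
IMAGE_MODEL_ASPECT_RATIOS = {
    "Google Nano Banana": ["21:9", "16:9", "5:4", "4:3", "3:2", "1:1", "2:3", "3:4", "4:5", "9:16"],
    "Seedream 4": ["auto", "16:9", "4:3", "1:1", "3:4", "9:16"],
    "Flux 1 Kontext Pro": ["21:9", "16:9", "4:3", "3:2", "1:1", "2:3", "3:4", "4:5", "9:16", "9:21"],
    "Wan 2.5": ["16:9", "4:3", "1:1", "3:4", "9:16"],
    "OpenAI GPT Image 1": ["3:2", "1:1", "2:3"],
}


def models_supporting(aspect_ratio: str) -> list[str]:
    """Return the image models that list the given aspect ratio."""
    return [
        name
        for name, ratios in IMAGE_MODEL_ASPECT_RATIOS.items()
        if aspect_ratio in ratios
    ]


print(models_supporting("3:2"))
print(models_supporting("9:21"))
```

Per the lists above, 1:1 is the only ratio shared by all five image models, while 9:21 appears only under Flux 1 Kontext Pro.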

Lip-sync models

Omnihuman 1.5

A dedicated utility model for generating exceptionally realistic, humanlike lip-sync.

Inputs:

Static source image
Speech audio file

Features:

Animates the mouth on the source image to match provided audio
Creates high-fidelity “talking” video from still images
Lip-sync specific tool, not a full video generation model

Ideal for:

Creating talking avatars
Adding dialogue to still images
Professional dubbing workflows

Cost: Depends on generation input

For best results, the image should contain a detectable figure.

Veed LipSync

A fast, affordable, and precise utility model for applying realistic lip-sync to videos.

Inputs:

Source video
New speech audio file

Features:

Re-animates mouth movements in source video to match new audio
Video-to-video lip-sync tool, not a full video generator

Ideal for:

High-volume, cost-effective dubbing
Translating content
Correcting audio in video clips with realistic results

Cost: Depends on generation input

For best results, the video should contain a detectable figure.
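The choice between the two lip-sync models comes down to the source media: a still image goes to Omnihuman 1.5, and an existing video goes to Veed LipSync. The Python sketch below expresses that rule using the standard library's `mimetypes` module to classify a file by its extension. `pick_lipsync_model` is a hypothetical helper, not an ElevenLabs API, and guessing the type from the file name alone is a simplifying assumption.

```python
import mimetypes


def pick_lipsync_model(source_path: str) -> str:
    """Pick a lip-sync model name based on the source file's media type."""
    mime_type, _ = mimetypes.guess_type(source_path)
    if mime_type is None:
        raise ValueError(f"cannot determine media type of {source_path!r}")
    if mime_type.startswith("image/"):
        return "Omnihuman 1.5"  # animates a static image with an audio track
    if mime_type.startswith("video/"):
        return "Veed LipSync"   # dubs an existing video with new audio
    raise ValueError(f"unsupported media type: {mime_type}")


print(pick_lipsync_model("portrait.png"))   # Omnihuman 1.5
print(pick_lipsync_model("interview.mp4"))  # Veed LipSync
```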

Upscaling model

Topaz Upscale

A dedicated utility model for image and video upscaling, designed to enhance resolution and detail up to 4x.

Features:

Enhancement tool that processes existing media
Increases media size while preserving natural textures and minimizing artifacts
Highly granular upscale factors: 1x, 1.25x, 1.5x, 1.75x, 2x, 3x, 4x
Video-specific: Flexible frame rate control (keep source or convert to 24, 25, 30, 48, 50, or 60 fps)

Ideal for:

Improving quality of generated media
Restoring legacy footage or photos
Preparing assets for high-resolution displays

Cost: Depends on generation input
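As a worked example of the upscale factors, the Python sketch below computes the output dimensions for a given source size and factor. `upscaled_size` is a hypothetical helper for illustration, and rounding to the nearest pixel is an assumption. The factor list is the one given above.

```python
# Upscale factors listed for Topaz Upscale above.
UPSCALE_FACTORS = (1, 1.25, 1.5, 1.75, 2, 3, 4)


def upscaled_size(width: int, height: int, factor: float) -> tuple[int, int]:
    """Return (width, height) after upscaling by one of the listed factors."""
    if factor not in UPSCALE_FACTORS:
        raise ValueError(f"factor must be one of {UPSCALE_FACTORS}")
    # Assumption: dimensions are rounded to the nearest whole pixel.
    return round(width * factor), round(height * factor)


print(upscaled_size(1920, 1080, 2))   # (3840, 2160)
print(upscaled_size(1280, 720, 1.5))  # (1920, 1080)
```

A 1080p source at 2x therefore reaches 3840 by 2160, which matches the 4K ceiling noted under the supported download formats.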
