Image & Video

Generate and edit stunning images and videos from text prompts and visual references.

Overview

Image & Video enables you to create high-quality visual content from simple text descriptions or reference images. Generate static images or dynamic videos in any style, then refine them iteratively with additional prompts, upscale for high-resolution output, and even add lip-sync with audio.

This feature is currently in beta.

Key capabilities

  • Image generation: Create high-quality images from text prompts or reference images with models optimized for speed or quality
  • Video generation: Generate dynamic videos with cinematic motion, physics realism, and integrated audio
  • Iterative refinement: Refine generations with additional prompts and create variations
  • Enhancement tools: Upscale resolution by up to 4x and apply realistic lip-sync with audio
  • Multiple models: Access specialized models for different use cases, from rapid iteration to production-ready content
  • Reference support: Guide generation with start frames, end frames, and style references. Reference images can be JPG, PNG, WEBP, or other common image formats
  • Export flexibility: Download as standalone files or import directly into Studio projects

Workflow

The creation process moves you from inspiration to finished asset in four stages:

Explore: Discover community creations to find inspiration and study effective prompts.

Generate: Use the prompt box to describe what you want to create, select a model, and fine-tune settings.

Iterate and enhance: Review generations, create variations, and apply enhancements like upscaling and lip-syncing.

Export: Download finished assets or send them directly to Studio.

Supported download formats

Video:

  • MP4: H.264 or H.265 codec, at up to 4K resolution (with upscaling)

Image:

  • PNG: High-resolution, lossless output

Models

Image & Video provides access to specialized models optimized for different use cases. Each model offers unique capabilities, from rapid iteration to production-ready quality.

Post-processing models operate on existing media: either a previously generated output or an image or video file that you upload.

The most advanced, highest-fidelity video model available, built for cinematic results.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Highest-fidelity, professional-grade output with synced audio
  • Precise multi-shot control
  • Excels at complex motion and prompt adherence
  • Fixed durations: 4s, 8s, and 12s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Cinematic, professional-grade video content

Cost: Starts at 12,000 credits per generation

End frames and image references are not currently supported. Sound is enabled by default.
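As a minimal sketch of how the limits on this card combine, the Python snippet below checks a requested configuration against the listed durations, resolutions, aspect ratios, batch limit, and unsupported inputs. The `VideoSettings` class and `validate` function are hypothetical names chosen for illustration; they are not part of any published API.

```python
from dataclasses import dataclass

# Limits copied from the model card above.
ALLOWED_DURATIONS_S = {4, 8, 12}
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}
MAX_BATCH_SIZE = 4


@dataclass
class VideoSettings:
    """Hypothetical container for one generation request."""
    duration_s: int
    resolution: str
    aspect_ratio: str
    batch_size: int = 1
    has_end_frame: bool = False


def validate(settings: VideoSettings) -> list[str]:
    """Return a list of problems; an empty list means the settings fit the card."""
    problems = []
    if settings.duration_s not in ALLOWED_DURATIONS_S:
        problems.append(f"duration {settings.duration_s}s is not one of 4s, 8s, 12s")
    if settings.resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"resolution {settings.resolution} is not offered")
    if settings.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"aspect ratio {settings.aspect_ratio} is not offered")
    if not 1 <= settings.batch_size <= MAX_BATCH_SIZE:
        problems.append(f"batch size must be between 1 and {MAX_BATCH_SIZE}")
    if settings.has_end_frame:
        problems.append("end frames are not currently supported")
    return problems
```

Calling `validate(VideoSettings(8, "1080p", "16:9"))` returns an empty list, while a 6-second request would be flagged because this model only offers 4s, 8s, and 12s.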

The standard, high-speed version of OpenAI’s advanced video model, tuned for everyday content creation.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Realistic, physics-aware videos with synced audio
  • Fine scene control
  • Fixed durations: 4s, 8s, and 12s
  • Batch creation with up to 4 generations at a time
  • Strong narrative and character consistency

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Everyday content creation with realistic physics

Cost: Starts at 4,000 credits for default settings

End frames and image references are not currently supported. Sound is enabled by default.

A professional-grade model for high-quality, cinematic video generation.

Generation inputs:

  • Text-to-Video
  • Start Frame
  • End Frame
  • Image References

Features:

  • Excellent quality and creative control with negative prompts
  • Fully integrated and synchronized audio
  • Realistic dialogue, lip-sync, and sound effects
  • Fixed durations: 4s, 6s, and 8s
  • Batch creation with up to 4 generations at a time
  • Dedicated sound control

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • High-quality, cinematic video generation with full creative control

Cost: Starts at 8,000 credits for default settings

Enabling or disabling sound changes the credit cost of a generation.

A balanced and versatile model for high-quality, full-HD video generation.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Excels at simulating complex motion and realistic physics
  • Accurately models fluid dynamics and expressions
  • Fixed durations: 5s and 10s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 1080p
  • Aspect ratios: 16:9, 1:1, 9:16

Ideal for:

  • Realistic physics simulations and complex motion

Cost: Starts at 3,500 credits for default settings

End frames and image references are not currently supported. Sound control is not available.

A high-speed model optimized for rapid previews and generations, delivering sharper visuals with lower latency.

Generation inputs:

  • Text-to-Video
  • Start Frame
  • End Frame

Features:

  • Advanced creative control with negative prompts and dedicated sound control
  • Fixed durations: 4s, 6s, and 8s
  • Batch creation with up to 4 generations at a time
  • Accurately models real-world physics for realistic motion and interactions

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Quick iteration and A/B testing visuals
  • Fast-paced social media content creation

Cost: Starts at 4,000 credits for default settings

Production-ready model delivering exceptional quality, strong physics realism, and coherent narrative audio.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Advanced integrated “narrative audio” generation that matches video tone and story
  • Granular creative control with negative prompts and dedicated sound control
  • Fixed durations: 4s, 6s, and 8s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Final renders and professional marketing content
  • Short-form storytelling

Cost: Starts at 8,000 credits for default settings

A high-speed, cost-efficient model for generating audio-backed video from text or a starting image.

Generation inputs:

  • Text-to-Video
  • Start Frame

Features:

  • Granular creative control with negative prompts and dedicated sound control
  • Fixed durations: 4s, 6s, and 8s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 720p, 1080p
  • Aspect ratios: 16:9, 9:16

Ideal for:

  • Rapid iteration and previews
  • Cost-effective content creation

Cost: Starts at 4,000 credits for default settings

A specialized model for creating dynamic, multi-shot sequences with large movement and action.

Generation inputs:

  • Text-to-Video
  • Start Frame
  • End Frame

Features:

  • Highly stable physics and seamless transitions between shots
  • Fixed durations: any whole number of seconds from 3s to 12s
  • Batch creation with up to 4 generations at a time
  • Six aspect ratio options, from ultrawide 21:9 to vertical 9:16

Output options:

  • Resolutions: 480p, 720p, 1080p
  • Aspect ratios: 21:9, 16:9, 4:3, 1:1, 3:4, 9:16

Ideal for:

  • Storytelling and action scenes requiring stable physics

Cost: Starts at 4,800 credits for default settings

Aspect ratio and resolution do not affect generation credits, but duration does.
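To make the resolution and aspect ratio options on this card concrete, here is a rough Python sketch that converts a pair such as 1080p and 21:9 into pixel dimensions. It assumes the resolution label names the shorter side of the frame and that dimensions are rounded to even numbers; the source does not state either, so actual output sizes may differ. The `frame_size` helper is a hypothetical name.

```python
from fractions import Fraction

# Assumption: "480p", "720p", and "1080p" refer to the shorter side in pixels.
RESOLUTION_SHORT_SIDE = {"480p": 480, "720p": 720, "1080p": 1080}


def _nearest_even(value: Fraction) -> int:
    """Round to the nearest even integer (an assumption, common for video encoders)."""
    return 2 * round(value / 2)


def frame_size(resolution: str, aspect_ratio: str) -> tuple[int, int]:
    """Return an estimated (width, height) in pixels for the given settings."""
    short = RESOLUTION_SHORT_SIDE[resolution]
    w, h = (int(part) for part in aspect_ratio.split(":"))
    ratio = Fraction(w, h)
    if ratio >= 1:
        # Landscape or square: the height is the shorter side.
        width, height = short * ratio, Fraction(short)
    else:
        # Portrait: the width is the shorter side.
        width, height = Fraction(short), short / ratio
    return _nearest_even(width), _nearest_even(height)
```

Under these assumptions, `frame_size("1080p", "16:9")` gives `(1920, 1080)` and `frame_size("1080p", "9:16")` gives `(1080, 1920)`.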

A versatile model that delivers cinematic motion and high prompt fidelity from text or a starting image.

Generation inputs:

  • Text-to-Video
  • Start Frame (Image-to-Video)

Features:

  • Granular creative control with negative prompts and dedicated sound control
  • Fixed durations: 5s and 10s
  • Batch creation with up to 4 generations at a time

Output options:

  • Resolutions: 480p, 720p, 1080p
  • Aspect ratios: 16:9, 1:1, 9:16

Ideal for:

  • Cinematic content with strong prompt adherence

Cost: Starts at 2,500 credits for default settings

Generation cost varies based on selected settings.
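The nine video model cards above differ mainly in accepted inputs, durations, and starting cost. The following Python sketch encodes those three properties and filters on them, which may help when choosing a model. The `VideoModel` class, the `find_models` function, and the "card N" labels are hypothetical; the labels only follow the order in which the cards appear and are not the product's model names. Starting costs are the minimums quoted on each card, so they are a rough ordering rather than exact prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoModel:
    label: str            # hypothetical label, not a real model name
    inputs: frozenset     # accepted generation inputs
    durations_s: tuple    # fixed durations in seconds
    starting_credits: int


BASE = frozenset({"text", "start_frame"})
WITH_END = BASE | {"end_frame"}

# Values transcribed from the cards above, in order of appearance.
VIDEO_MODELS = [
    VideoModel("card 1", BASE, (4, 8, 12), 12_000),
    VideoModel("card 2", BASE, (4, 8, 12), 4_000),
    VideoModel("card 3", WITH_END | {"image_refs"}, (4, 6, 8), 8_000),
    VideoModel("card 4", BASE, (5, 10), 3_500),
    VideoModel("card 5", WITH_END, (4, 6, 8), 4_000),
    VideoModel("card 6", BASE, (4, 6, 8), 8_000),
    VideoModel("card 7", BASE, (4, 6, 8), 4_000),
    VideoModel("card 8", WITH_END, tuple(range(3, 13)), 4_800),
    VideoModel("card 9", BASE, (5, 10), 2_500),
]


def find_models(required_inputs=frozenset(), duration_s=None):
    """Return models that accept all required inputs, cheapest starting cost first."""
    matches = [
        m for m in VIDEO_MODELS
        if set(required_inputs) <= m.inputs
        and (duration_s is None or duration_s in m.durations_s)
    ]
    return sorted(matches, key=lambda m: m.starting_credits)
```

For example, `find_models({"end_frame"})` returns the three cards that list End Frame as an input, and `find_models({"image_refs"})` returns only the one card that lists Image References.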

A high-speed model for quick, high-quality image generation and editing directly from text prompts.

Features:

  • Supports multiple image references to guide generation
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: 21:9, 16:9, 5:4, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16

Ideal for:

  • Rapid image creation and iteration

Cost: Starts at 2,000 credits for default settings; varies based on number of generations

A specialized image model for generating multi-shot sequences or scenes with large movement and action.

Features:

  • Excels at creating images with stable physics and coherent transitions
  • Supports multiple image references to guide generation
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: auto, 16:9, 4:3, 1:1, 3:4, 9:16

Ideal for:

  • Action scenes and dynamic compositions

Cost: Starts at 1,200 credits for default settings; varies based on number of generations

A professional model for advanced image generation and editing, offering strong scene coherence and style control.

Features:

  • Image-based style control: supply a reference image to guide the visual aesthetic
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: 21:9, 16:9, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16, 9:21

Ideal for:

  • Professional content with precise style requirements

Cost: Starts at 1,600 credits; varies based on settings and number of generations

An image model with strong prompt fidelity and motion awareness, ideal for capturing dynamic action in a still frame.

Features:

  • Granular control with negative prompts
  • Supports multiple image references to guide generation
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: 16:9, 4:3, 1:1, 3:4, 9:16

Ideal for:

  • Dynamic still images with motion awareness

Cost: Starts at 2,000 credits; varies based on settings

A versatile model for precise, high-quality image creation and detailed editing guided by natural language prompts.

Features:

  • Supports multiple image references to guide generation
  • Generates up to 4 images at a time

Output options:

  • Aspect ratios: 3:2, 1:1, 2:3
  • Quality options: low, medium, high

Ideal for:

  • Creating and editing images with precise, text-based control

Cost: Starts at 2,400 credits for default settings; varies based on settings and number of generations
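The image model cards above quote a starting cost and note that cost varies with the number of generations, but they do not give the exact formula. As a rough budgeting sketch only, the Python snippet below assumes the starting cost applies per image and scales linearly with batch size; treat the results as lower bounds, since other settings can raise the price. Both function names are hypothetical.

```python
MAX_BATCH = 4  # each image model generates up to 4 images at a time


def min_batch_cost(starting_credits: int, generations: int) -> int:
    """Lower-bound cost of one batch, ASSUMING linear per-image pricing."""
    if not 1 <= generations <= MAX_BATCH:
        raise ValueError(f"generations must be between 1 and {MAX_BATCH}")
    return starting_credits * generations


def affordable_generations(balance: int, starting_credits: int) -> int:
    """Largest batch (capped at MAX_BATCH) that the balance could cover at the starting cost."""
    return min(MAX_BATCH, balance // starting_credits)
```

With a balance of 5,000 credits, this estimate allows a full batch of 4 on the model that starts at 1,200 credits, but only 2 images on the model that starts at 2,400 credits.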

A dedicated utility model for generating exceptionally realistic, humanlike lip-sync.

Inputs:

  • Static source image
  • Speech audio file

Features:

  • Animates the mouth on the source image to match provided audio
  • Creates high-fidelity “talking” video from still images
  • Image-to-video lip-sync tool, not a full video generation model

Ideal for:

  • Creating talking avatars
  • Adding dialogue to still images
  • Professional dubbing workflows

Cost: Depends on generation input

For best results, the image should contain a detectable figure.

A fast, affordable, and precise utility model for applying realistic lip-sync to videos.

Inputs:

  • Source video
  • New speech audio file

Features:

  • Re-animates mouth movements in source video to match new audio
  • Video-to-video lip-sync tool, not a full video generator

Ideal for:

  • High-volume, cost-effective dubbing
  • Translating content
  • Correcting audio in video clips with realistic results

Cost: Depends on generation input

For best results, the video should contain a detectable figure.

A dedicated utility model for image and video upscaling, designed to enhance resolution and detail by up to 4x.

Features:

  • Enhancement tool that processes existing media
  • Increases resolution while preserving natural textures and minimizing artifacts
  • Highly granular upscale factors: 1x, 1.25x, 1.5x, 1.75x, 2x, 3x, 4x
  • Video-specific: Flexible frame rate control (keep source or convert to 24, 25, 30, 48, 50, or 60 fps)

Ideal for:

  • Improving quality of generated media
  • Restoring legacy footage or photos
  • Preparing assets for high-resolution displays

Cost: Depends on generation input
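The upscale factors listed above can be turned into target dimensions with simple arithmetic. The Python sketch below assumes the factor multiplies width and height independently (rather than total pixel count) and takes 4K to mean 3840×2160; the source confirms neither, and both function names are hypothetical.

```python
# Values copied from the upscaler's feature list.
UPSCALE_FACTORS = (1, 1.25, 1.5, 1.75, 2, 3, 4)
TARGET_FRAME_RATES = (24, 25, 30, 48, 50, 60)  # or keep the source frame rate


def upscaled_size(width: int, height: int, factor: float) -> tuple[int, int]:
    """Return (width, height) after upscaling, ASSUMING the factor applies per dimension."""
    if factor not in UPSCALE_FACTORS:
        raise ValueError(f"factor must be one of {UPSCALE_FACTORS}")
    return round(width * factor), round(height * factor)


def smallest_factor_for(height: int, target_height: int):
    """Return the smallest listed factor that reaches target_height, or None if 4x is not enough."""
    for factor in UPSCALE_FACTORS:
        if height * factor >= target_height:
            return factor
    return None
```

Under these assumptions, a 1920×1080 video reaches 3840×2160 at 2x, a 720p source needs 3x to reach a height of 2160, and a 480p source cannot reach it even at 4x.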