Image & Video
Overview
Image & Video lets you create high-quality visual content from text descriptions or reference images. Generate still images or videos in any style, refine them iteratively with additional prompts, upscale them for high-resolution output, and add lip-sync driven by audio.
Key capabilities
- Image generation: Create high-quality images from text prompts or reference images with models optimized for speed or quality
- Video generation: Generate dynamic videos with cinematic motion, physics realism, and integrated audio
- Iterative refinement: Refine generations with additional prompts and create variations
- Enhancement tools: Upscale resolution by up to 4x and apply realistic lip-sync with audio
- Multiple models: Access specialized models for different use cases, from rapid iteration to production-ready content
- Reference support: Guide generation with start frames, end frames, and style references. Reference images can be JPG, PNG, WEBP, and other image formats
- Export flexibility: Download as standalone files or import directly into Studio projects
Workflow
The creation process moves you from inspiration to finished asset in four stages:
Explore: Discover community creations to find inspiration and study effective prompts.
Generate: Use the prompt box to describe what you want to create, select a model, and fine-tune settings.
Iterate and enhance: Review generations, create variations, and apply enhancements like upscaling and lip-syncing.
Export: Download finished assets or send them directly to Studio.
Supported download formats
Video:
- MP4: H.264 or H.265 codec, up to 4K resolution (with upscaling)
Image:
- PNG: High-resolution, lossless output
Models
Image & Video provides access to specialized models optimized for different use cases. Each model offers unique capabilities, from rapid iteration to production-ready quality.
Post-processing models (lip-sync and upscaling) do not generate media from scratch. They operate on existing media, which can be either a generated output or an image or video file that you upload.
Video generative models
OpenAI Sora 2 Pro
The most advanced, highest-fidelity video model available, built for cinematic results.
Generation inputs:
- Text-to-Video
- Start Frame
Features:
- Highest-fidelity, professional-grade output with synced audio
- Precise multi-shot control
- Excels at complex motion and prompt adherence
- Fixed durations: 4s, 8s, and 12s
- Batch creation with up to 4 generations at a time
Output options:
- Resolutions: 720p, 1080p
- Aspect ratios: 16:9, 9:16
Ideal for:
- Cinematic, professional-grade video content
Cost: Starts at 12,000 credits per generation
End frames and image references are not currently supported. Sound is enabled by default.
OpenAI Sora 2
The standard, high-speed version of OpenAI’s advanced video model, tuned for everyday content creation.
Generation inputs:
- Text-to-Video
- Start Frame
Features:
- Realistic, physics-aware videos with synced audio
- Fine scene control
- Fixed durations: 4s, 8s, and 12s
- Batch creation with up to 4 generations at a time
- Strong narrative and character consistency
Output options:
- Resolutions: 720p, 1080p
- Aspect ratios: 16:9, 9:16
Ideal for:
- Everyday content creation with realistic physics
Cost: Starts at 4,000 credits for default settings
End frames and image references are not currently supported. Sound is enabled by default.
Google Veo 3.1
A professional-grade model for high-quality, cinematic video generation.
Generation inputs:
- Text-to-Video
- Start Frame
- End Frame
- Image References
Features:
- Excellent quality and creative control with negative prompts
- Fully integrated and synchronized audio
- Realistic dialogue, lip-sync, and sound effects
- Fixed durations: 4s, 6s, and 8s
- Batch creation with up to 4 generations at a time
- Dedicated sound control
Output options:
- Resolutions: 720p, 1080p
- Aspect ratios: 16:9, 9:16
Ideal for:
- High-quality, cinematic video generation with full creative control
Cost: Starts at 8,000 credits for default settings
Enabling or disabling sound changes the credit cost of a generation.
Kling 2.5
A balanced and versatile model for high-quality, full-HD video generation.
Generation inputs:
- Text-to-Video
- Start Frame
Features:
- Excels at simulating complex motion and realistic physics
- Accurately models fluid dynamics and expressions
- Fixed durations: 5s and 10s
- Batch creation with up to 4 generations at a time
Output options:
- Resolutions: 1080p
- Aspect ratios: 16:9, 1:1, 9:16
Ideal for:
- Realistic physics simulations and complex motion
Cost: Starts at 3,500 credits for default settings
End frames and image references are not currently supported. Sound control is not available.
Google Veo 3.1 Fast
A high-speed model optimized for rapid previews and generations, delivering sharper visuals with lower latency.
Generation inputs:
- Text-to-Video
- Start Frame
- End Frame
Features:
- Advanced creative control with negative prompts and dedicated sound control
- Fixed durations: 4s, 6s, and 8s
- Batch creation with up to 4 generations at a time
- Accurately models real-world physics for realistic motion and interactions
Output options:
- Resolutions: 720p, 1080p
- Aspect ratios: 16:9, 9:16
Ideal for:
- Quick iteration and A/B testing visuals
- Fast-paced social media content creation
Cost: Starts at 4,000 credits for default settings
Google Veo 3
Production-ready model delivering exceptional quality, strong physics realism, and coherent narrative audio.
Generation inputs:
- Text-to-Video
- Start Frame
Features:
- Advanced integrated “narrative audio” generation that matches video tone and story
- Granular creative control with negative prompts and dedicated sound control
- Fixed durations: 4s, 6s, and 8s
- Batch creation with up to 4 generations at a time
Output options:
- Resolutions: 720p, 1080p
- Aspect ratios: 16:9, 9:16
Ideal for:
- Final renders and professional marketing content
- Short-form storytelling
Cost: Starts at 8,000 credits for default settings
Google Veo 3 Fast
A high-speed, cost-efficient model for generating audio-backed video from text or a starting image.
Generation inputs:
- Text-to-Video
- Start Frame
Features:
- Granular creative control with negative prompts and dedicated sound control
- Fixed durations: 4s, 6s, and 8s
- Batch creation with up to 4 generations at a time
Output options:
- Resolutions: 720p, 1080p
- Aspect ratios: 16:9, 9:16
Ideal for:
- Rapid iteration and previews
- Cost-effective content creation
Cost: Starts at 4,000 credits for default settings
Seedance 1 Pro
A specialized model for creating dynamic, multi-shot sequences with large movement and action.
Generation inputs:
- Text-to-Video
- Start Frame
- End Frame
Features:
- Highly stable physics and seamless transitions between shots
- Fixed durations: 3s to 12s, in 1-second increments
- Batch creation with up to 4 generations at a time
- Maximum creative flexibility with numerous aspect ratio options
Output options:
- Resolutions: 480p, 720p, 1080p
- Aspect ratios: 21:9, 16:9, 4:3, 1:1, 3:4, 9:16
Ideal for:
- Storytelling and action scenes requiring stable physics
Cost: Starts at 4,800 credits for default settings
Aspect ratio and resolution do not affect generation credits, but duration does.
Wan 2.5
A versatile model that delivers cinematic motion and high prompt fidelity from text or a starting image.
Generation inputs:
- Text-to-Video
- Start Frame (Image-to-Video)
Features:
- Granular creative control with negative prompts and dedicated sound control
- Fixed durations: 5s and 10s
- Batch creation with up to 4 generations at a time
Output options:
- Resolutions: 480p, 720p, 1080p
- Aspect ratios: 16:9, 1:1, 9:16
Ideal for:
- Cinematic content with strong prompt adherence
Cost: Starts at 2,500 credits for default settings
Generation cost varies based on selected settings.
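The video model listings above amount to a capability matrix, and it can help to see them as data when choosing a model. The following is a minimal sketch, not an official API: the `VideoModel` class, the `find_models` helper, and the input labels (`text`, `start_frame`, `end_frame`, `image_references`) are hypothetical names introduced for illustration. The values are transcribed from the listings above, and `starting_credits` is only the "Starts at" figure, so actual cost may be higher depending on settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoModel:
    name: str
    inputs: frozenset       # generation inputs listed for the model
    durations: tuple        # fixed clip lengths, in seconds
    resolutions: tuple
    aspect_ratios: tuple
    starting_credits: int   # the "Starts at" figure from the listing


# Input labels are hypothetical identifiers used only in this sketch.
TEXT, START, END, REFS = "text", "start_frame", "end_frame", "image_references"
HD = ("720p", "1080p")
ALL_RES = ("480p", "720p", "1080p")
WIDE_TALL = ("16:9", "9:16")
WITH_SQUARE = ("16:9", "1:1", "9:16")

VIDEO_MODELS = [
    VideoModel("OpenAI Sora 2 Pro", frozenset({TEXT, START}), (4, 8, 12), HD, WIDE_TALL, 12000),
    VideoModel("OpenAI Sora 2", frozenset({TEXT, START}), (4, 8, 12), HD, WIDE_TALL, 4000),
    VideoModel("Google Veo 3.1", frozenset({TEXT, START, END, REFS}), (4, 6, 8), HD, WIDE_TALL, 8000),
    VideoModel("Kling 2.5", frozenset({TEXT, START}), (5, 10), ("1080p",), WITH_SQUARE, 3500),
    VideoModel("Google Veo 3.1 Fast", frozenset({TEXT, START, END}), (4, 6, 8), HD, WIDE_TALL, 4000),
    VideoModel("Google Veo 3", frozenset({TEXT, START}), (4, 6, 8), HD, WIDE_TALL, 8000),
    VideoModel("Google Veo 3 Fast", frozenset({TEXT, START}), (4, 6, 8), HD, WIDE_TALL, 4000),
    VideoModel("Seedance 1 Pro", frozenset({TEXT, START, END}), tuple(range(3, 13)), ALL_RES,
               ("21:9", "16:9", "4:3", "1:1", "3:4", "9:16"), 4800),
    VideoModel("Wan 2.5", frozenset({TEXT, START}), (5, 10), ALL_RES, WITH_SQUARE, 2500),
]


def find_models(required_inputs=(), duration=None, resolution=None, aspect_ratio=None):
    """Return models meeting every given requirement, lowest starting cost first."""
    matches = [
        m for m in VIDEO_MODELS
        if set(required_inputs) <= m.inputs
        and (duration is None or duration in m.durations)
        and (resolution is None or resolution in m.resolutions)
        and (aspect_ratio is None or aspect_ratio in m.aspect_ratios)
    ]
    return sorted(matches, key=lambda m: m.starting_credits)


if __name__ == "__main__":
    # Which models accept an end frame?
    for model in find_models(required_inputs={END}):
        print(model.name, model.starting_credits)
```

Run as a script, this prints Google Veo 3.1 Fast, Seedance 1 Pro, and Google Veo 3.1, in that order, since those are the only listings that include End Frame as a generation input.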
Image generative models
Google Nano Banana
A high-speed model for quick, high-quality image generation and editing directly from text prompts.
Features:
- Supports multiple image references to guide generation
- Generates up to 4 images at a time
Output options:
- Aspect ratios: 21:9, 16:9, 5:4, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16
Ideal for:
- Rapid image creation and iteration
Cost: Starts at 2,000 credits for default settings; varies based on number of generations
Seedream 4
A specialized image model for generating multi-shot sequences or scenes with large movement and action.
Features:
- Excels at creating images with stable physics and coherent transitions
- Supports multiple image references to guide generation
- Generates up to 4 images at a time
Output options:
- Aspect ratios: auto, 16:9, 4:3, 1:1, 3:4, 9:16
Ideal for:
- Action scenes and dynamic compositions
Cost: Starts at 1,200 credits for default settings; varies based on number of generations
Flux 1 Kontext Pro
A professional model for advanced image generation and editing, offering strong scene coherence and style control.
Features:
- Image-based style control, which requires a reference image to guide the visual aesthetic
- Generates up to 4 images at a time
Output options:
- Aspect ratios: 21:9, 16:9, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16, 9:21
Ideal for:
- Professional content with precise style requirements
Cost: Starts at 1,600 credits; varies based on settings and number of generations
Wan 2.5
An image model with strong prompt fidelity and motion awareness, ideal for capturing dynamic action in a still frame.
Features:
- Granular control with negative prompts
- Supports multiple image references to guide generation
- Generates up to 4 images at a time
Output options:
- Aspect ratios: 16:9, 4:3, 1:1, 3:4, 9:16
Ideal for:
- Dynamic still images with motion awareness
Cost: Starts at 2,000 credits; varies based on settings
OpenAI GPT Image 1
A versatile model for precise, high-quality image creation and detailed editing guided by natural language prompts.
Features:
- Supports multiple image references to guide generation
- Generates up to 4 images at a time
Output options:
- Aspect ratios: 3:2, 1:1, 2:3
- Quality options: low, medium, high
Ideal for:
- Creating and editing images with precise, text-based control
Cost: Starts at 2,400 credits for default settings; varies based on settings and number of generations
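The starting costs listed for the image models can be turned into a rough budgeting estimate. The sketch below assumes that every generation is charged exactly the "Starts at" amount, which the listings do not guarantee (cost varies with settings and number of generations), so the result is best read as an upper bound. The function names and the dictionary are hypothetical and exist only for this illustration.

```python
# "Starts at" credit figures transcribed from the image model listings above.
IMAGE_STARTING_CREDITS = {
    "Google Nano Banana": 2000,
    "Seedream 4": 1200,
    "Flux 1 Kontext Pro": 1600,
    "Wan 2.5": 2000,
    "OpenAI GPT Image 1": 2400,
}


def max_default_generations(budget_credits: int, model: str) -> int:
    """Upper bound on generations a budget covers, assuming the starting cost each time."""
    if budget_credits < 0:
        raise ValueError("budget_credits must be non-negative")
    return budget_credits // IMAGE_STARTING_CREDITS[model]


def budget_table(budget_credits: int) -> dict:
    """Map each image model to its upper-bound generation count for the budget."""
    return {
        model: max_default_generations(budget_credits, model)
        for model in IMAGE_STARTING_CREDITS
    }


if __name__ == "__main__":
    for model, count in budget_table(10_000).items():
        print(f"{model}: at most {count} generations")
```

With a budget of 10,000 credits, this gives at most 8 generations for Seedream 4 and at most 4 for OpenAI GPT Image 1, under the stated assumption.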
Lip-sync models
Omnihuman 1.5
A dedicated utility model for generating exceptionally realistic, humanlike lip-sync.
Inputs:
- Static source image
- Speech audio file
Features:
- Animates the mouth on the source image to match provided audio
- Creates high-fidelity “talking” video from still images
- Lip-sync-specific tool, not a full video generation model
Ideal for:
- Creating talking avatars
- Adding dialogue to still images
- Professional dubbing workflows
Cost: Depends on generation input
For best results, the image should contain a detectable figure.
Veed LipSync
A fast, affordable, and precise utility model for applying realistic lip-sync to videos.
Inputs:
- Source video
- New speech audio file
Features:
- Re-animates mouth movements in source video to match new audio
- Video-to-video lip-sync tool, not a full video generator
Ideal for:
- High-volume, cost-effective dubbing
- Translating content
- Correcting audio in video clips with realistic results
Cost: Depends on generation input
For best results, the video should contain a detectable figure.
Upscaling model
Topaz Upscale
A dedicated utility model for image and video upscaling, designed to enhance resolution and detail up to 4x.
Features:
- Enhancement tool that processes existing media
- Increases media size while preserving natural textures and minimizing artifacts
- Highly granular upscale factors: 1x, 1.25x, 1.5x, 1.75x, 2x, 3x, 4x
- Video-specific: Flexible frame rate control (keep source or convert to 24, 25, 30, 48, 50, or 60 fps)
Ideal for:
- Improving quality of generated media
- Restoring legacy footage or photos
- Preparing assets for high-resolution displays
Cost: Depends on generation input
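To make the Topaz Upscale factors concrete, the following sketch computes output dimensions for a given source size and finds the smallest listed factor that reaches a target size. It assumes the factor scales width and height each by that amount, which is the usual meaning of an upscale factor but is not stated explicitly above. It also assumes standard 16:9 frame sizes (1920x1080 for 1080p, 1280x720 for 720p, 3840x2160 for 4K). The function names are hypothetical.

```python
# Upscale factors listed for Topaz Upscale.
UPSCALE_FACTORS = (1, 1.25, 1.5, 1.75, 2, 3, 4)


def upscaled_size(width: int, height: int, factor: float) -> tuple:
    """Output size, assuming the factor applies to width and height independently."""
    if factor not in UPSCALE_FACTORS:
        raise ValueError(f"unsupported factor: {factor}")
    return round(width * factor), round(height * factor)


def smallest_factor_for(width, height, target_width, target_height):
    """Smallest listed factor whose output meets the target size, or None if none does."""
    for factor in UPSCALE_FACTORS:  # factors are in ascending order
        out_w, out_h = upscaled_size(width, height, factor)
        if out_w >= target_width and out_h >= target_height:
            return factor
    return None


if __name__ == "__main__":
    print(upscaled_size(1920, 1080, 2))                 # 1080p source at 2x
    print(smallest_factor_for(1280, 720, 3840, 2160))   # 720p source, 4K target
```

Under these assumptions, a 1080p source reaches 4K (3840x2160) at 2x, and a 720p source needs 3x, which is consistent with the "up to 4K (with upscaling)" download quality noted earlier.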