Pyramidal Flow Matching AI: Text to Video, Image to Video, Video Generator


What is Pyramid Flow?

Video generation is a complex task that involves modeling a vast spatiotemporal space, which typically demands significant computational resources and data usage.

To address this, many existing approaches use a cascaded architecture to avoid direct training with full resolution. While this reduces computational demands, it often leads to separate optimization of each sub-stage, which can hinder knowledge sharing and limit flexibility.

Pyramid Flow AI Video Generation

In this work, the authors introduce a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, in which only the final stage operates at full resolution.

This design enables more efficient video generative modeling: by interlinking the flows of the different pyramid stages, the method maintains continuity throughout the generation process.

Additionally, the framework incorporates autoregressive video generation with a temporal pyramid that compresses the full-resolution history. The whole pipeline is optimized end-to-end with a single unified Diffusion Transformer (DiT). Extensive experiments show that the method can generate high-quality 5-second (and up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours.
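To make the idea concrete, here is a minimal, purely illustrative sketch of a coarse-to-fine generation loop in which only the final stage runs at full resolution and each stage is initialized from the upsampled output of the previous one. The denoise_step placeholder stands in for the actual flow-matching update predicted by the DiT; this is not the authors' implementation, just a conceptual picture.

import torch
import torch.nn.functional as F

def denoise_step(x, t):
    # Placeholder for one flow-matching update; in Pyramid Flow this
    # prediction comes from a single unified Diffusion Transformer (DiT).
    return x

def pyramidal_generate(full_h=768, full_w=1280, scales=(4, 2, 1), steps_per_stage=20):
    # Start from noise at the coarsest pyramid stage.
    x = torch.randn(1, 3, full_h // scales[0], full_w // scales[0])
    for scale in scales:
        h, w = full_h // scale, full_w // scale
        if x.shape[-2:] != (h, w):
            # Interlink the stages: upsample the previous stage's result (plus a
            # little fresh noise) so the trajectory stays continuous rather than
            # restarting from scratch at each resolution.
            x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
            x = x + 0.1 * torch.randn_like(x)
        for t in range(steps_per_stage):
            x = denoise_step(x, t)
    return x   # only this final stage operates at full resolution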

Pyramid Flow Overview

Pyramid Flow (official repository: https://github.com/jy0205/Pyramid-Flow) is a training-efficient autoregressive video generation method based on Flow Matching. Trained only on open-source datasets, it can generate high-quality 10-second videos at 768p resolution and 24 FPS, and it naturally supports image-to-video generation.

Pyramidal Flow Video Generator Overview

  • Name: Pyramidal Flow Matching for Efficient Video Generative Modeling
  • Purpose: Text to Video, AI Video Generator
  • GitHub Page: Pyramidal Flow GitHub (https://github.com/jy0205/Pyramid-Flow)
  • Official Paper: Pyramidal Flow Paper on arXiv
  • Official Hugging Face: pyramid-flow-sd3

Key Features:

  • 10s, 768p, 24fps: High-quality video generation.
  • 5s, 768p, 24fps: Shorter video generation with the same quality.
  • Image-to-video: Seamless transition from images to videos.

Pyramid Flow VBench comparison

Recent Updates:

  • 2024.11.13: Release of the 768p miniFLUX checkpoint (up to 10s).
  • 2024.10.29: Release of training code for VAE, finetuning code for DiT, and new model checkpoints with FLUX structure.
  • 2024.10.13: Support for multi-GPU inference and CPU offloading.
  • 2024.10.11: Hugging Face demo available.
  • 2024.10.10: Release of the technical report, project page, and model checkpoint.

Table of Contents

  1. Introduction
  2. Installation
  3. Inference
    • Quick Start with Gradio
    • Inference Code
    • Multi-GPU Inference
    • Usage Tips
  4. Training
    • Training VAE
    • Finetuning DiT

Installation

To set up the environment, I recommend using conda. The codebase currently targets Python 3.8.10 and PyTorch 2.1.2, and the maintainers are working to support a wider range of versions.

git clone https://github.com/jy0205/Pyramid-Flow
cd Pyramid-Flow

# Create environment using conda
conda create -n pyramid python==3.8.10
conda activate pyramid
pip install -r requirements.txt

Next, download the model from Hugging Face. There are two variants: miniFLUX and SD3. The miniFLUX models support 1024p image generation as well as 384p and 768p video generation, while the SD3-based models support 768p and 384p video generation. The 384p checkpoint generates 5-second videos at 24 FPS, and the 768p checkpoint generates videos up to 10 seconds long at 24 FPS.

from huggingface_hub import snapshot_download

model_path = 'PATH'   # The local directory to save downloaded checkpoint
snapshot_download("rain1011/pyramid-flow-miniflux", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')
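The snippet above fetches the miniFLUX weights. If you would rather use the SD3-based checkpoint listed in the overview table, the same call should work with that repository id (sketch below; only the repo name changes):

from huggingface_hub import snapshot_download

model_path = 'PATH'   # The local directory to save the downloaded checkpoint
snapshot_download("rain1011/pyramid-flow-sd3", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')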

Inference

1. Quick Start with Gradio

To get started, first install Gradio, set your model path in app.py (the line marked #L36 in the repository), and then run the following command on your local machine:

python app.py

The Gradio demo will open in your browser. Alternatively, you can try it out on Hugging Face Space, which is limited to generating 25 frames due to GPU constraints.
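Assuming Gradio is not already in your environment, the whole local quick start boils down to this:

pip install gradio
# Point model_path in app.py (the line referenced as #L36 above) to your downloaded checkpoint
python app.py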

Quick Start on Google Colab

To quickly try out Pyramid Flow on Google Colab, run the following code:

# Setup
!git clone https://github.com/jy0205/Pyramid-Flow
%cd Pyramid-Flow
!pip install -r requirements.txt
!pip install gradio

# Download miniFLUX
from huggingface_hub import snapshot_download

model_path = '/content/Pyramid-Flow'
snapshot_download("rain1011/pyramid-flow-miniflux", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')

# Start
!python app.py

2. Inference Code

To use the model, follow the inference code in video_generation_demo.ipynb. I recommend the latest published pyramid-miniflux model, which shows significant improvements in human structure and motion stability. Here’s a simplified three-step procedure:

  1. Load the downloaded model:
import torch
from PIL import Image
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import load_image, export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16 (fp16 is not supported yet)

model = PyramidDiTForVideoGeneration(
    'PATH',                                         # The downloaded checkpoint dir
    model_name="pyramid_flux",
    model_dtype=model_dtype,
    model_variant='diffusion_transformer_768p',
)

model.vae.enable_tiling()
model.enable_sequential_cpu_offload()
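If you downloaded the 384p checkpoint instead, loading looks the same apart from the variant name. The folder name below is an assumption based on the 768p naming scheme, so check your checkpoint directory:

model = PyramidDiTForVideoGeneration(
    'PATH',                                         # The downloaded checkpoint dir
    model_name="pyramid_flux",
    model_dtype=model_dtype,
    model_variant='diffusion_transformer_384p',     # assumed folder name for the 384p variant
)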
  2. Text-to-video generation:
prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

# used for 768p model variant
width = 1280
height = 768

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=height,     
        width=width,
        temp=16,                    # temp=16: 5s, temp=31: 10s
        guidance_scale=7.0,         # The guidance for the first frame, set it to 7 for 384p variant
        video_guidance_scale=5.0,   # The guidance for the other video latent
        output_type="pil",
        save_memory=True,           # If you have enough GPU memory, set it to `False` to improve vae decoding speed
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
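For the 384p checkpoint, the generate call stays the same and only the resolution and first-frame guidance change. The 640x384 size is an assumption about that variant's native resolution; the guidance value follows the usage tips below.

# used for the 384p model variant (assumed settings)
width = 640
height = 384
guidance_scale = 7.0   # 7 for the 384p checkpoint, per the usage tips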
  3. Image-to-video generation:
# used for 768p model variant
width = 1280
height = 768

image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((width, height))
prompt = "FPV flying over the Great Wall"

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate_i2v(
        prompt=prompt,
        input_image=image,
        num_inference_steps=[10, 10, 10],
        temp=16,
        video_guidance_scale=4.0,
        output_type="pil",
        save_memory=True,           # If you have enough GPU memory, set it to `False` to improve vae decoding speed
    )

export_to_video(frames, "./image_to_video_sample.mp4", fps=24)

3. Multi-GPU Inference

For users with multiple GPUs, we provide an inference script that uses sequence parallelism to save memory on each GPU. This also brings a significant speedup. For example, generating a 5s, 768p, 24fps video takes only 2.5 minutes on 4 A100 GPUs, compared to 5.5 minutes on a single A100 GPU. Run the following command to use 2 GPUs:

CUDA_VISIBLE_DEVICES=0,1 sh scripts/inference_multigpu.sh

4. Usage Tips

  • Guidance Scale: Controls visual quality. Use a value within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint.
  • Video Guidance Scale: Controls motion. A larger value increases dynamics, while a smaller value stabilizes the video.
  • 10-second Video Generation: Use a guidance scale of 7 and a video guidance scale of 5.
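
Putting these tips together, a 10-second text-to-video run with the 768p checkpoint looks roughly like the earlier example, with temp raised to 31 (the value the inline comment maps to 10 seconds) and the guidance scales set as above:

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,
        width=1280,
        temp=31,                    # temp=31: ~10 seconds at 24 FPS
        guidance_scale=7.0,         # 7 for 10-second generation
        video_guidance_scale=5.0,   # 5 for 10-second generation
        output_type="pil",
        save_memory=True,
    )

export_to_video(frames, "./text_to_video_10s.mp4", fps=24)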

Pyramid Flow VBench comparison

Training

1. Training VAE

Training the VAE requires at least 8 A100 GPUs. The VAE is a MAGVIT-v2-like continuous 3D VAE, which is flexible and can be used to build your own video generative model.

2. Finetuning DiT

Finetuning DiT also requires at least 8 A100 GPUs. We provide instructions for both autoregressive and non-autoregressive versions of Pyramid Flow. The autoregressive version is more research-oriented, while the non-autoregressive version is more stable but less efficient without the temporal pyramid.

