StreamingT2V: Autoregressive Long Video Generation with Smooth Transitions from Text

Table of Contents
- What is StreamingT2V?
- StreamingT2V: Long Video Generation with Smooth Transitions from Text
- StreamingT2V Text to Video Overview
- Key Components of StreamingT2V
- Method Overview
- Detailed Method Pipeline
- StreamingSVD: An Enhanced Autoregressive Method
- Key Features of StreamingSVD
- News and Updates
- Requirements
- Setup Instructions
- Inference
- Image-to-Video
- Adjusting Hyperparameters
- Future Plans
- StreamingModelscope
- Acknowledgments
What is StreamingT2V?
Text-to-video diffusion models have made it possible to generate high-quality videos that align with text instructions, enabling the creation of diverse and personalized content. However, most existing methods focus on generating short videos, typically 16 or 24 frames, which results in abrupt cuts when extended to longer video synthesis.
StreamingT2V: Long Video Generation with Smooth Transitions from Text
To address these limitations, StreamingT2V was introduced: an autoregressive approach designed for generating long videos of 80, 240, 600, 1200, or even more frames with smooth transitions. This method ensures consistency, dynamic motion, and extendability, making it a robust solution for long video generation.
StreamingT2V Text to Video Overview
| Detail | Description |
|---|---|
| Name | StreamingT2V Text to Video |
| Purpose | Text to Video, AI Video Generator |
| GitHub Page | StreamingT2V GitHub Pages |
| Official Paper | StreamingT2V Paper on arXiv |
| Official HuggingFace | StreamingT2V HuggingFace |
| StreamingT2V GitHub Code | GitHub Code |
Key Components of StreamingT2V
StreamingT2V incorporates three main components to achieve its goals:
- Conditional Attention Module (CAM): This short-term memory block conditions the current video generation on features extracted from the previous chunk using an attentional mechanism. This ensures smooth transitions between chunks and maintains high motion quality throughout the video.
- Appearance Preservation Module (APM): As a long-term memory block, APM extracts high-level scene and object features from the first video chunk. It injects these features into the text cross-attentions of the video diffusion model (VDM), preventing the model from forgetting the initial scene and preserving object/scene consistency across the video.
- Randomized Blending Approach: This technique allows the application of a video enhancer autoregressively for infinitely long videos without introducing inconsistencies between chunks. It ensures that the video remains visually coherent even as it extends in length (a minimal sketch of this idea follows below).
Experiments demonstrate that StreamingT2V generates videos with a high degree of motion, outperforming competing image-to-video methods that often result in stagnation when applied autoregressively.
StreamingT2V stands out as a high-quality, seamless text-to-long video generator.
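To make the randomized blending idea more concrete, here is a minimal NumPy sketch. It is only an illustration under simplifying assumptions: it blends two decoded frame chunks at a randomly chosen split point inside their shared overlap, whereas the actual method applies the blending to the overlapping chunks during the enhancement (denoising) stage, and all names below are hypothetical.

```python
# Minimal sketch of randomized blending between two consecutive chunks.
# Chunks are toy NumPy arrays of shape (frames, H, W, C) that share `overlap` frames.
import numpy as np

def randomized_blend(prev_chunk: np.ndarray, next_chunk: np.ndarray, overlap: int,
                     rng: np.random.Generator) -> np.ndarray:
    """Concatenate two chunks, resolving the shared overlap at a random split point."""
    prev_overlap = prev_chunk[-overlap:]   # overlap frames as produced with the previous chunk
    next_overlap = next_chunk[:overlap]    # the same frames as produced with the next chunk

    # Pick a random split inside the overlap: frames before it come from the previous
    # chunk, frames after it from the next chunk. Varying the split point avoids a
    # fixed, visible seam at the chunk boundary.
    split = rng.integers(0, overlap + 1)
    blended_overlap = np.concatenate([prev_overlap[:split], next_overlap[split:]], axis=0)

    return np.concatenate([prev_chunk[:-overlap], blended_overlap, next_chunk[overlap:]], axis=0)

rng = np.random.default_rng(0)
a = np.zeros((38, 8, 8, 3))   # toy stand-ins for two enhanced chunks
b = np.ones((38, 8, 8, 3))
video = randomized_blend(a, b, overlap=12, rng=rng)
print(video.shape)  # (64, 8, 8, 3)
```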
Method Overview
The StreamingT2V pipeline consists of three stages:
- Initialization Stage: The first 16-frame chunk is synthesized using a text-to-video model.
- Streaming T2V Stage: New content for additional frames is generated autoregressively.
- Streaming Refinement Stage: The generated long video (600, 1200 frames, or more) is enhanced autoregressively using a high-resolution text-to-short-video model, combined with the randomized blending approach.
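As a rough illustration of how these three stages fit together, here is a hedged Python sketch of the overall control flow. All function names and arguments are placeholders, not the repository's actual API.

```python
# High-level sketch of the three-stage StreamingT2V pipeline described above.
# generate_first_chunk, generate_next_chunk, and enhance are hypothetical callables.
from typing import Callable, List

def streaming_t2v(prompt: str,
                  num_chunks: int,
                  generate_first_chunk: Callable,
                  generate_next_chunk: Callable,
                  enhance: Callable) -> List:
    # 1. Initialization: synthesize the first 16-frame chunk from the text prompt.
    chunks = [generate_first_chunk(prompt)]
    anchor = chunks[0]  # APM keeps long-term appearance features from the first chunk

    # 2. Streaming T2V: autoregressively extend the video, conditioning each new
    #    chunk on the previous one (CAM) and on the anchor frame (APM).
    for _ in range(num_chunks - 1):
        chunks.append(generate_next_chunk(prompt, previous=chunks[-1], anchor=anchor))

    # 3. Streaming refinement: enhance the long video chunk by chunk with a
    #    high-resolution short-video model, blending overlapping chunks as they go.
    return [enhance(c) for c in chunks]
```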
Detailed Method Pipeline
StreamingT2V extends a video diffusion model (VDM) by integrating the Conditional Attention Module (CAM) and the Appearance Preservation Module (APM):
- CAM: Conditions the VDM on the previous chunk using a frame encoder. Its attentional mechanism ensures smooth transitions between chunks and maintains high motion quality.
- APM: Extracts high-level image features from an anchor frame and injects them into the text cross-attentions of the VDM. This preserves object and scene features throughout the autoregressive video generation process.
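The following PyTorch sketch illustrates the general idea behind APM-style conditioning: features of an anchor frame are projected and attended to alongside the text tokens in a cross-attention layer. The class name, feature dimensions, and projection layers are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch: mixing anchor-frame features into the text conditioning
# that a video diffusion model's cross-attention layers attend to.
import torch
import torch.nn as nn

class AnchorConditionedCrossAttention(nn.Module):
    def __init__(self, dim: int = 320, text_dim: int = 1024, img_dim: int = 1024, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_proj = nn.Linear(text_dim, dim)  # projects text-encoder tokens
        self.img_proj = nn.Linear(img_dim, dim)    # projects anchor-frame image features

    def forward(self, video_tokens, text_tokens, anchor_tokens):
        # Concatenate text tokens with projected anchor-frame tokens, so the model
        # keeps attending to the initial scene throughout autoregressive generation.
        context = torch.cat([self.text_proj(text_tokens), self.img_proj(anchor_tokens)], dim=1)
        out, _ = self.attn(video_tokens, context, context)
        return out

layer = AnchorConditionedCrossAttention()
video_tokens = torch.randn(1, 16 * 64, 320)   # toy latent tokens for one chunk
text_tokens = torch.randn(1, 77, 1024)        # toy text-encoder output
anchor_tokens = torch.randn(1, 16, 1024)      # toy anchor-frame image features
print(layer(video_tokens, text_tokens, anchor_tokens).shape)  # torch.Size([1, 1024, 320])
```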
StreamingSVD: An Enhanced Autoregressive Method
StreamingSVD is an advanced autoregressive technique for text-to-video and image-to-video generation. It transforms SVD into a long video generator capable of producing high-quality videos with rich motion dynamics. StreamingSVD ensures temporal consistency, aligns closely with the input text or image, and maintains high frame-level image quality.
Key Features of StreamingSVD
- Temporal Consistency: Ensures smooth transitions and coherence throughout the video.
- High Motion Dynamics: Generates videos with rich motion, avoiding stagnation.
- Extendability: Successfully demonstrated with videos up to 200 frames (8 seconds), with potential for even longer durations.
StreamingSVD is part of the StreamingT2V family. Another notable implementation is StreamingModelscope, which turns Modelscope into a long-video generator capable of producing videos up to 2 minutes in length with high motion quality and no stagnation.
News and Updates
- [11/28/2024]: Memory-optimized version released!
- [08/30/2024]: Code and model released! The model weights are available on HuggingFace.
For detailed results, visit the Project Page.
Requirements
To run StreamingT2V, the following requirements must be met:
- VRAM: The default configuration requires 60 GB of VRAM for generating 200 frames. The memory-optimized version reduces this to 24 GB but operates approximately 50% slower.
- System: Tested on Linux using Python 3.9 and CUDA >= 11.8.
- FFMPEG: Ensure FFMPEG is installed.
Setup Instructions
To set up StreamingT2V, follow these steps:
- Clone the repository:

  ```
  git clone https://github.com/Picsart-AI-Research/StreamingT2V.git
  cd StreamingT2V/
  ```

- Create a virtual environment and install dependencies:

  ```
  virtualenv -p python3.9 venv
  source venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

- Ensure FFMPEG is installed on your system.
Inference
Image-to-Video
To run the entire pipeline for image-to-video generation, including video enhancement and frame interpolation, execute the following commands from the StreamingT2V folder:

```
cd code
python inference_i2v.py --input $INPUT --output $OUTPUT
```
- $INPUT: Path to an image file or folder containing images. Each image should have a 16:9 aspect ratio.
- $OUTPUT: Path to the folder where results will be stored.
Adjusting Hyperparameters
- Number of Frames: Add `--num_frames $FRAMES` to define the number of frames to generate. Default: `$FRAMES=200`.
- Randomized Blending: Add `--use_randomized_blending $RB` to enable randomized blending. Default: `$RB=False`. Recommended values for `chunk_size` and `overlap_size` are `--chunk_size 38` and `--overlap_size 12`, respectively. Note that randomized blending slows down the generation process.
- Output FPS: Add `--out_fps $FPS` to define the FPS of the output video. Default: `$FPS=24`.
- Memory Optimization: Use `--use_memopt` to enable memory optimizations for hardware with 24 GB VRAM. If using a previously cloned repository, update the environment and delete the `code/checkpoint/i2v_enhance` folder to ensure the correct version is used.
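Putting these flags together, a full invocation could look like the sketch below. It only uses the flags documented above; the input and output paths are hypothetical and should be adjusted to your setup.

```python
# Hypothetical example: assembling the documented CLI flags into one call,
# launched from the StreamingT2V repository root.
import subprocess

subprocess.run(
    [
        "python", "inference_i2v.py",
        "--input", "../my_images/scene.png",   # hypothetical 16:9 input image
        "--output", "../results",              # hypothetical output folder
        "--num_frames", "200",
        "--use_randomized_blending", "True",
        "--chunk_size", "38",
        "--overlap_size", "12",
        "--out_fps", "24",
    ],
    cwd="code",   # the script is executed from the code/ folder, as shown above
    check=True,
)
```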
Future Plans
- Release a technical report describing StreamingSVD.
- Launch StreamingSVD for text-to-video generation.
- Reduce VRAM memory requirements further.
- Introduce Motion Aware Warp Error (MAWE): A proposed metric for evaluating motion quality in generated videos.
StreamingModelscope
The code for the StreamingT2V model based on Modelscope is now available. For more details, refer to the paper.
Acknowledgments
- SVD: An image-to-video method.
- Align Your Steps: A method for optimizing sampling schedules.
- I2VGen-XL: An image-to-video method.
- EMA-VFI: A state-of-the-art video-frame interpolation method.
- Diffusers: A framework for diffusion models.