VideoAuteur Paper Explained
[Daily Paper Review: 14-01-25] VideoAuteur: Towards Long Narrative Video Generation
Recent advancements in video generation have enabled the creation of high-quality short video clips lasting a few seconds. However, generating long-form videos that convey coherent and informative narratives remains a significant challenge. The primary issues include:
Semantic Consistency: Maintaining logical and semantic coherence across multiple clips in a long video is difficult. This includes preserving object/character identity and ensuring that the sequence of events makes sense.
Data Quality: Existing video datasets often lack high-quality, detailed annotations that are necessary for training models to generate long narratives. Many videos are tagged with descriptions that are either too coarse or irrelevant to narrative generation.
Narrative Flow: Creating a seamless narrative flow that captures the progression of events over time is complex. This involves not just visual fidelity but also ensuring that the story being told is clear and engaging.
How VideoAuteur Solves the Problem
VideoAuteur addresses these challenges through a comprehensive approach that includes a novel dataset and an advanced video generation pipeline. Here’s how it tackles each problem:
High-Quality Dataset (CookGen):
Dataset Curation: VideoAuteur introduces CookGen, a large-scale dataset focused on cooking videos. Cooking videos are chosen because they inherently have clear, step-by-step narratives that are easier to annotate and evaluate consistently.
Annotation Quality: The dataset includes approximately 200,000 video clips, each with detailed annotations that capture the sequential actions and visual states necessary for narrative generation. This ensures that the data is rich enough to train models effectively.
Caption-Action Matching: A caption-action matching mechanism is employed to extract narrative clips that follow the strict, step-by-step process inherent in cooking tasks. This ensures that the annotations are both relevant and detailed.
Long Narrative Video Director:
Narrative Flow Generation: VideoAuteur includes a Long Narrative Video Director that generates a sequence of visual embeddings or keyframes representing the logical progression of the story. This component ensures that the generated video maintains semantic consistency across multiple clips.
Visual and Semantic Coherence: By aligning visual embeddings, the director enhances both visual and semantic coherence in the generated videos. This alignment is crucial for maintaining the integrity of the narrative over long sequences.
Auto-Regressive Pipeline:
Interleaved Image-Text Model: VideoAuteur employs an interleaved image-text auto-regressive model to generate visual states. This model integrates text and image embeddings within the video generation process, ensuring that the generated frames are semantically aligned with the narrative.
Visual-Conditioned Video Generation: The pipeline includes a visual-conditioned video generation model that uses the visual embeddings produced by the narrative director to generate the final video. This model is fine-tuned to ensure high visual fidelity and coherence.
Overview of CookGen
Purpose: CookGen addresses the lack of high-quality datasets for long narrative video generation. It provides detailed annotations (captions, actions, and visual states) to support training and evaluation of narrative video generation (NVG) models.
Domain: Cooking videos are chosen because they follow a pre-defined sequence of actions, making them ideal for learning and evaluating narrative coherence.
Scale: The dataset contains ~200,000 video clips sourced from 30,000+ raw videos, with an average clip duration of 9.5 seconds.
Annotations: Each clip is annotated with dense captions (average of 763.8 words per video) and action descriptions, ensuring rich semantic information for training.
Dataset Creation Pipeline
The creation of CookGen involves several steps to ensure high-quality annotations and narrative consistency:
Step 1: Video Preprocessing
Filtering: Videos are filtered to remove logos, watermarks, and low-quality content.
Cropping: Clips are extracted to focus on relevant cooking actions.
Step 2: Captioning
Caption Generation: A video captioner is trained using a Vision-Language Model (VLM) inspired by LLaVA-Hound.
Data Collection: Training captions are collected with GPT-4o, focusing on:
Object attributes (e.g., ingredients, tools).
Subject-object interactions (e.g., chopping vegetables).
Temporal dynamics (e.g., sequence of actions).
Fine-Tuning: The captioner is fine-tuned from LLaVA-NeXT on these captions to optimize performance.
Step 3: Action Annotation
Action Labels: Actions are annotated using ASR-based pseudo-labels from HowTo100M.
Refinement: Labels are refined using Large Language Models (LLMs) to improve accuracy and capture narrative context.
Step 4: Caption-Action Matching
Alignment: Captions and actions are matched based on time intervals using Intersection-over-Union (IoU).
A match is valid if:
The start time difference is less than 5 seconds.
The clip end time is later than the action end time.
The IoU between clip and action intervals is >0.25 (or >0.5 for stricter alignment).
Filtering: Clips without valid matches are filtered out to ensure narrative consistency.
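To make the matching rule concrete, here is a minimal Python sketch of the criteria listed above; the thresholds and function names follow this summary rather than the authors' released code, so treat them as illustrative.

```python
def interval_iou(clip, action):
    """Temporal IoU between two (start, end) intervals, in seconds."""
    inter = max(0.0, min(clip[1], action[1]) - max(clip[0], action[0]))
    union = (clip[1] - clip[0]) + (action[1] - action[0]) - inter
    return inter / union if union > 0 else 0.0

def is_valid_match(clip, action, iou_thresh=0.25):
    """Apply the three matching criteria described above."""
    starts_close = abs(clip[0] - action[0]) < 5.0            # start times within 5 s
    ends_after = clip[1] >= action[1]                         # clip does not end before the action
    overlaps_enough = interval_iou(clip, action) > iou_thresh  # sufficient temporal overlap
    return starts_close and ends_after and overlaps_enough

# Example: keep only clips that match at least one annotated action.
clips = [(16.0, 27.0), (0.0, 8.0)]
actions = [(15.0, 25.0), (41.0, 50.0)]
matched = [c for c in clips if any(is_valid_match(c, a) for a in actions)]
# matched == [(16.0, 27.0)]; the unmatched clip is filtered out.
```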
Clip Time
Definition: Clip time refers to the time interval of a specific video clip extracted from a longer video.
Example: If you have a 60-second video and extract a 10-second clip starting at the 20-second mark and ending at the 30-second mark, the clip time is 20s to 30s.
Purpose: Clips are used to break down long videos into smaller, meaningful segments that focus on specific actions or events.
Action Time
Definition: Action time refers to the time interval during which a specific action occurs in the video.
Example: In a cooking video, if the action "chopping vegetables" happens from the 15-second mark to the 25-second mark, the action time is 15s to 25s.
Purpose: Actions are annotated to describe what is happening in the video at specific times, helping to create a narrative flow.
How Clip Time and Action Time Work Together
To ensure that the captions (descriptions of what is happening) align with the actions (what is actually happening in the video), CookGen uses a process called caption-action matching. Here’s how it works:
Clip Time: A video is divided into smaller clips (e.g., 10-second segments).
Action Time: Each action in the video is annotated with a start and end time (e.g., "chopping vegetables" from 15s to 25s).
Matching: The system checks if the clip time overlaps with the action time using a metric called Intersection-over-Union (IoU).
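For the running example, a 20s–30s clip and the 15s–25s "chopping vegetables" action overlap for 5 seconds (20s–25s) over a 15-second union (15s–30s), giving an IoU of 5/15 ≈ 0.33, which clears the 0.25 matching threshold.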
Evaluation of CookGen
The quality of CookGen is evaluated from two perspectives:
1. Inverse Video Generation
Objective: Assess how well the annotated captions can reconstruct the original videos.
Method:
Generate videos using captions with and without ground-truth keyframes.
Measure reconstruction quality using Fréchet Video Distance (FVD).
Results:
With keyframes: FVD = 116.3 (high-quality reconstruction).
Without keyframes: FVD = 561.1 (reasonable alignment).
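FVD is the video analogue of FID: features are extracted from real and generated videos with a pretrained network (typically I3D), and the distance between the two feature distributions is computed as a Fréchet distance. The sketch below shows that final step; the feature-extraction stage and the paper's exact evaluation protocol are assumptions here.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to two feature sets.

    real_feats, gen_feats: arrays of shape (num_videos, feat_dim), e.g.
    activations from a pretrained I3D network (the usual FVD backbone).
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```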
2. Semantic Consistency
Objective: Evaluate the quality of captions using GPT-4o and human annotators.
Criteria:
Coverage: Captions should describe all relevant video elements.
Hallucination: Captions should not include unsupported details.
Results:
GPT-4o Score: 95.2/100 for CookGen captions.
Human Evaluation: 82.0/100, slightly better than Qwen2-VL-72B (a state-of-the-art open-source VLM).
VideoAuteur
VideoAuteur is a pipeline designed to generate long-form narrative videos that align with a given text input. It consists of two main components:
Long Narrative Video Director: Generates a sequence of visual embeddings or keyframes that capture the narrative flow.
Visual-Conditioned Video Generation: Uses these visual conditions to generate coherent video clips.
1. Long Narrative Video Director
The Long Narrative Video Director is responsible for generating a sequence of visual embeddings (or keyframes) that represent the narrative progression. It uses a Vision-Language Model (VLM) to interleave text and visual content, ensuring coherence between the narrative and the visuals.
1.1 Interleaved Image-Text Director
This component generates a sequence where text tokens and visual embeddings are interleaved. It uses an auto-regressive model to predict the next token based on the accumulated context of both text and images.
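The sketch below illustrates this interleaved loop under an assumed interface: the `director` object and its `generate_text` / `regress_visual_embedding` methods are hypothetical stand-ins, not the authors' API. At each narrative step, the accumulated text and visual context drives both the next caption and the next visual embedding.

```python
# Hypothetical interface for the interleaved director; names and shapes are
# illustrative only. At each step the model (i) generates the next caption
# auto-regressively, then (ii) regresses the visual embedding for that step's
# keyframe, with both conditioned on the full interleaved context so far.
def generate_narrative(director, action_prompts, num_steps):
    text_ctx, visual_ctx = [], []
    for step in range(num_steps):
        # (i) next caption, conditioned on all previous text + visuals
        caption_tokens = director.generate_text(
            text_context=text_ctx,
            visual_context=visual_ctx,
            prompt=action_prompts[step],
        )
        text_ctx.append(caption_tokens)

        # (ii) next visual embedding (e.g., a CLIP-space latent) regressed
        # from the same interleaved context
        visual_emb = director.regress_visual_embedding(
            text_context=text_ctx,
            visual_context=visual_ctx,
        )
        visual_ctx.append(visual_emb)
    return text_ctx, visual_ctx
```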
1.2 Language-Centric Keyframe Director
This variant uses text-only guidance: rather than regressing visual embeddings, it synthesizes each keyframe with a text-conditioned diffusion model (e.g., SDXL or FLUX.1-s), as sketched below.
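As a rough illustration, a language-centric keyframe step can be reproduced with an off-the-shelf text-to-image pipeline; the checkpoint and prompts below are placeholders, not the paper's exact configuration.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Language-centric baseline sketched with an off-the-shelf SDXL pipeline.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

step_captions = [
    "A chef chops fresh vegetables on a wooden cutting board.",
    "The chopped vegetables are tossed into a hot pan with oil.",
]

# One keyframe per narrative step, conditioned only on that step's caption.
keyframes = [pipe(caption).images[0] for caption in step_captions]
```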
2. Visual-Conditioned Video Generation
This component generates video clips based on the visual conditions (embeddings or keyframes) produced by the narrative director. Before looking at results, it is worth contrasting the two latent spaces in which the diffusion decoder can operate: CLIP-based and VAE-based.
1. CLIP-Based Diffusion
What is CLIP?
CLIP (Contrastive Language–Image Pretraining) is a model trained to understand the relationship between images and text. It encodes both modalities into a shared latent space.
CLIP embeddings are semantically rich and language-aligned, meaning they capture high-level concepts and relationships between visual and textual data.
How is CLIP used in Diffusion?
In CLIP-based diffusion, the diffusion model operates in a CLIP latent space.
The CLIP encoder converts images into embeddings that are aligned with text, enabling the model to generate images that are semantically consistent with textual descriptions.
The decoder (often a diffusion model) reconstructs images from these CLIP embeddings.
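A minimal sketch of the shared latent space, using the public CLIP checkpoint from Hugging Face as a stand-in (the paper builds on CLIP-Diffusion autoencoders such as SEED-X and EMU-2 rather than raw CLIP, so this is only illustrative):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("keyframe.jpg")  # placeholder path for a sampled keyframe
inputs = processor(text=["chopping vegetables on a cutting board"],
                   images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Image and caption live in the same space; a diffusion decoder conditioned
# on image_emb reconstructs a frame consistent with this semantic content.
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
```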
Advantages:
Language Alignment: CLIP embeddings are inherently aligned with text, making them ideal for text-to-image or text-to-video tasks.
Semantic Understanding: CLIP captures high-level semantics, which helps in generating images or videos that align with the narrative or textual input.
Interleaved Generation: CLIP embeddings allow for interleaved text-visual generation, where text and visual content are tightly integrated.
Disadvantages:
Reconstruction Quality: CLIP embeddings may not preserve fine-grained visual details as well as VAE-based approaches.
Complexity: Training and fine-tuning CLIP-based models can be computationally expensive.
2. VAE-Based Diffusion
What is a VAE?
A Variational Autoencoder (VAE) is a generative model that encodes images into a low-dimensional latent space and decodes them back into images.
VAEs are trained to minimize the reconstruction error between the original image and the decoded image.
How is VAE used in Diffusion?
In VAE-based diffusion, the diffusion model operates in the VAE latent space.
The VAE encoder compresses images into a latent representation, and the VAE decoder reconstructs images from this latent space.
The diffusion model generates or manipulates images by operating directly in this latent space.
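A minimal sketch of encoding and decoding through a diffusion-style VAE, using the publicly available Stable Diffusion VAE as a stand-in for the SDXL-VAE discussed later:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.rand(1, 3, 512, 512) * 2 - 1  # a frame scaled to [-1, 1]

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # (1, 4, 64, 64) compressed latent
    recon = vae.decode(latents).sample                # pixel-level reconstruction

# A latent diffusion model would denoise in the `latents` space and hand the
# result to vae.decode for the final frame.
```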
Advantages:
High-Quality Reconstruction: VAEs are excellent at preserving fine-grained visual details, making them ideal for tasks requiring high visual fidelity.
Efficiency: VAEs provide a compact latent space, which can be computationally efficient for training and inference.
Stability: VAEs are well-studied and provide stable training dynamics for diffusion models.
Disadvantages:
Lack of Language Alignment: VAEs are not inherently aligned with text, making them less suitable for text-to-image or text-to-video tasks without additional mechanisms (e.g., cross-attention).
Limited Semantic Understanding: VAEs focus on pixel-level reconstruction rather than high-level semantics, which can limit their ability to generate images or videos that align with complex narratives.
3. Do We Need VAEs for Diffusion?
It depends on the task:
If the goal is high-quality image reconstruction with fine-grained details, VAEs are essential. They provide a compact and efficient latent space for diffusion models to operate in.
If the goal is text-to-image or text-to-video generation, CLIP-based approaches are more suitable because of their language alignment and semantic understanding.
Combining Both:
Some models (e.g., CLIP-Diffusion) combine the strengths of both approaches:
Use CLIP embeddings for semantic alignment and language understanding.
Use VAE-like decoders (e.g., diffusion models) for high-quality image reconstruction.
This hybrid approach allows for semantically rich and visually detailed generation.
Results
1. Interleaved Narrative Director
The Interleaved Narrative Director was evaluated to understand its effectiveness in generating coherent visual embeddings and keyframes.
Key Findings:
Visual Latent Space:
CLIP-Diffusion autoencoders (e.g., SEED-X, EMU-2) outperformed VAE-based latent spaces (e.g., SDXL-VAE) in terms of visual generation quality.
Reason: CLIP embeddings are language-aligned, making them more suitable for interleaved visual generation.
Regression Loss:
Combining an MSE loss (which constrains the embedding's scale) with a cosine-similarity loss (which constrains its direction) yielded the best results.
Both scale and direction are critical for accurate latent regression; a minimal sketch of this combined objective follows this list.
Regression Task:
A chain of reasoning from actions → captions → visual states proved most effective.
This approach enhanced both training convergence and generation quality.
Interleaved vs. Language-Centric:
Interleaved methods achieved higher scores in realism and visual consistency.
Language-centric methods (e.g., SDXL, FLUX.1-s) scored higher in aesthetic quality.
Conclusion: Visual consistency is more crucial for long narrative video generation, making interleaved methods preferable.
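As a concrete reading of the regression-loss finding above, the combined objective might look like the following sketch; the loss weights and embedding dimension are assumptions, since the paper reports only that mixing the two terms works best.

```python
import torch
import torch.nn.functional as F

def embedding_regression_loss(pred, target, w_mse=1.0, w_cos=1.0):
    """Combined objective for regressing visual embeddings.

    MSE constrains the scale (magnitude) of the predicted embedding, while
    the cosine term constrains its direction. The weights are illustrative.
    """
    mse = F.mse_loss(pred, target)
    cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    return w_mse * mse + w_cos * cos

# Example with a batch of predicted vs. ground-truth CLIP-space embeddings.
pred = torch.randn(8, 1024, requires_grad=True)
target = torch.randn(8, 1024)
loss = embedding_regression_loss(pred, target)
loss.backward()
```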
2. Visual-Conditioned Video Generation
The Visual-Conditioned Video Generation component was evaluated to assess its ability to generate coherent videos based on visual embeddings.
Key Findings:
Visual Embeddings vs. Keyframes:
Conditioning on visual embeddings (generated by the interleaved director) outperformed conditioning on keyframes.
Conclusion: Visual embeddings provide better semantic alignment and video quality.
Noise Handling:
Training with noisy visual embeddings improved the model's robustness to imperfect conditions.
This ensured that the model could handle the director's regression errors at inference time; a minimal sketch of this augmentation follows this list.
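A minimal sketch of the noise augmentation, assuming simple additive Gaussian noise; the noise form and magnitude are illustrative, as the paper states only that training with noisy embeddings improves robustness.

```python
import torch

def add_condition_noise(visual_emb, sigma=0.1):
    """Perturb ground-truth visual embeddings during training.

    Exposing the video generator to noisy conditions teaches it to tolerate
    the regression errors the director makes at inference time.
    """
    return visual_emb + sigma * torch.randn_like(visual_emb)

# During training, the video model is conditioned on the perturbed embedding
# rather than the clean ground-truth one.
clean_emb = torch.randn(8, 1024)
noisy_emb = add_condition_noise(clean_emb, sigma=0.1)
```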
3. Overall Learnings
CLIP-Diffusion Latent Spaces:
CLIP-based latent spaces are more effective for interleaved visual generation due to their language alignment.
They outperform VAE-based latent spaces in both semantic understanding and visual quality.
Loss Design:
Combining MSE loss and cosine similarity loss is crucial for accurate regression of visual embeddings.
This ensures both scale and direction are preserved.
Narrative Flow:
A chain of reasoning from actions → captions → visual states enhances narrative coherence and generation quality.
Interleaved Generation:
Interleaved methods are superior for long narrative video generation due to their ability to maintain visual consistency and realism.
Visual Embeddings:
Using visual embeddings (rather than keyframes) for video generation improves semantic alignment and overall video quality.