Autoregressive Video Model

A simple implementation to understand how autoregressive models work for video generation.

Overview

This codebase demonstrates the core concepts of autoregressive video modeling in four small, readable modules.

Key Components

1. Model Architecture (model.py): the autoregressive model that predicts video patches token by token
2. Dataset (dataset.py): synthetic video clips with simple motion patterns for training
3. Training (train.py): the training loop, checkpointing, and TensorBoard logging
4. Inference (inference.py): frame-by-frame video generation from a trained checkpoint

Quick Start

# Run demo (trains for 10 epochs on synthetic data)
python example.py --mode demo

# Full training
python example.py --mode train

# Generate videos from checkpoint
python example.py --mode generate --checkpoint ./checkpoints/best_model.pt

Understanding the Code

How Autoregressive Video Generation Works

  1. Input Processing:
    • Video frames are divided into patches (e.g., 8x8 pixels)
    • Each patch becomes a token in the sequence
    • Frames are flattened into a long sequence of patches
  2. Autoregressive Prediction:
    • Given frames 1 to t-1, predict frame t
    • Causal mask ensures no “looking ahead”
    • Model learns temporal patterns in video
  3. Generation Process:
    • Start with a few initial frames
    • Predict next frame patch by patch
    • Add predicted frame to context
    • Repeat for the desired length (a minimal code sketch of all three steps follows this list)
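
Below is a minimal, self-contained sketch of the three steps above. It is an illustration, not the actual code in model.py or inference.py: names like patchify, TinyVideoTransformer, and generate, the 8x8 patch size, the toy tensor shapes, and the choice to regress patch values directly are assumptions made for brevity.

import torch
import torch.nn as nn

def patchify(video, patch_size=8):
    # Split each frame into non-overlapping patches and flatten everything into one token sequence.
    # video: (B, T, C, H, W) -> tokens: (B, T * patches_per_frame, C * patch_size * patch_size)
    B, T, C, H, W = video.shape
    p = patch_size
    patches = video.unfold(3, p, p).unfold(4, p, p)      # (B, T, C, H//p, W//p, p, p)
    patches = patches.permute(0, 1, 3, 4, 2, 5, 6)       # put channel next to the patch pixels
    return patches.reshape(B, T * (H // p) * (W // p), C * p * p)

class TinyVideoTransformer(nn.Module):
    # Causal transformer over patch tokens: position i can only attend to positions <= i.
    def __init__(self, patch_dim, d_model=256, num_layers=4, num_heads=4, max_tokens=4096):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        self.pos = nn.Embedding(max_tokens, d_model)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, patch_dim)

    def forward(self, tokens):
        L = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(L, device=tokens.device))
        # Causal mask: -inf above the diagonal so no position can "look ahead".
        mask = torch.triu(torch.full((L, L), float("-inf"), device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)                              # prediction of the next patch at every position

@torch.no_grad()
def generate(model, context_tokens, num_new_tokens):
    # Start from the patches of a few initial frames, predict one patch, append it, repeat.
    tokens = context_tokens
    for _ in range(num_new_tokens):
        next_patch = model(tokens)[:, -1:, :]            # prediction for the next patch only
        tokens = torch.cat([tokens, next_patch], dim=1)
    return tokens

# Usage: 4 frames of 3x32x32 video -> 4 * (32/8) * (32/8) = 64 patch tokens of dimension 192
video = torch.rand(1, 4, 3, 32, 32)
tokens = patchify(video)                                 # (1, 64, 192)
model = TinyVideoTransformer(patch_dim=192)
generated = generate(model, tokens, num_new_tokens=16)   # 16 more patches = one extra frame

This sketch regresses continuous patch values; a variant that samples with a temperature (see Experiments to Try) would instead predict a distribution over a discrete patch vocabulary.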

Key Concepts Illustrated

  • Patch-based tokenization of video frames
  • Causal masking so the model never sees future patches
  • Frame-by-frame autoregressive generation from a short context
  • Temperature-controlled sampling during generation

Experiments to Try

  1. Architecture Changes:
    • Vary model size (d_model, num_layers)
    • Change patch size (smaller = more detail)
    • Adjust sequence length (max_frames)
  2. Training Variations:
    • Different motion patterns in dataset
    • Temperature sampling during generation (see the sampling sketch after this list)
    • Various learning rates and schedules
  3. Advanced Extensions:
    • Add conditional generation (e.g., text prompts)
    • Implement different positional encodings
    • Try different attention mechanisms
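
For the temperature-sampling experiment, here is a minimal sketch of how a sampling temperature changes generation. It assumes the model outputs logits over a discrete patch vocabulary (e.g., quantized pixel values); the helper sample_with_temperature is illustrative and not part of inference.py.

import torch
import torch.nn.functional as F

def sample_with_temperature(logits, temperature=1.0):
    # Lower temperature -> sharper, more repeatable samples; higher -> more diverse samples.
    if temperature <= 0:
        return logits.argmax(dim=-1)                     # greedy decoding
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Usage: logits for 2 patch positions over a 256-entry patch vocabulary
logits = torch.randn(2, 256)
print(sample_with_temperature(logits, temperature=0.8))  # a tensor of 2 sampled token ids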

Requirements

torch
numpy
matplotlib
tqdm
tensorboard
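
These can be installed with pip:

# Install dependencies
pip install torch numpy matplotlib tqdm tensorboard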

Notes