Self-Forcing Implementation

A simple implementation of Self-Forcing for autoregressive video generation based on the paper: https://arxiv.org/abs/2506.08009

For standard autoregressive generation (no ground-truth frames are given as conditioning), Self-Forcing and Diffusion-Forcing are similar: both condition on model-generated context, so both avoid exposure bias.

But here’s where Diffusion-Forcing shines:

  1. One model, many tasks: the SAME model can
     - generate video from text (no frames given)
     - continue a video (some frames given)
     - fill in missing frames (inpainting)
     - enhance low-resolution frames (some noisy frames given)
  2. Real-world scenarios often DO have some clean frames:
     - video editing: “keep frames 1-10, regenerate 11-20”
     - video restoration: “these frames are corrupted, fix them”
     - frame interpolation: “I have every 3rd frame, fill the gaps”
  3. Robustness: because it trains with random clean/noisy boundaries, it is more robust to different starting conditions
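The random clean/noisy boundaries in point 3 come from assigning each frame its own noise level during training. A minimal sketch of that idea, with an illustrative linear noise schedule (the function name and schedule are assumptions, not the paper's API):

```python
import torch

def diffusion_forcing_batch(frames, num_noise_levels=5):
    """Sketch of Diffusion-Forcing-style noising: every frame gets an
    independent noise level, so clean/noisy boundaries fall at random.

    frames: (batch, time, dim) tensor of ground-truth latents.
    """
    b, t, _ = frames.shape
    # Random per-frame levels: level 0 means a clean conditioning frame,
    # higher levels mean a (partially) noised target frame. This is what
    # lets one model cover generation, continuation, and inpainting.
    levels = torch.randint(0, num_noise_levels, (b, t))
    # Linear schedule for illustration only: alpha = 1.0 keeps the frame clean.
    alphas = 1.0 - levels.float() / (num_noise_levels - 1)
    noise = torch.randn_like(frames)
    noisy = alphas[..., None] * frames + (1 - alphas[..., None]) * noise
    return noisy, levels, noise
```

At inference you pick the level pattern to match the task, e.g. level 0 for the frames you want to keep and the maximum level for the frames to regenerate.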

Example: imagine you’re building a video AI product. A single Diffusion-Forcing model can serve every scenario above, where a generation-only model would need separate training for each.

The advantage isn’t just avoiding exposure bias - it’s flexibility. You train once and can deploy for many different use cases.

Overview

Self-Forcing addresses the “exposure bias” problem in autoregressive models by training with self-generated context rather than ground-truth frames. This implementation demonstrates the key concepts:

  1. Autoregressive generation with KV caching - Efficient streaming generation
  2. Self-generated context during training - Reduces exposure bias
  3. Few-step diffusion - Balances quality and speed
  4. Gradient truncation - Manages computational cost
  5. Holistic sequence loss - Improves temporal coherence
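Concepts 2-4 can be combined in a single rollout loop. A minimal sketch, assuming `model` maps `(context, noisy_frame)` to a denoised frame; the class and parameter names here are illustrative, not the actual API of this repo:

```python
import torch

class SelfForcingRollout:
    """Sketch of self-forced generation with few-step denoising and
    gradient truncation. Not the repo's real trainer."""

    def __init__(self, model, num_diffusion_steps=5, truncate_every=4):
        self.model = model
        self.num_diffusion_steps = num_diffusion_steps
        self.truncate_every = truncate_every  # gradient-truncation window

    def generate_sequence(self, first_frame, num_frames):
        # Self-generated context (concept 2): every new frame conditions
        # on the model's own previous outputs, never on ground truth.
        context = [first_frame]
        for i in range(1, num_frames):
            ctx = torch.stack(context, dim=1)  # (batch, i, dim)
            x = torch.randn_like(first_frame)
            # Few-step diffusion (concept 3): a handful of denoising steps
            # instead of a full schedule.
            for _ in range(self.num_diffusion_steps):
                x = self.model(ctx, x)
            # Gradient truncation (concept 4): periodically detach old
            # frames so backprop does not traverse the entire rollout.
            if i % self.truncate_every == 0:
                context = [f.detach() for f in context]
            context.append(x)
        return torch.stack(context, dim=1)  # (batch, num_frames, dim)
```

A holistic sequence loss (concept 5) would then compare this whole generated sequence against the ground-truth clip, rather than scoring each frame in isolation.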

Files

Usage

```python
from self_forcing import SimpleDiffusionModel, SelfForcingTrainer

# Create model
model = SimpleDiffusionModel(input_dim=64, hidden_dim=128)

# Initialize trainer
trainer = SelfForcingTrainer(model, num_diffusion_steps=5)

# Train with self-forcing
trainer.train(dataloader, num_epochs=10)
```

Key Innovation

Unlike traditional autoregressive training that conditions on ground-truth frames, Self-Forcing conditions each frame on previously self-generated outputs during training, making the model more robust to its own prediction errors during inference.
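The difference comes down to what context each step sees. A hedged sketch of the contrast (the `rollout` function and dummy model are illustrative only):

```python
import torch

def rollout(model, first, num_frames, gt=None):
    """Teacher forcing vs self-forcing in one loop: pass ground-truth
    frames `gt` for teacher forcing, or leave it None so each step
    conditions on the model's own previous outputs (self-forcing)."""
    frames = [first]
    for t in range(1, num_frames):
        # Teacher forcing reads clean ground-truth context; self-forcing
        # reads self-generated context, matching inference conditions.
        ctx = gt[:, :t] if gt is not None else torch.stack(frames, dim=1)
        frames.append(model(ctx))
    return torch.stack(frames, dim=1)
```

Under teacher forcing the model never sees its own mistakes during training, so errors compound at inference; self-forcing closes that train/test gap.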