Self-Forcing Implementation

A simple implementation of Self-Forcing for autoregressive video generation based on the paper: https://arxiv.org/abs/2506.08009

For standard autoregressive generation (no ground-truth frames are given as conditioning), Self-Forcing and Diffusion-Forcing are similar: both condition on model-generated context, so both avoid exposure bias.

But here’s where Diffusion-Forcing shines:

  1. One model, many tasks: the SAME model can
     - generate video from text (no frames given)
     - continue a video (some frames given)
     - fill in missing frames (inpainting)
     - enhance low-resolution frames (some noisy frames given)
  2. Real-world scenarios often DO have some clean frames:
     - video editing: “keep frames 1-10, regenerate 11-20”
     - video restoration: “these frames are corrupted, fix them”
     - frame interpolation: “I have every 3rd frame, fill the gaps”
  3. Robustness: because it trains with random clean/noisy boundaries, it is more robust to different starting conditions
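The random clean/noisy boundaries in point 3 come from assigning each frame its own noise level during training. A minimal sketch of that idea, with an illustrative linear noise schedule (the function name and schedule are assumptions, not the paper's API):

```python
import torch

def diffusion_forcing_batch(frames, num_noise_levels=5):
    """Sketch of Diffusion-Forcing-style noising: every frame gets an
    independent noise level, so clean/noisy boundaries fall at random.

    frames: (batch, time, dim) tensor of ground-truth latents.
    """
    b, t, _ = frames.shape
    # Random per-frame levels: level 0 means a clean conditioning frame,
    # higher levels mean a (partially) noised target frame. This is what
    # lets one model cover generation, continuation, and inpainting.
    levels = torch.randint(0, num_noise_levels, (b, t))
    # Linear schedule for illustration only: alpha = 1.0 keeps the frame clean.
    alphas = 1.0 - levels.float() / (num_noise_levels - 1)
    noise = torch.randn_like(frames)
    noisy = alphas[..., None] * frames + (1 - alphas[..., None]) * noise
    return noisy, levels, noise
```

At inference you pick the level pattern to match the task, e.g. level 0 for the frames you want to keep and the maximum level for the frames to regenerate.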

Example: imagine you’re building a video AI product. A single Diffusion-Forcing model can serve every scenario above, where a generation-only model would need separate training for each.

The advantage isn’t just avoiding exposure bias - it’s flexibility. You train once and can deploy for many different use cases.

Overview

Self-Forcing addresses the “exposure bias” problem in autoregressive models by training with self-generated context rather than ground-truth frames. This implementation demonstrates the key concepts:

  1. Autoregressive generation with KV caching - Efficient streaming generation
  2. Self-generated context during training - Reduces exposure bias
  3. Few-step diffusion - Balances quality and speed
  4. Gradient truncation - Manages computational cost
  5. Holistic sequence loss - Improves temporal coherence
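Concepts 2-4 can be combined in a single rollout loop. A minimal sketch, assuming `model` maps `(context, noisy_frame)` to a denoised frame; the class and parameter names here are illustrative, not the actual API of this repo:

```python
import torch

class SelfForcingRollout:
    """Sketch of self-forced generation with few-step denoising and
    gradient truncation. Not the repo's real trainer."""

    def __init__(self, model, num_diffusion_steps=5, truncate_every=4):
        self.model = model
        self.num_diffusion_steps = num_diffusion_steps
        self.truncate_every = truncate_every  # gradient-truncation window

    def generate_sequence(self, first_frame, num_frames):
        # Self-generated context (concept 2): every new frame conditions
        # on the model's own previous outputs, never on ground truth.
        context = [first_frame]
        for i in range(1, num_frames):
            ctx = torch.stack(context, dim=1)  # (batch, i, dim)
            x = torch.randn_like(first_frame)
            # Few-step diffusion (concept 3): a handful of denoising steps
            # instead of a full schedule.
            for _ in range(self.num_diffusion_steps):
                x = self.model(ctx, x)
            # Gradient truncation (concept 4): periodically detach old
            # frames so backprop does not traverse the entire rollout.
            if i % self.truncate_every == 0:
                context = [f.detach() for f in context]
            context.append(x)
        return torch.stack(context, dim=1)  # (batch, num_frames, dim)
```

A holistic sequence loss (concept 5) would then compare this whole generated sequence against the ground-truth clip, rather than scoring each frame in isolation.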

Files

Usage

```python
from self_forcing import SimpleDiffusionModel, SelfForcingTrainer

# Create model
model = SimpleDiffusionModel(input_dim=64, hidden_dim=128)

# Initialize trainer
trainer = SelfForcingTrainer(model, num_diffusion_steps=5)

# Train with self-forcing
trainer.train(dataloader, num_epochs=10)
```

Key Innovation

Unlike traditional autoregressive training that conditions on ground-truth frames, Self-Forcing conditions each frame on previously self-generated outputs during training, making the model more robust to its own prediction errors during inference.
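The difference comes down to what context each step sees. A hedged sketch of the contrast (the `rollout` function and dummy model are illustrative only):

```python
import torch

def rollout(model, first, num_frames, gt=None):
    """Teacher forcing vs self-forcing in one loop: pass ground-truth
    frames `gt` for teacher forcing, or leave it None so each step
    conditions on the model's own previous outputs (self-forcing)."""
    frames = [first]
    for t in range(1, num_frames):
        # Teacher forcing reads clean ground-truth context; self-forcing
        # reads self-generated context, matching inference conditions.
        ctx = gt[:, :t] if gt is not None else torch.stack(frames, dim=1)
        frames.append(model(ctx))
    return torch.stack(frames, dim=1)
```

Under teacher forcing the model never sees its own mistakes during training, so errors compound at inference; self-forcing closes that train/test gap.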