KV Cache Summary
What is KV Cache?
KV (Key-Value) caching is an optimization technique for autoregressive models: the attention keys and values computed at earlier steps are stored and reused, so they never have to be recomputed at later steps.
How It Works
Without KV Cache
Step 1: Process [Frame1] → Compute K1,V1,Q1 → Generate Frame2
Step 2: Process [Frame1,Frame2] → Compute K1,V1,K2,V2,Q2 → Generate Frame3
Step 3: Process [Frame1,Frame2,Frame3] → Compute K1,V1,K2,V2,K3,V3,Q3 → Generate Frame4
Problem: K1,V1 are computed three times, K2,V2 twice, and so on; every step re-encodes the entire prefix from scratch, as the sketch below makes concrete.
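A minimal NumPy sketch of this naive loop (square projection matrices, a single head, and the function and weight names are simplifying assumptions for illustration):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (causal mask omitted for brevity;
    only the last position's output is used below)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def generate_without_cache(frames, W_q, W_k, W_v, n_steps):
    """Naive loop: re-encodes the ENTIRE prefix at every step."""
    for _ in range(n_steps):
        X = np.stack(frames)                 # [seq_len, d_model]
        Q, K, V = X @ W_q, X @ W_k, X @ W_v  # K,V recomputed for ALL positions
        out = attention(Q, K, V)
        frames.append(out[-1])               # next frame from the last position
    return frames
```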
With KV Cache
Step 1: Process [Frame1] → Compute K1,V1,Q1 → Store K1,V1 → Generate Frame2
Step 2: Process [Frame2] → Compute only K2,V2,Q2 → Use cached K1,V1 → Generate Frame3
Step 3: Process [Frame3] → Compute only K3,V3,Q3 → Use cached K1,V1,K2,V2 → Generate Frame4
Solution: Each K,V is computed only once!
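The cached counterpart, reusing the `attention` helper and illustrative weight names from the previous sketch; note that each position's K,V are computed exactly once and only ever appended to:

```python
def generate_with_cache(frames, W_q, W_k, W_v, n_steps):
    """Cached loop: K,V for every position are computed exactly once."""
    X = np.stack(frames)
    K_cache, V_cache = X @ W_k, X @ W_v        # encode the prompt once
    x = frames[-1]
    for _ in range(n_steps):
        q = (x @ W_q)[None, :]                 # query for the newest position only
        x = attention(q, K_cache, V_cache)[0]  # attends over all cached K,V
        frames.append(x)
        # extend the cache with this position's K,V; never recomputed
        K_cache = np.vstack([K_cache, (x @ W_k)[None, :]])
        V_cache = np.vstack([V_cache, (x @ W_v)[None, :]])
    return frames
```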
Key Implementation Details
- Storage Structure:
```python
cache = {
    'keys':   tensor([batch, n_heads, seq_len, d_head]),
    'values': tensor([batch, n_heads, seq_len, d_head]),
}
```
- Update Process:
```python
# Compute K,V only for the new positions
K_new = compute_keys(new_input)
V_new = compute_values(new_input)

# Concatenate with the cached tensors
K_all = concat([K_cached, K_new])
V_all = concat([V_cached, V_new])

# Update the cache
cache.update(K_new, V_new)
```
- Attention Computation:
```python
# Q is always computed fresh, for the current position only
Q = compute_queries(current_input)

# Attend over all K,V (cached + new)
attention = softmax(Q @ K_all.T) @ V_all
```
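Putting these pieces together, here is a minimal per-layer cache object (a sketch: the `KVCache` name and the single-head `[seq_len, d_head]` layout are simplifications of the `[batch, n_heads, seq_len, d_head]` structure above):

```python
import numpy as np

class KVCache:
    """Per-layer KV cache: stores keys/values and grows by one step at a time."""
    def __init__(self):
        self.keys = None       # [seq_len_so_far, d_head] once populated
        self.values = None

    def update(self, K_new, V_new):
        """Append this step's K,V and return the full (cached + new) tensors."""
        if self.keys is None:
            self.keys, self.values = K_new, V_new
        else:
            self.keys = np.concatenate([self.keys, K_new], axis=0)
            self.values = np.concatenate([self.values, V_new], axis=0)
        return self.keys, self.values
```

One such object lives in each attention layer; `update()` returns the full `K_all, V_all` ready for the attention computation above.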
Performance Impact
From our demo:
- Without cache: 300 position computations (3+4+5+…+12) × 4 layers
- With cache: 52 position computations (3+10) × 4 layers
- Speedup: 3.97x
- Compute saved: 82.7%
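Those counts can be reproduced directly, counting positions the way the demo does (a 3-frame prompt plus generated steps, times 4 layers). The measured 3.97x wall-clock speedup is lower than the 300/52 ≈ 5.8x reduction in position computations; wall-clock speedups typically lag the raw compute ratio because of per-step overhead.

```python
n_layers = 4
without_cache = sum(range(3, 13)) * n_layers  # (3+4+...+12) positions over 10 steps
with_cache = (3 + 10) * n_layers              # prompt once, then one position per step
print(without_cache, with_cache)              # 300 52
print(f"compute saved: {1 - with_cache / without_cache:.1%}")  # 82.7%
```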
Memory Cost
- Per layer:
seq_length × 2 × d_model × sizeof(float)
- Total:
n_layers × seq_length × 2 × d_model × sizeof(float)
- Example: 12 layers, 1000 positions, d_model=512, float32 (4 bytes) → ~49 MB
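As a quick sanity check of that example (the factor of 2 counts keys plus values; the helper name is illustrative):

```python
def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_float=4):
    """Total KV cache size across all layers: keys + values for every position."""
    return n_layers * seq_len * 2 * d_model * bytes_per_float

print(kv_cache_bytes(12, 1000, 512) / 1e6)   # 49.152 -> ~49 MB
```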
When to Use
✅ Use KV Cache for:
- Inference/generation (biggest benefit)
- Long sequences (more savings)
- Beam search (share cache across beams)
- Streaming generation
❌ Don’t use for:
- Training (need gradients)
- Very large batch sizes (memory constraint)
- Bidirectional models (each position attends to future context, so cached K,V would go stale as the sequence grows)
In Video Models
For autoregressive video models:
- Each frame has multiple patches (e.g., 8×8 = 64 patches)
- Without cache: Process all frames × all patches each step
- With cache: Only process the new frame’s patches
- Example: 10 frames × 64 patches = 640 positions
- Without cache: 640+704+768+… computations
- With cache: Only 64 new computations per step!
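In numbers, with the 64-patches-per-frame example above:

```python
patches_per_frame = 64                        # 8x8 patches per frame
for step, n_frames in enumerate(range(10, 13), start=1):
    without = n_frames * patches_per_frame    # re-encode every frame's patches
    print(f"step {step}: without cache = {without}, with cache = {patches_per_frame}")
# step 1: without cache = 640, with cache = 64
# step 2: without cache = 704, with cache = 64
# step 3: without cache = 768, with cache = 64
```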
Key Takeaway
KV cache trades memory for computation. It’s essential for efficient autoregressive generation in modern transformers, enabling real-time video generation and long-context processing.