KV Cache Summary
What is KV Cache?
KV (Key-Value) caching is an optimization technique for autoregressive models: the attention keys and values computed at earlier steps are stored and reused, so they never have to be recomputed at later steps.
How It Works
Without KV Cache
Step 1: Process [Frame1] → Compute K1,V1,Q1 → Generate Frame2
Step 2: Process [Frame1,Frame2] → Compute K1,V1,K2,V2,Q2 → Generate Frame3
Step 3: Process [Frame1,Frame2,Frame3] → Compute K1,V1,K2,V2,K3,V3,Q3 → Generate Frame4
Problem: K1,V1 are computed three times, K2,V2 twice, and so on; every step re-encodes the entire prefix from scratch, as the sketch below makes concrete.
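A minimal NumPy sketch of this naive loop (square projection matrices, a single head, and the function and weight names are simplifying assumptions for illustration):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (causal mask omitted for brevity;
    only the last position's output is used below)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def generate_without_cache(frames, W_q, W_k, W_v, n_steps):
    """Naive loop: re-encodes the ENTIRE prefix at every step."""
    for _ in range(n_steps):
        X = np.stack(frames)                 # [seq_len, d_model]
        Q, K, V = X @ W_q, X @ W_k, X @ W_v  # K,V recomputed for ALL positions
        out = attention(Q, K, V)
        frames.append(out[-1])               # next frame from the last position
    return frames
```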
With KV Cache
Step 1: Process [Frame1] → Compute K1,V1,Q1 → Store K1,V1 → Generate Frame2
Step 2: Process [Frame2] → Compute only K2,V2,Q2 → Use cached K1,V1 → Generate Frame3
Step 3: Process [Frame3] → Compute only K3,V3,Q3 → Use cached K1,V1,K2,V2 → Generate Frame4
Solution: Each K,V is computed only once!
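The cached counterpart, reusing the `attention` helper and illustrative weight names from the previous sketch; note that each position's K,V are computed exactly once and only ever appended to:

```python
def generate_with_cache(frames, W_q, W_k, W_v, n_steps):
    """Cached loop: K,V for every position are computed exactly once."""
    X = np.stack(frames)
    K_cache, V_cache = X @ W_k, X @ W_v        # encode the prompt once
    x = frames[-1]
    for _ in range(n_steps):
        q = (x @ W_q)[None, :]                 # query for the newest position only
        x = attention(q, K_cache, V_cache)[0]  # attends over all cached K,V
        frames.append(x)
        # extend the cache with this position's K,V; never recomputed
        K_cache = np.vstack([K_cache, (x @ W_k)[None, :]])
        V_cache = np.vstack([V_cache, (x @ W_v)[None, :]])
    return frames
```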
Key Implementation Details
- Storage Structure:
```python
cache = {
    'keys':   tensor([batch, n_heads, seq_len, d_head]),
    'values': tensor([batch, n_heads, seq_len, d_head]),
}
```
- Update Process:
```python
# Compute K,V only for the new positions
K_new = compute_keys(new_input)
V_new = compute_values(new_input)

# Concatenate with the cached tensors
K_all = concat([K_cached, K_new])
V_all = concat([V_cached, V_new])

# Update the cache
cache.update(K_new, V_new)
```
- Attention Computation:
```python
# Q is always computed fresh, for the current position only
Q = compute_queries(current_input)

# Attend over all K,V (cached + new)
attention = softmax(Q @ K_all.T) @ V_all
```
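Putting these pieces together, here is a minimal per-layer cache object (a sketch: the `KVCache` name and the single-head `[seq_len, d_head]` layout are simplifications of the `[batch, n_heads, seq_len, d_head]` structure above):

```python
import numpy as np

class KVCache:
    """Per-layer KV cache: stores keys/values and grows by one step at a time."""
    def __init__(self):
        self.keys = None       # [seq_len_so_far, d_head] once populated
        self.values = None

    def update(self, K_new, V_new):
        """Append this step's K,V and return the full (cached + new) tensors."""
        if self.keys is None:
            self.keys, self.values = K_new, V_new
        else:
            self.keys = np.concatenate([self.keys, K_new], axis=0)
            self.values = np.concatenate([self.values, V_new], axis=0)
        return self.keys, self.values
```

One such object lives in each attention layer; `update()` returns the full `K_all, V_all` ready for the attention computation above.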
Performance Impact
From our demo:
- Without cache: 300 position computations (3+4+5+…+12) × 4 layers
- With cache: 52 position computations (3+10) × 4 layers
- Speedup: 3.97x
- Compute saved: 82.7%
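Those counts can be reproduced directly, counting positions the way the demo does (a 3-frame prompt plus generated steps, times 4 layers). The measured 3.97x wall-clock speedup is lower than the 300/52 ≈ 5.8x reduction in position computations; wall-clock speedups typically lag the raw compute ratio because of per-step overhead.

```python
n_layers = 4
without_cache = sum(range(3, 13)) * n_layers  # (3+4+...+12) positions over 10 steps
with_cache = (3 + 10) * n_layers              # prompt once, then one position per step
print(without_cache, with_cache)              # 300 52
print(f"compute saved: {1 - with_cache / without_cache:.1%}")  # 82.7%
```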
Memory Cost
- Per layer:
seq_length × 2 × d_model × sizeof(float)
- Total:
n_layers × seq_length × 2 × d_model × sizeof(float)
- Example: 12 layers, 1000 positions, d_model=512, float32 (4 bytes) → ~49 MB
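As a quick sanity check of that example (the factor of 2 counts keys plus values; the helper name is illustrative):

```python
def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_float=4):
    """Total KV cache size across all layers: keys + values for every position."""
    return n_layers * seq_len * 2 * d_model * bytes_per_float

print(kv_cache_bytes(12, 1000, 512) / 1e6)   # 49.152 -> ~49 MB
```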
When to Use
✅ Use KV Cache for:
- Inference/generation (biggest benefit)
- Long sequences (more savings)
- Beam search (share cache across beams)
- Streaming generation
❌ Don’t use for:
- Training (need gradients)
- Very large batch sizes (memory constraint)
- Bidirectional models (each position attends to future context, so cached K,V would go stale as the sequence grows)
In Video Models
For autoregressive video models:
- Each frame has multiple patches (e.g., 8×8 = 64 patches)
- Without cache: Process all frames × all patches each step
- With cache: Only process the new frame’s patches
- Example: 10 frames × 64 patches = 640 positions
- Without cache: 640+704+768+… computations
- With cache: Only 64 new computations per step!
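In numbers, with the 64-patches-per-frame example above:

```python
patches_per_frame = 64                        # 8x8 patches per frame
for step, n_frames in enumerate(range(10, 13), start=1):
    without = n_frames * patches_per_frame    # re-encode every frame's patches
    print(f"step {step}: without cache = {without}, with cache = {patches_per_frame}")
# step 1: without cache = 640, with cache = 64
# step 2: without cache = 704, with cache = 64
# step 3: without cache = 768, with cache = 64
```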
Key Takeaway
KV cache trades memory for computation. It’s essential for efficient autoregressive generation in modern transformers, enabling real-time video generation and long-context processing.