KV Cache Summary

What is KV Cache?

KV (Key-Value) cache is an optimization technique for autoregressive models that stores computed attention keys and values from previous steps to avoid recomputation.

How It Works

Without KV Cache

Step 1: Process [Frame1] → Compute K1,V1,Q1 → Generate Frame2
Step 2: Process [Frame1,Frame2] → Compute K1,V1,K2,V2,Q2 → Generate Frame3
Step 3: Process [Frame1,Frame2,Frame3] → Compute K1,V1,K2,V2,K3,V3,Q3 → Generate Frame4

Problem: K1,V1 is computed three times, K2,V2 twice, and so on; the redundant work grows quadratically with sequence length.

With KV Cache

Step 1: Process [Frame1] → Compute K1,V1,Q1 → Store K1,V1 → Generate Frame2
Step 2: Process [Frame2] → Compute only K2,V2,Q2 → Use cached K1,V1 → Generate Frame3
Step 3: Process [Frame3] → Compute only K3,V3,Q3 → Use cached K1,V1,K2,V2 → Generate Frame4

Solution: Each K,V is computed only once!
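
To make the walkthrough concrete, here is a minimal PyTorch sketch (not from any particular model): W_q, W_k, W_v and attend are hypothetical stand-ins for a single attention head, and the "frames" are random embeddings. It isolates the attention computation and checks that the cached path matches full recomputation.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    d = 8
    W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))  # hypothetical head weights
    frames = torch.randn(4, d)                              # stand-ins for frame embeddings

    def attend(q, k, v):
        # Scaled dot-product attention for a single query position.
        return F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v

    # Without cache: re-project the whole prefix at every step.
    out_full = []
    for t in range(1, len(frames) + 1):
        k, v = frames[:t] @ W_k, frames[:t] @ W_v   # K1..Kt recomputed each step
        q = frames[t - 1 : t] @ W_q
        out_full.append(attend(q, k, v))

    # With cache: project K,V once per new position and append.
    k_cache, v_cache, out_cached = torch.empty(0, d), torch.empty(0, d), []
    for t in range(len(frames)):
        new = frames[t : t + 1]
        k_cache = torch.cat([k_cache, new @ W_k])   # Kt computed exactly once
        v_cache = torch.cat([v_cache, new @ W_v])
        out_cached.append(attend(new @ W_q, k_cache, v_cache))

    # Both paths produce identical attention outputs.
    assert torch.allclose(torch.cat(out_full), torch.cat(out_cached), atol=1e-5)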

Key Implementation Details

  1. Storage Structure:
    cache = {
        'keys': tensor([batch, n_heads, seq_len, d_head]),
        'values': tensor([batch, n_heads, seq_len, d_head])
    }
    
  2. Update Process:
    # Compute K,V only for the new positions
    K_new = compute_keys(new_input)      # shape [batch, n_heads, n_new, d_head]
    V_new = compute_values(new_input)

    # Concatenate with the cached history along the sequence dimension
    K_all = concat([cache['keys'], K_new], dim=2)
    V_all = concat([cache['values'], V_new], dim=2)

    # Update the cache so the next step sees the full history
    cache['keys'], cache['values'] = K_all, V_all
    
  3. Attention Computation:
    # Q is computed fresh for the current position only
    Q = compute_queries(current_input)

    # Attend over all K,V (cached + new), scaled by 1/sqrt(d_head)
    attention = softmax(Q @ K_all.transpose(-2, -1) / sqrt(d_head)) @ V_all
    # (A consolidated, runnable version of items 1-3 follows this list.)
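
Putting items 1-3 together, here is a minimal runnable sketch, assuming PyTorch and the [batch, n_heads, seq_len, d_head] layout above; KVCache and cached_attention are illustrative names, not a specific library's API. No causal mask is needed because the single query position only ever sees past and current K,V.

    import torch
    import torch.nn.functional as F

    class KVCache:
        # Item 1: one cache per attention layer; seq_len starts at 0.
        def __init__(self, batch, n_heads, d_head):
            self.keys = torch.empty(batch, n_heads, 0, d_head)
            self.values = torch.empty(batch, n_heads, 0, d_head)

        def update(self, k_new, v_new):
            # Item 2: append the new positions along the sequence dim (dim=2).
            self.keys = torch.cat([self.keys, k_new], dim=2)
            self.values = torch.cat([self.values, v_new], dim=2)
            return self.keys, self.values

    def cached_attention(q, cache, k_new, v_new):
        # Item 3: fresh Q for the current position, cached + new K,V.
        k_all, v_all = cache.update(k_new, v_new)
        scores = q @ k_all.transpose(-2, -1) / q.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ v_all

    # One decode step: batch=2, n_heads=4, d_head=16, one new position.
    cache = KVCache(batch=2, n_heads=4, d_head=16)
    q = torch.randn(2, 4, 1, 16)
    out = cached_attention(q, cache, torch.randn(2, 4, 1, 16), torch.randn(2, 4, 1, 16))
    print(out.shape, cache.keys.shape)  # both torch.Size([2, 4, 1, 16])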
    

Performance Impact

From our demo, the pattern is the general one:

- Without a cache, step t re-projects all t prefix positions, so n generation steps cost O(n²) K/V projections and O(n³) total attention work.
- With a cache, each step projects only the new position and attends one query against t keys, for O(n) projections and O(n²) total attention work.
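
To see the projection savings directly, here is a rough micro-benchmark sketch (hypothetical, not the original demo; real implementations also preallocate the cache instead of concatenating in a loop):

    import time
    import torch

    d, n = 64, 1024
    W_k, W_v = torch.randn(d, d), torch.randn(d, d)
    x = torch.randn(n, d)

    t0 = time.perf_counter()
    for t in range(1, n + 1):
        k, v = x[:t] @ W_k, x[:t] @ W_v        # no cache: re-project the whole prefix
    no_cache = time.perf_counter() - t0

    t0 = time.perf_counter()
    k, v = torch.empty(0, d), torch.empty(0, d)
    for t in range(n):
        k = torch.cat([k, x[t : t + 1] @ W_k])  # cache: project only the new position
        v = torch.cat([v, x[t : t + 1] @ W_v])
    with_cache = time.perf_counter() - t0

    print(f"no cache: {no_cache:.3f}s   with cache: {with_cache:.3f}s")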

Memory Cost

The trade is memory: the cache stores K and V for every layer and every past position,

    cache_bytes ≈ 2 × n_layers × batch × n_heads × seq_len × d_head × bytes_per_element

so it grows linearly with sequence length and can dominate GPU memory at long contexts.
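
A quick sanity-check calculator; the layer/head numbers below are illustrative assumptions (roughly GPT-2-small scale), with fp16 storage:

    def kv_cache_bytes(n_layers, batch, n_heads, seq_len, d_head, bytes_per_elem=2):
        # 2x for keys and values; bytes_per_elem=2 assumes fp16 storage.
        return 2 * n_layers * batch * n_heads * seq_len * d_head * bytes_per_elem

    # 12 layers, 12 heads, d_head=64, 1024 cached positions, batch=1:
    print(kv_cache_bytes(12, 1, 12, 1024, 64) / 2**20, "MiB")  # 36.0 MiB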

When to Use

✅ Use KV Cache for:

- Autoregressive inference: token-by-token text decoding, frame-by-frame video generation, anything where output grows one step at a time
- Long prefixes, where re-projecting the whole history at every step dominates the cost
- Latency-sensitive serving, where per-step work must stay roughly constant

❌ Don’t use for:

- Training with teacher forcing: all positions are processed in one parallel pass, so there is nothing to reuse across steps
- Bidirectional or encoder-only models (BERT-style), which are not autoregressive
- Settings so memory-constrained that the cache cannot fit; recomputation can be the better trade there

In Video Models

For autoregressive video models:

- Each new frame contributes many tokens at once, so the cache grows far faster than in text models.
- The cache is what makes frame-by-frame generation cheap: a new frame processes only its own tokens while attending to the cached K,V of every earlier frame (the Frame1 → Frame4 walkthrough above).
- Long videos need a bounded cache, typically a sliding window over the most recent frames; see the sketch below.
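
A minimal sliding-window trim, assuming the [batch, n_heads, seq_len, d_head] layout from earlier; trim_cache, tokens_per_frame, and max_frames are hypothetical names:

    import torch

    def trim_cache(keys, values, tokens_per_frame, max_frames):
        # Keep only the most recent max_frames worth of positions (dim=2 is seq_len).
        window = tokens_per_frame * max_frames
        return keys[:, :, -window:, :], values[:, :, -window:, :]

    keys = torch.randn(1, 8, 10 * 256, 64)    # 10 cached frames, 256 tokens each
    values = torch.randn(1, 8, 10 * 256, 64)
    keys, values = trim_cache(keys, values, tokens_per_frame=256, max_frames=4)
    print(keys.shape)  # torch.Size([1, 8, 1024, 64])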

Key Takeaway

KV cache trades memory for computation. It’s essential for efficient autoregressive generation in modern transformers, enabling real-time video generation and long-context processing.