[Results figure: headline accuracy on GSM8K (256 and 512 tokens) and HumanEval, Elastic-Cache vs. baseline decoding]
Method Overview
Our Elastic-Cache approach introduces a novel adaptive caching strategy for diffusion LLMs that significantly reduces computational overhead while maintaining generation quality. Unlike traditional methods that recompute all key-value pairs at every step, our approach intelligently decides when and where to refresh the cache based on attention dynamics.

The algorithm below shows our complete procedure for adaptive KV cache management. The key innovation lies in monitoring attention patterns of the most-attended tokens and triggering selective cache updates only when necessary, starting from deeper layers where changes are most significant.
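As a rough Python sketch of this control flow, the loop below relies on an illustrative model interface (prefill, step, unmask, refresh_from_layer) and leaves the helpers most_attended_profile and cosine_sim undefined; all of these names are hypothetical stand-ins rather than our actual implementation, and a concrete version of the trigger is sketched under Attention-Aware Monitoring below.

```python
# Hypothetical sketch of the adaptive decode loop; the model interface and
# helper functions are illustrative stand-ins, not a reference implementation.

def elastic_cache_decode(model, seq, num_steps, window_size, gamma, boundary_layer):
    cache = model.prefill(seq)      # initial KV cache for every layer
    anchor = None                   # attention profile recorded at the last refresh

    for _ in range(num_steps):
        window = model.current_window(seq, window_size)   # active prediction span
        # Recompute attention only inside the window; tokens outside it
        # (including distant MASK tokens) reuse their cached key/value entries.
        logits, attn = model.step(seq, cache, window)
        seq = model.unmask(seq, logits, window)           # reveal confident tokens

        # Attention-aware trigger: has the most-attended token's attention
        # pattern drifted since the cache was last refreshed?
        current = most_attended_profile(attn)
        if anchor is None:
            anchor = current
        elif cosine_sim(current, anchor) < gamma:
            # Layer-aware refresh: keep shallow-layer caches, recompute only
            # layers at or beyond the boundary layer.
            cache = model.refresh_from_layer(seq, cache, boundary_layer)
            anchor = current
    return seq
```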

Empirical Analysis
Our approach is motivated by three key empirical observations about diffusion LLM decoding. First, distant MASK tokens have minimal attention influence on current predictions. Second, KV drift increases with layer depth, suggesting selective refresh strategies. Third, the most-attended tokens exhibit the smallest changes, making them ideal indicators for cache validity.
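As a rough illustration of the second observation (not the exact measurement protocol from the paper), the snippet below scores per-layer KV drift as the mean cosine distance between cached and freshly recomputed key states; the random tensors are stand-ins for real per-layer keys.

```python
import torch
import torch.nn.functional as F

def kv_drift(cached_k: torch.Tensor, fresh_k: torch.Tensor) -> float:
    """Mean cosine distance between cached and recomputed key states.

    Both tensors have shape (num_tokens, head_dim); larger values mean the
    cached entries have drifted further from a full recompute.
    """
    sim = F.cosine_similarity(cached_k, fresh_k, dim=-1)  # per-token similarity
    return (1.0 - sim).mean().item()

# Toy example: random stand-ins for per-layer key states, with drift that
# grows with depth to mimic the observed trend.
torch.manual_seed(0)
num_layers, num_tokens, head_dim = 4, 16, 64
for layer in range(num_layers):
    cached = torch.randn(num_tokens, head_dim)
    fresh = cached + 0.05 * (layer + 1) * torch.randn(num_tokens, head_dim)
    print(f"layer {layer}: drift = {kv_drift(cached, fresh):.4f}")
```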

Performance Results
Our experimental evaluation demonstrates consistent speedups across multiple diffusion LLM architectures and tasks. We achieve up to 45.1× acceleration while often maintaining or even improving accuracy compared to baseline models. Longer generation lengths benefit more from our approach, indicating that the method scales well to long sequences.
| Model | Task | Gen. Length (tokens) | Baseline Accuracy | Elastic-Cache Accuracy | Speedup |
|---|---|---|---|---|---|
| LLaDA | GSM8K | 256 | 78.01% | 78.24% | 8.2× |
| LLaDA | GSM8K | 512 | 77.10% | 77.71% | 25.2× |
| LLaDA-1.5 | GSM8K | 512 | 81.35% | 81.35% | 45.1× |
| LLaDA | HumanEval | 512 | 43.90% | 46.34% | 5.0× |
| LLaDA-1.5 | MBPP | 512 | 38.20% | 39.00% | 32.8× |
The comprehensive LLaDA-Instruct results below show consistent performance across different tasks and generation lengths. Our method scales particularly well with longer sequences, where the benefits of adaptive caching become more pronounced.

Our ablation study reveals the importance of each component and hyperparameter choice. The sliding window mechanism and attention threshold γ provide tunable trade-offs between speed and accuracy, allowing practitioners to optimize for their specific requirements.
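To make these knobs concrete, a minimal configuration object might look like the sketch below; the field names and default values are purely illustrative and are not the settings used in our experiments.

```python
from dataclasses import dataclass

@dataclass
class ElasticCacheConfig:
    # Illustrative knobs only; names and defaults do not reflect the paper's settings.
    window_size: int = 32      # tokens decoded per sliding-window step
    gamma: float = 0.9         # attention-similarity threshold that triggers a refresh
    boundary_layer: int = 16   # refresh applies to this layer and all deeper layers

# A lower gamma tolerates more attention drift before refreshing, trading
# accuracy for speed; a higher gamma refreshes more often.
fast = ElasticCacheConfig(gamma=0.8)
conservative = ElasticCacheConfig(gamma=0.95)
```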

Key Innovations
Attention-Aware Cache Control
Monitor the most-attended tokens across layers and use their attention pattern changes as indicators for when KV cache updates are needed. This provides a lightweight, adaptive trigger that avoids unnecessary recomputation.
Layer-Aware Refresh Strategy
Selective refreshing that starts from a boundary layer and applies only to deeper layers, while reusing cached representations from shallow layers that have already converged.
Sliding-Window MASK Caching
Cache distant MASK tokens outside the active prediction window, as they primarily function as length bias rather than contributing meaningfully to current token predictions.
Training-Free Architecture
Requires no modifications to existing diffusion LLM architectures or training procedures. Can be applied as a plug-and-play acceleration technique with tunable speed-accuracy trade-offs.
Method
Elastic-Cache combines three complementary strategies to minimize redundant computation in diffusion LLM decoding while preserving generation quality.
Sliding Window Decoding
Use a flexible sliding window that moves through the sequence. Compute attention only for tokens in the sliding window while reusing cached KV pairs for tokens outside this window.
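A minimal single-head sketch of this computation with plain torch tensors (shapes, the shared hidden states, and the toy cache are illustrative): fresh queries, keys, and values are computed only for positions inside the window, while positions outside it reuse cached keys and values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, head_dim = 32, 64
lo, hi = 12, 20                                   # active sliding window [lo, hi)

hidden = torch.randn(seq_len, head_dim)           # toy hidden states, single head
w_q = torch.randn(head_dim, head_dim)
w_k = torch.randn(head_dim, head_dim)
w_v = torch.randn(head_dim, head_dim)

# Cache from an earlier step / the last refresh (here computed from the same
# toy hidden states; in a real decoder these would be older activations).
cached_k, cached_v = hidden @ w_k, hidden @ w_v

# Only window tokens get fresh Q/K/V; everything outside reuses the cache.
q_win = hidden[lo:hi] @ w_q
k = torch.cat([cached_k[:lo], hidden[lo:hi] @ w_k, cached_k[hi:]], dim=0)
v = torch.cat([cached_v[:lo], hidden[lo:hi] @ w_v, cached_v[hi:]], dim=0)

attn = F.softmax(q_win @ k.T / head_dim ** 0.5, dim=-1)  # (window, seq_len)
out_win = attn @ v                                        # outputs for window tokens only
print(out_win.shape)                                      # torch.Size([8, 64])
```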
Attention-Aware Monitoring
Identify the most-attended token at each layer and monitor changes in its attention pattern using cosine similarity. When the similarity falls below a threshold γ, trigger a cache update.
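A small sketch of this trigger, assuming head-averaged attention maps of shape (queries, keys); the helper names and the example threshold are illustrative.

```python
import torch
import torch.nn.functional as F

def most_attended_profile(attn: torch.Tensor) -> torch.Tensor:
    """attn: (queries, keys), averaged over heads. Return the attention profile
    of the key position that receives the most total attention."""
    idx = attn.sum(dim=0).argmax()
    return attn[:, idx]

def needs_refresh(prev_attn: torch.Tensor, curr_attn: torch.Tensor, gamma: float) -> bool:
    """Trigger a cache update when the most-attended token's attention profile
    has drifted, i.e. its cosine similarity across steps drops below gamma."""
    sim = F.cosine_similarity(most_attended_profile(prev_attn),
                              most_attended_profile(curr_attn), dim=0)
    return bool(sim < gamma)

torch.manual_seed(0)
prev = torch.softmax(torch.randn(16, 16), dim=-1)   # attention map at the last refresh
curr = torch.softmax(torch.randn(16, 16), dim=-1)   # attention map at the current step
print(needs_refresh(prev, curr, gamma=0.9))
```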
Layer-Aware Updates
Upon detecting significant attention changes, perform selective cache refresh: reuse shallow layer caches while recomputing deeper layers that capture evolving dependencies.
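A toy sketch of such a refresh, assuming the cache is stored as one (K, V) pair per layer and using a single shared hidden state for simplicity (a real model would feed each layer its own hidden states): layers below the boundary keep their cached entries and only deeper layers are recomputed.

```python
import torch

torch.manual_seed(0)
num_layers, seq_len, head_dim = 6, 32, 64
boundary_layer = 3                                 # refresh this layer and all deeper ones

# Toy per-layer projections and hidden states (stand-ins for a real model).
w_k = [torch.randn(head_dim, head_dim) for _ in range(num_layers)]
w_v = [torch.randn(head_dim, head_dim) for _ in range(num_layers)]
hidden = torch.randn(seq_len, head_dim)

# Existing cache: one (K, V) pair per layer.
cache = [(hidden @ w_k[l], hidden @ w_v[l]) for l in range(num_layers)]

def refresh_from_layer(cache, hidden, boundary_layer):
    """Reuse shallow-layer caches as-is; recompute K/V only for deeper layers."""
    refreshed = []
    for layer, (k, v) in enumerate(cache):
        if layer < boundary_layer:
            refreshed.append((k, v))                                      # reuse
        else:
            refreshed.append((hidden @ w_k[layer], hidden @ w_v[layer]))  # recompute
    return refreshed

new_cache = refresh_from_layer(cache, hidden, boundary_layer)
print(len(new_cache), new_cache[boundary_layer][0].shape)  # 6 torch.Size([32, 64])
```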
Citation
@article{nguyentri2025attention,
  title={Attention Is All You Need for KV Cache in Diffusion LLMs},
  author={Nguyen-Tri, Quan and Ranjan, Mukul and Shen, Zhiqiang},
  journal={arXiv preprint},
  year={2025}
}