Attention Is All You Need for KV Cache in Diffusion LLMs

Elastic-Cache: Training-Free, Architecture-Agnostic Acceleration

Quan Nguyen-Tri*¹, Mukul Ranjan*², Zhiqiang Shen²
*Equal contribution
¹FPT AI Residency, Hanoi, Vietnam
²VILA Lab, MBZUAI, Abu Dhabi, UAE
Up to 45× faster inference with higher accuracy than the baseline:
- 45.1× maximum speedup on GSM8K (512-token generation)
- 8.2× average speedup on GSM8K (256-token generation)
- 81.50% accuracy vs. the 80.36% baseline
- Higher accuracy on code generation (HumanEval)

Method Overview

Elastic-Cache is an adaptive KV-caching strategy for diffusion LLMs that cuts redundant computation while preserving generation quality. Unlike traditional decoding, which recomputes all key-value pairs at every step, Elastic-Cache decides when and where to refresh the cache based on attention dynamics.

Elastic-Cache Overview
Comparison between block-wise decoding and our sliding window approach with layer-aware cache updates.

The algorithm below shows our complete procedure for adaptive KV cache management. The key innovation lies in monitoring attention patterns of the most-attended tokens and triggering selective cache updates only when necessary, starting from deeper layers where changes are most significant.

Elastic-Cache Algorithm
The Elastic-Cache algorithm for adaptive KV cache management in diffusion LLMs.
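
To make the flow concrete, below is a minimal Python sketch of the decoding loop. The model interface (`init_cache`, `forward_with_cache`, `refresh_cache`, `unmask_step`) and the hyperparameter names (`gamma`, `window_size`, `boundary_layer`) are illustrative assumptions, not the released implementation; the sketch only shows how the attention-drift check gates the layer-aware refresh.

```python
import torch.nn.functional as F

def elastic_cache_decode(model, x, num_steps, gamma=0.9,
                         window_size=32, boundary_layer=16):
    """Sketch of adaptive KV-cache decoding: refresh deep-layer caches only
    when the attention pattern of the most-attended token drifts."""
    cache = model.init_cache(x)      # hypothetical: per-layer KV cache
    ref_attn = None                  # attention recorded at the last refresh
    for step in range(num_steps):
        # Attention is computed only inside the sliding window; positions
        # outside it (including distant MASK tokens) reuse cached K/V.
        logits, attn = model.forward_with_cache(x, cache, step, window_size)
        anchor = attn.mean(dim=(0, 1)).argmax()   # most-attended token index
        cur_attn = attn[..., anchor].flatten()
        if ref_attn is None or cur_attn.numel() != ref_attn.numel():
            ref_attn = cur_attn
        elif F.cosine_similarity(cur_attn, ref_attn, dim=0) < gamma:
            # Layer-aware refresh: keep shallow-layer caches, recompute K/V
            # from `boundary_layer` onward (the deeper layers).
            cache = model.refresh_cache(x, cache, start_layer=boundary_layer)
            ref_attn = cur_attn
        x = model.unmask_step(x, logits)  # commit the most confident predictions
    return x
```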

Empirical Analysis

Our approach is motivated by three empirical observations about diffusion LLM decoding. First, distant MASK tokens have minimal attention influence on current predictions. Second, KV drift increases with layer depth, suggesting that shallow-layer caches can be reused while deeper layers are refreshed. Third, the most-attended tokens exhibit the smallest changes, so when their attention patterns do shift appreciably, it is a reliable signal that the cached key-value states have become stale.

Motivation Analysis
Empirical analysis showing attention patterns, layer-wise KV drift, and correlation between attention and state changes.
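
The layer-wise drift observation can be probed with a simple diagnostic: compare cached key/value states at consecutive decoding steps, layer by layer. The sketch below uses random tensors as stand-ins for real hidden states; shapes and the growing noise scale are only meant to mimic the reported trend, not to reproduce it.

```python
import torch
import torch.nn.functional as F

def kv_drift_per_layer(kv_prev, kv_next):
    """Mean cosine distance between K/V states at two decoding steps, per layer."""
    drifts = []
    for (k0, v0), (k1, v1) in zip(kv_prev, kv_next):
        k_sim = F.cosine_similarity(k0.flatten(1), k1.flatten(1), dim=-1).mean()
        v_sim = F.cosine_similarity(v0.flatten(1), v1.flatten(1), dim=-1).mean()
        drifts.append(1.0 - 0.5 * (k_sim + v_sim).item())
    return drifts

# Toy setup: 8 layers, 64 tokens, 128-dim states; deeper layers get more noise,
# mimicking the observation that KV drift grows with depth.
layers, seq, dim = 8, 64, 128
kv_prev = [(torch.randn(seq, dim), torch.randn(seq, dim)) for _ in range(layers)]
kv_next = [(k + 0.05 * (i + 1) * torch.randn_like(k),
            v + 0.05 * (i + 1) * torch.randn_like(v))
           for i, (k, v) in enumerate(kv_prev)]
print(kv_drift_per_layer(kv_prev, kv_next))   # drift increases with layer index
```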

Performance Results

Our experiments show consistent speedups across multiple diffusion LLM architectures and tasks: up to 45.1× acceleration while matching or improving accuracy relative to the baseline. Speedups also grow with generation length, so longer sequences benefit the most from adaptive caching.

| Model     | Task      | Gen. Length (tokens) | Baseline Accuracy | Elastic-Cache Accuracy | Speedup |
|-----------|-----------|----------------------|-------------------|------------------------|---------|
| LLaDA     | GSM8K     | 256                  | 78.01%            | 78.24%                 | 8.2×    |
| LLaDA     | GSM8K     | 512                  | 77.10%            | 77.71%                 | 25.2×   |
| LLaDA-1.5 | GSM8K     | 512                  | 81.35%            | 81.35%                 | 45.1×   |
| LLaDA     | HumanEval | 512                  | 43.90%            | 46.34%                 | 5.0×    |
| LLaDA-1.5 | MBPP      | 512                  | 38.20%            | 39.00%                 | 32.8×   |

The comprehensive LLaDA-Instruct results below show consistent performance across different tasks and generation lengths. Our method scales particularly well with longer sequences, where the benefits of adaptive caching become more pronounced.

LLaDA Results
Comprehensive results on LLaDA-Instruct showing speedups across mathematical reasoning and code generation tasks.

Our ablation study reveals the importance of each component and hyperparameter choice. The sliding window mechanism and attention threshold γ provide tunable trade-offs between speed and accuracy, allowing practitioners to optimize for their specific requirements.

Ablation Study
Ablation study showing the impact of sliding window size and attention threshold on performance.

Key Innovations

Attention-Aware Cache Control

Monitor the most-attended tokens across layers and use their attention pattern changes as indicators for when KV cache updates are needed. This provides a lightweight, adaptive trigger that avoids unnecessary recomputation.

Layer-Aware Refresh Strategy

Selective refreshing that starts from a boundary layer and applies only to deeper layers, while reusing cached representations from shallow layers that have already converged.

Sliding-Window MASK Caching

Cache distant MASK tokens outside the active prediction window, as they primarily function as length bias rather than contributing meaningfully to current token predictions.

Training-Free Architecture

Requires no modifications to existing diffusion LLM architectures or training procedures. Can be applied as a plug-and-play acceleration technique with tunable speed-accuracy trade-offs.

Method

Elastic-Cache combines three complementary strategies to minimize redundant computation in diffusion LLM decoding while preserving generation quality.

1. Sliding Window Decoding

Use a flexible sliding window that moves through the sequence. Compute attention only for tokens in the sliding window while reusing cached KV pairs for tokens outside this window.
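
A minimal sketch of this step, assuming a single attention head and a cache laid out as plain tensors: queries are formed only for window tokens, while keys and values outside the window come straight from the cache. Names and shapes are illustrative.

```python
import torch

def windowed_attention(q_window, k_cached, v_cached, k_window, v_window):
    """Attend from window queries over cached + in-window keys/values."""
    k = torch.cat([k_cached, k_window], dim=0)   # reuse cached K outside the window
    v = torch.cat([v_cached, v_window], dim=0)   # reuse cached V outside the window
    scores = q_window @ k.T / k.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)
    return attn @ v, attn

# Toy shapes: 8 in-window tokens, 56 cached positions, 64-dim single head.
d = 64
out, attn = windowed_attention(torch.randn(8, d), torch.randn(56, d),
                               torch.randn(56, d), torch.randn(8, d),
                               torch.randn(8, d))
print(out.shape, attn.shape)   # (8, 64) outputs, (8, 64) attention weights
```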

2. Attention-Aware Monitoring

Identify the most-attended token at each layer and monitor changes in its attention pattern using cosine similarity. When the similarity falls below a threshold γ, trigger a cache update.
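
A self-contained sketch of this trigger, assuming per-layer attention weights of shape (heads, queries, keys); the threshold name `gamma` follows the γ above, and everything else is an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def needs_refresh(attn_now, attn_at_last_refresh, gamma=0.9):
    """Signal a cache update when the most-attended token's attention drifts."""
    most_attended = attn_now.mean(dim=(0, 1)).argmax()   # key receiving the most attention
    cur = attn_now[..., most_attended].flatten()
    ref = attn_at_last_refresh[..., most_attended].flatten()
    return F.cosine_similarity(cur, ref, dim=0) < gamma

# Toy check: perturbed attention maps for one layer (8 heads, 16 queries, 64 keys).
attn_ref = torch.rand(8, 16, 64).softmax(dim=-1)
attn_now = (attn_ref + 0.5 * torch.rand_like(attn_ref)).softmax(dim=-1)
print(bool(needs_refresh(attn_now, attn_ref)))
```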

3. Layer-Aware Updates

Upon detecting significant attention changes, perform selective cache refresh: reuse shallow layer caches while recomputing deeper layers that capture evolving dependencies.
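
A toy illustration of the refresh, assuming the cache is a per-layer list of (K, V) tensors and each layer exposes a K/V projection. For simplicity every layer reads the same input here; an actual refresh would replay the forward pass from the boundary layer's hidden states.

```python
import torch

def layer_aware_refresh(x, kv_layers, cache, boundary_layer):
    """Keep shallow-layer caches; recompute K/V only for layers >= boundary_layer."""
    kept = cache[:boundary_layer]                                    # shallow layers: reuse
    recomputed = [layer(x) for layer in kv_layers[boundary_layer:]]  # deep layers: refresh
    return kept + recomputed

# Toy example: 8 "layers", each mapping the same input to a (K, V) pair.
d, seq = 64, 32
kv_layers = [lambda t, Wk=torch.randn(d, d), Wv=torch.randn(d, d): (t @ Wk, t @ Wv)
             for _ in range(8)]
x = torch.randn(seq, d)
cache = [layer(x) for layer in kv_layers]                 # initial full cache
cache = layer_aware_refresh(x, kv_layers, cache, boundary_layer=5)
print(len(cache), cache[0][0].shape)                      # 8 layers, K of shape (32, 64)
```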

Citation

@article{nguyen2025elastic,
  title={Attention Is All You Need for KV Cache in Diffusion LLMs},
  author={Nguyen-Tri, Quan and Ranjan, Mukul and Shen, Zhiqiang},
  journal={arXiv preprint},
  year={2025}
}