Attention Is All You Need for KV Cache in Diffusion LLMs

Elastic-Cache: Training-Free, Architecture-Agnostic Acceleration

Quan Nguyen-Tri*¹, Mukul Ranjan*², Zhiqiang Shen²
*Equal contribution
¹FPT AI Residency, Hanoi, Vietnam
²VILA Lab, MBZUAI, Abu Dhabi, UAE
Up to 45× faster inference with higher accuracy than the baseline:
- 45.1× maximum speedup on GSM8K (512-token generation)
- 8.2× average speedup on GSM8K (256-token generation)
- 81.50% accuracy vs. the 80.36% baseline
- Higher accuracy on code generation (HumanEval)

Method Overview

Elastic-Cache is an adaptive KV-caching strategy for diffusion LLMs that cuts redundant computation while preserving generation quality. Unlike traditional decoding, which recomputes all key-value pairs at every step, Elastic-Cache decides when and where to refresh the cache based on attention dynamics.

Elastic-Cache Overview
Comparison between block-wise decoding and our sliding window approach with layer-aware cache updates.

The algorithm below shows our complete procedure for adaptive KV cache management. The key innovation lies in monitoring attention patterns of the most-attended tokens and triggering selective cache updates only when necessary, starting from deeper layers where changes are most significant.

Elastic-Cache Algorithm
The Elastic-Cache algorithm for adaptive KV cache management in diffusion LLMs.
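
To make the flow concrete, below is a minimal Python sketch of the decoding loop. The model interface (`init_cache`, `forward_with_cache`, `refresh_cache`, `unmask_step`) and the hyperparameter names (`gamma`, `window_size`, `boundary_layer`) are illustrative assumptions, not the released implementation; the sketch only shows how the attention-drift check gates the layer-aware refresh.

```python
import torch.nn.functional as F

def elastic_cache_decode(model, x, num_steps, gamma=0.9,
                         window_size=32, boundary_layer=16):
    """Sketch of adaptive KV-cache decoding: refresh deep-layer caches only
    when the attention pattern of the most-attended token drifts."""
    cache = model.init_cache(x)      # hypothetical: per-layer KV cache
    ref_attn = None                  # attention recorded at the last refresh
    for step in range(num_steps):
        # Attention is computed only inside the sliding window; positions
        # outside it (including distant MASK tokens) reuse cached K/V.
        logits, attn = model.forward_with_cache(x, cache, step, window_size)
        anchor = attn.mean(dim=(0, 1)).argmax()   # most-attended token index
        cur_attn = attn[..., anchor].flatten()
        if ref_attn is None or cur_attn.numel() != ref_attn.numel():
            ref_attn = cur_attn
        elif F.cosine_similarity(cur_attn, ref_attn, dim=0) < gamma:
            # Layer-aware refresh: keep shallow-layer caches, recompute K/V
            # from `boundary_layer` onward (the deeper layers).
            cache = model.refresh_cache(x, cache, start_layer=boundary_layer)
            ref_attn = cur_attn
        x = model.unmask_step(x, logits)  # commit the most confident predictions
    return x
```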

Empirical Analysis

Our approach is motivated by three empirical observations about diffusion LLM decoding. First, distant MASK tokens have minimal attention influence on current predictions. Second, KV drift increases with layer depth, suggesting that shallow-layer caches can be reused while deeper layers are refreshed. Third, the most-attended tokens exhibit the smallest changes, so when their attention patterns do shift appreciably, it is a reliable signal that the cached key-value states have become stale.

Motivation Analysis
Empirical analysis showing attention patterns, layer-wise KV drift, and correlation between attention and state changes.
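
The layer-wise drift observation can be probed with a simple diagnostic: compare cached key/value states at consecutive decoding steps, layer by layer. The sketch below uses random tensors as stand-ins for real hidden states; shapes and the growing noise scale are only meant to mimic the reported trend, not to reproduce it.

```python
import torch
import torch.nn.functional as F

def kv_drift_per_layer(kv_prev, kv_next):
    """Mean cosine distance between K/V states at two decoding steps, per layer."""
    drifts = []
    for (k0, v0), (k1, v1) in zip(kv_prev, kv_next):
        k_sim = F.cosine_similarity(k0.flatten(1), k1.flatten(1), dim=-1).mean()
        v_sim = F.cosine_similarity(v0.flatten(1), v1.flatten(1), dim=-1).mean()
        drifts.append(1.0 - 0.5 * (k_sim + v_sim).item())
    return drifts

# Toy setup: 8 layers, 64 tokens, 128-dim states; deeper layers get more noise,
# mimicking the observation that KV drift grows with depth.
layers, seq, dim = 8, 64, 128
kv_prev = [(torch.randn(seq, dim), torch.randn(seq, dim)) for _ in range(layers)]
kv_next = [(k + 0.05 * (i + 1) * torch.randn_like(k),
            v + 0.05 * (i + 1) * torch.randn_like(v))
           for i, (k, v) in enumerate(kv_prev)]
print(kv_drift_per_layer(kv_prev, kv_next))   # drift increases with layer index
```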

Performance Results

Our experiments show consistent speedups across multiple diffusion LLM architectures and tasks: up to 45.1× acceleration while matching or improving accuracy relative to the baseline. Speedups also grow with generation length, so longer sequences benefit the most from adaptive caching.

| Model     | Task      | Gen. Length (tokens) | Baseline Accuracy | Elastic-Cache Accuracy | Speedup |
|-----------|-----------|----------------------|-------------------|------------------------|---------|
| LLaDA     | GSM8K     | 256                  | 78.01%            | 78.24%                 | 8.2×    |
| LLaDA     | GSM8K     | 512                  | 77.10%            | 77.71%                 | 25.2×   |
| LLaDA-1.5 | GSM8K     | 512                  | 81.35%            | 81.35%                 | 45.1×   |
| LLaDA     | HumanEval | 512                  | 43.90%            | 46.34%                 | 5.0×    |
| LLaDA-1.5 | MBPP      | 512                  | 38.20%            | 39.00%                 | 32.8×   |

The comprehensive LLaDA-Instruct results below show consistent performance across different tasks and generation lengths. Our method scales particularly well with longer sequences, where the benefits of adaptive caching become more pronounced.

LLaDA Results
Comprehensive results on LLaDA-Instruct showing speedups across mathematical reasoning and code generation tasks.

Our ablation study reveals the importance of each component and hyperparameter choice. The sliding window mechanism and attention threshold γ provide tunable trade-offs between speed and accuracy, allowing practitioners to optimize for their specific requirements.

Ablation Study
Ablation study showing the impact of sliding window size and attention threshold on performance.

Key Innovations

Attention-Aware Cache Control

Monitor the most-attended tokens across layers and use their attention pattern changes as indicators for when KV cache updates are needed. This provides a lightweight, adaptive trigger that avoids unnecessary recomputation.

Layer-Aware Refresh Strategy

Selective refreshing that starts from a boundary layer and applies only to deeper layers, while reusing cached representations from shallow layers that have already converged.

Sliding-Window MASK Caching

Cache distant MASK tokens outside the active prediction window, as they primarily function as length bias rather than contributing meaningfully to current token predictions.

Training-Free Architecture

Requires no modifications to existing diffusion LLM architectures or training procedures. Can be applied as a plug-and-play acceleration technique with tunable speed-accuracy trade-offs.

Method

Elastic-Cache combines three complementary strategies to minimize redundant computation in diffusion LLM decoding while preserving generation quality.

1. Sliding Window Decoding

Use a flexible sliding window that moves through the sequence. Compute attention only for tokens in the sliding window while reusing cached KV pairs for tokens outside this window.
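
A minimal sketch of this step, assuming a single attention head and a cache laid out as plain tensors: queries are formed only for window tokens, while keys and values outside the window come straight from the cache. Names and shapes are illustrative.

```python
import torch

def windowed_attention(q_window, k_cached, v_cached, k_window, v_window):
    """Attend from window queries over cached + in-window keys/values."""
    k = torch.cat([k_cached, k_window], dim=0)   # reuse cached K outside the window
    v = torch.cat([v_cached, v_window], dim=0)   # reuse cached V outside the window
    scores = q_window @ k.T / k.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)
    return attn @ v, attn

# Toy shapes: 8 in-window tokens, 56 cached positions, 64-dim single head.
d = 64
out, attn = windowed_attention(torch.randn(8, d), torch.randn(56, d),
                               torch.randn(56, d), torch.randn(8, d),
                               torch.randn(8, d))
print(out.shape, attn.shape)   # (8, 64) outputs, (8, 64) attention weights
```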

2. Attention-Aware Monitoring

Identify the most-attended token at each layer and monitor changes in its attention pattern using cosine similarity. When the similarity falls below a threshold γ, trigger a cache update.
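
A self-contained sketch of this trigger, assuming per-layer attention weights of shape (heads, queries, keys); the threshold name `gamma` follows the γ above, and everything else is an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def needs_refresh(attn_now, attn_at_last_refresh, gamma=0.9):
    """Signal a cache update when the most-attended token's attention drifts."""
    most_attended = attn_now.mean(dim=(0, 1)).argmax()   # key receiving the most attention
    cur = attn_now[..., most_attended].flatten()
    ref = attn_at_last_refresh[..., most_attended].flatten()
    return F.cosine_similarity(cur, ref, dim=0) < gamma

# Toy check: perturbed attention maps for one layer (8 heads, 16 queries, 64 keys).
attn_ref = torch.rand(8, 16, 64).softmax(dim=-1)
attn_now = (attn_ref + 0.5 * torch.rand_like(attn_ref)).softmax(dim=-1)
print(bool(needs_refresh(attn_now, attn_ref)))
```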

3. Layer-Aware Updates

Upon detecting significant attention changes, perform selective cache refresh: reuse shallow layer caches while recomputing deeper layers that capture evolving dependencies.
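
A toy illustration of the refresh, assuming the cache is a per-layer list of (K, V) tensors and each layer exposes a K/V projection. For simplicity every layer reads the same input here; an actual refresh would replay the forward pass from the boundary layer's hidden states.

```python
import torch

def layer_aware_refresh(x, kv_layers, cache, boundary_layer):
    """Keep shallow-layer caches; recompute K/V only for layers >= boundary_layer."""
    kept = cache[:boundary_layer]                                    # shallow layers: reuse
    recomputed = [layer(x) for layer in kv_layers[boundary_layer:]]  # deep layers: refresh
    return kept + recomputed

# Toy example: 8 "layers", each mapping the same input to a (K, V) pair.
d, seq = 64, 32
kv_layers = [lambda t, Wk=torch.randn(d, d), Wv=torch.randn(d, d): (t @ Wk, t @ Wv)
             for _ in range(8)]
x = torch.randn(seq, d)
cache = [layer(x) for layer in kv_layers]                 # initial full cache
cache = layer_aware_refresh(x, kv_layers, cache, boundary_layer=5)
print(len(cache), cache[0][0].shape)                      # 8 layers, K of shape (32, 64)
```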

Citation

@article{nguyen2025elastic,
  title={Attention Is All You Need for KV Cache in Diffusion LLMs},
  author={Nguyen-Tri, Quan and Ranjan, Mukul and Shen, Zhiqiang},
  journal={arXiv preprint},
  year={2025}
}