8/7/2025

Dynamic Mask Attention: Efficient Sparse Attention for Long-Context LLMs

10 tweets
2 min read

Thrummarise

@summarizer

Large language models face a critical bottleneck: standard self-attention scales quadratically with sequence length, limiting long-context modeling. Dynamic Mask Attention (DMA) introduces trainable, content- and position-aware sparsity to tackle this challenge efficiently.

DMA dynamically generates sparse attention masks from value representations, allowing the model to adaptively focus on the most relevant tokens. Unlike static sparse patterns, this content-aware mask adjusts per input and per attention head, enhancing selective computation.
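
To make the idea concrete, here is a minimal PyTorch sketch of content-aware masking: a per-head score is computed from the value vectors, and only the top-w key positions get a zero (keep) entry in an additive mask, while the rest get -inf. The value-norm scoring and the names (`content_aware_mask`, `sparse_attention`, `w`) are illustrative stand-ins, not the paper's exact formulation, which uses a learned scoring function.

```python
import torch
import torch.nn.functional as F

def content_aware_mask(values, w):
    # values: (batch, heads, seq_len, head_dim); w: per-head retention budget.
    # Hypothetical scoring: the value-vector norm stands in for DMA's learned
    # scoring network that decides which key positions matter.
    scores = values.norm(dim=-1)                   # (batch, heads, seq_len)
    keep = scores.topk(k=min(w, scores.shape[-1]), dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))  # drop everything ...
    mask.scatter_(-1, keep, 0.0)                   # ... except the top-w keys
    return mask.unsqueeze(2)                       # broadcast over queries

def sparse_attention(q, k, v, w):
    # Ordinary scaled dot-product attention plus the dynamic additive mask.
    mask = content_aware_mask(v, w)
    logits = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(logits + mask, dim=-1) @ v
```

Because the mask is computed per head from the current input, different heads retain different key positions for the same sequence.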

Position-aware sparse attention further optimizes computation by skipping unnecessary calculations in masked regions. This dual sparsity reduces complexity from O(n²) to O(n·w), where w is a window size much smaller than the sequence length n, without sacrificing information fidelity.
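
The cost argument is easiest to see in a toy windowed computation: each of the n queries touches at most w keys, so the total work is roughly n·w dot products instead of n². This generic sliding-window loop only illustrates the complexity claim; DMA combines position-aware skipping with the content-aware mask above and fuses everything into optimized kernels.

```python
import torch
import torch.nn.functional as F

def windowed_attention(q, k, v, w):
    # q, k, v: (batch, heads, seq_len, head_dim); w: local window size.
    b, h, n, d = q.shape
    out = torch.zeros_like(q)
    for i in range(n):                      # n query positions ...
        lo = max(0, i - w + 1)              # ... each sees at most w keys
        logits = q[:, :, i:i + 1] @ k[:, :, lo:i + 1].transpose(-2, -1) / d ** 0.5
        out[:, :, i] = (F.softmax(logits, dim=-1) @ v[:, :, lo:i + 1]).squeeze(2)
    return out                              # ~n*w dot products, not n*n
```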

DMA retains a full, uncompressed key-value cache, preserving precise recall of historical information. This contrasts with state-space models that compress context but lose fine-grained details, enabling DMA to balance efficiency with accuracy in long-range dependency modeling.

Extensive experiments show DMA outperforms multi-head attention, sliding-window attention, latent attention, and native sparse attention in perplexity and associative-recall tasks across model scales from 80M to 1.7B parameters, demonstrating superior retrieval and extrapolation abilities.

DMA’s trainable sparsity is native and consistent during training and inference, avoiding post-hoc pruning pitfalls that degrade pretrained models. This unified approach supports efficient long-context pretraining, fine-tuning, and reinforcement learning.

The method is fully differentiable, ensuring smooth gradient flow despite sparsification. Theoretical analysis proves that skipping masked computations does not harm training, enabling end-to-end learning of optimal sparse attention patterns.
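
A quick way to see why gradients survive sparsification: when the mask is applied as an additive bias before the softmax, masked positions get probability exactly zero and receive zero gradient, while retained positions train as usual. The snippet below is a minimal PyTorch sanity check of that behavior, not the paper's formal argument.

```python
import torch
import torch.nn.functional as F

# Masked entries receive -inf before the softmax, so they contribute zero
# probability and drop out of the gradient; kept entries train normally.
logits = torch.randn(1, 4, requires_grad=True)
mask = torch.tensor([[0.0, 0.0, float("-inf"), float("-inf")]])

probs = F.softmax(logits + mask, dim=-1)    # masked entries -> exactly 0
loss = (probs * torch.arange(4.0)).sum()    # any downstream scalar loss
loss.backward()

print(probs)        # e.g. [p0, p1, 0., 0.]
print(logits.grad)  # nonzero for kept positions, zero for masked ones
```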

Hardware-optimized kernels for DMA leverage block-wise streaming and mask-based skipping, achieving over 10x speedups on NVIDIA RTX 4090 GPUs compared to standard attention, especially on long sequences, translating theory into practical gains.
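
The kernel idea, in host-side pseudocode: attention is computed tile by tile, and any key/value tile whose mask entries are all -inf is skipped outright, so its QK^T product is never formed; because fully masked tiles would contribute zero probability anyway, the result is unchanged. The sketch below (with assumed names such as `block_sparse_attention`, `add_mask`, and `block`) shows only the control flow; the actual kernels stream these tiles on the GPU with fused, block-wise computation.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, add_mask, block=128):
    # q, k, v: (..., seq_len, head_dim); add_mask: (seq_len, seq_len) with
    # 0 = keep and -inf = drop. Skipping all--inf tiles is exact, since those
    # tiles would contribute zero attention probability anyway.
    n_q, n_k, d = q.shape[-2], k.shape[-2], q.shape[-1]
    out = torch.zeros_like(q)
    for qs in range(0, n_q, block):
        qe = min(qs + block, n_q)
        logit_tiles, kept = [], []
        for ks in range(0, n_k, block):
            ke = min(ks + block, n_k)
            tile_mask = add_mask[qs:qe, ks:ke]
            if torch.isinf(tile_mask).all():      # fully masked tile:
                continue                          # never form its QK^T
            scores = q[..., qs:qe, :] @ k[..., ks:ke, :].transpose(-2, -1)
            logit_tiles.append(scores / d ** 0.5 + tile_mask)
            kept.append((ks, ke))
        if not logit_tiles:                       # query block attends nothing
            continue
        probs = F.softmax(torch.cat(logit_tiles, dim=-1), dim=-1)
        v_kept = torch.cat([v[..., ks:ke, :] for ks, ke in kept], dim=-2)
        out[..., qs:qe, :] = probs @ v_kept
    return out
```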

DMA excels in challenging tasks like the needle-in-a-haystack retrieval, maintaining strong performance even when context lengths exceed pretraining limits. This highlights DMA’s superior extrapolation and precise long-range information retrieval.

Future directions include adaptive window sizing to tailor sparsity to task complexity, enhanced positional encoding schemes for better length extrapolation, and extensions to multimodal inputs, broadening DMA’s applicability in advanced AI systems.
