Sparse Mixture of Attention Heads Enables 10x Context Length Scaling

Saturday, March 21, 2026

A new paper from DeepMind introduces Sparse Mixture of Attention (SMoA), which dynamically routes each token to a small subset of specialized attention heads. This allows models to process context windows up to 10x longer than standard transformers without a proportional increase in compute. The technique shows particular promise for document understanding and multi-turn conversation tasks.
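
The core routing idea can be sketched in a few lines of PyTorch. This is a minimal illustration under assumptions about the design (a learned softmax router that picks the top-k heads per token), not the paper's implementation: names such as SparseHeadRouter and num_active_heads are hypothetical, and the dense attention call stands in for a sparse kernel that would skip unrouted query/head pairs.

```python
# Minimal sketch: per-token top-k routing over attention heads (assumed design).
import torch
import torch.nn.functional as F
from torch import nn


class SparseHeadRouter(nn.Module):
    """Routes each token to a small subset of attention heads."""

    def __init__(self, d_model: int, num_heads: int, num_active_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.num_active_heads = num_active_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.router = nn.Linear(d_model, num_heads)  # one routing logit per head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        H, Dh = self.num_heads, self.head_dim

        # Per-token routing weights over heads; keep only the top-k heads.
        gate = F.softmax(self.router(x), dim=-1)                       # (B, T, H)
        topk_val, topk_idx = gate.topk(self.num_active_heads, dim=-1)
        mask = torch.zeros_like(gate).scatter(-1, topk_idx, topk_val)  # (B, T, H)

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, H, Dh).transpose(1, 2)                        # (B, H, T, Dh)
        k = k.view(B, T, H, Dh).transpose(1, 2)
        v = v.view(B, T, H, Dh).transpose(1, 2)

        # Dense attention here for clarity; a real kernel would only compute
        # the query/head pairs whose routing weight is nonzero.
        attn = F.scaled_dot_product_attention(q, k, v)                 # (B, H, T, Dh)

        # Zero out heads a token was not routed to, weight the active ones.
        attn = attn * mask.permute(0, 2, 1).unsqueeze(-1)
        return self.out(attn.transpose(1, 2).reshape(B, T, D))


# Example: each token attends through 2 of 8 heads.
layer = SparseHeadRouter(d_model=256, num_heads=8, num_active_heads=2)
y = layer(torch.randn(1, 16, 256))
print(y.shape)  # torch.Size([1, 16, 256])
```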

Key Takeaways

  • Dynamic token routing to specialized attention heads
  • 10x context length scaling with sub-linear compute growth
  • Strong results on document QA and long-form reasoning
  • Compatible with existing pre-trained models via fine-tuning