Sparse Mixture of Attention Heads Enables 10x Context Length Scaling

Saturday, March 21, 2026

A new paper from DeepMind introduces Sparse Mixture of Attention (SMoA), which dynamically routes each token to a small subset of specialized attention heads. This allows models to process context windows up to 10x longer than standard transformers without a proportional increase in compute. The technique shows particular promise for document understanding and multi-turn conversation tasks.
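
The core routing idea can be sketched in a few lines of PyTorch. This is a minimal illustration under assumptions about the design (a learned softmax router that picks the top-k heads per token), not the paper's implementation: names such as SparseHeadRouter and num_active_heads are hypothetical, and the dense attention call stands in for a sparse kernel that would skip unrouted query/head pairs.

```python
# Minimal sketch: per-token top-k routing over attention heads (assumed design).
import torch
import torch.nn.functional as F
from torch import nn


class SparseHeadRouter(nn.Module):
    """Routes each token to a small subset of attention heads."""

    def __init__(self, d_model: int, num_heads: int, num_active_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.num_active_heads = num_active_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.router = nn.Linear(d_model, num_heads)  # one routing logit per head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        H, Dh = self.num_heads, self.head_dim

        # Per-token routing weights over heads; keep only the top-k heads.
        gate = F.softmax(self.router(x), dim=-1)                       # (B, T, H)
        topk_val, topk_idx = gate.topk(self.num_active_heads, dim=-1)
        mask = torch.zeros_like(gate).scatter(-1, topk_idx, topk_val)  # (B, T, H)

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, H, Dh).transpose(1, 2)                        # (B, H, T, Dh)
        k = k.view(B, T, H, Dh).transpose(1, 2)
        v = v.view(B, T, H, Dh).transpose(1, 2)

        # Dense attention here for clarity; a real kernel would only compute
        # the query/head pairs whose routing weight is nonzero.
        attn = F.scaled_dot_product_attention(q, k, v)                 # (B, H, T, Dh)

        # Zero out heads a token was not routed to, weight the active ones.
        attn = attn * mask.permute(0, 2, 1).unsqueeze(-1)
        return self.out(attn.transpose(1, 2).reshape(B, T, D))


# Example: each token attends through 2 of 8 heads.
layer = SparseHeadRouter(d_model=256, num_heads=8, num_active_heads=2)
y = layer(torch.randn(1, 16, 256))
print(y.shape)  # torch.Size([1, 16, 256])
```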

Key Takeaways

  • Dynamic token routing to specialized attention heads
  • 10x context length scaling with sub-linear compute growth
  • Strong results on document QA and long-form reasoning
  • Compatible with existing pre-trained models via fine-tuning