Publication
Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining
Rupert Mitchell; Kristian Kersting
In: Computing Research Repository (CoRR), Vol. abs/2509.10406, Pages 1-14, 2025.
Abstract
We present Multipole Semantic Attention (MuSe), an efficient approximation of softmax attention
that combines semantic clustering with multipole expansions from computational physics. Our
method addresses the quadratic computational complexity of transformers in the context length by
clustering queries and keys separately in their learned representation spaces, enabling a hierarchical
two-stage attention mechanism. Unlike prior clustering approaches that group only keys or use
unified clustering, we maintain separate clusterings that respect attention’s asymmetric treatment
of these spaces. We augment centroid-based (monopole) approximations with dipole corrections
that capture directional variance within clusters, preserving richer information during training. The
method operates as a drop-in replacement for standard attention, requiring only hyperparameter
specification without architectural modifications. Our approach achieves O(NCD) complexity for
acausal attention with C clusters and O(NCD log N) for causal attention. On isolated attention
layers, we demonstrate a 3× speedup over cuDNN Flash Attention at 8k context length, with relative
squared errors below 20%. For causal attention, we develop a hierarchical block decomposition that
combines exact local computation with efficient long-range approximation. In end-to-end pretraining
of a 30M parameter model on book-length texts with 16k context, we achieve 12.2% runtime reduction
with only 0.36% loss degradation, establishing the viability of multipole approximations for efficient
transformer pretraining.
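
The clustering-plus-multipole idea described in the abstract can be illustrated with a short, self-contained sketch. This is a simplified reading, not the authors' implementation: only the keys are clustered here (the paper clusters queries and keys separately), clustering is plain k-means, and the softmax weight exp(q·k) is expanded to first order around each cluster centroid, giving a per-cluster monopole term (summed values) plus a dipole correction (value-weighted key deviations). All function and variable names are hypothetical.

```python
# Illustrative sketch of clustered monopole + dipole attention (acausal).
# Not the authors' code; a first-order Taylor expansion of exp(q.k) around
# each cluster centroid, under the assumptions stated above.
import torch

def kmeans(x, n_clusters, iters=10):
    """Plain k-means over the rows of x; returns (centroids, assignments)."""
    centroids = x[torch.randperm(x.shape[0])[:n_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)          # (N,)
        for c in range(n_clusters):
            members = x[assign == c]
            if members.numel():
                centroids[c] = members.mean(dim=0)
    return centroids, assign

def clustered_monopole_dipole_attention(q, k, v, n_clusters=8):
    """Approximate acausal softmax attention. q: (Nq, d), k: (Nk, d), v: (Nk, dv)."""
    q = q * q.shape[-1] ** -0.5
    centroids, assign = kmeans(k, n_clusters)
    logits = q @ centroids.t()                                    # (Nq, C) query-centroid scores
    logits = logits - logits.max(dim=1, keepdim=True).values      # shared shift cancels in the ratio
    numer = torch.zeros(q.shape[0], v.shape[-1], dtype=q.dtype, device=q.device)
    denom = torch.zeros(q.shape[0], dtype=q.dtype, device=q.device)
    for c in range(n_clusters):
        mask = assign == c
        if not mask.any():
            continue
        dk = k[mask] - centroids[c]        # key deviations from the centroid
        vc = v[mask]
        v_mono = vc.sum(dim=0)             # monopole: total value mass of the cluster
        k_dip = dk.sum(dim=0)              # key dipole about the centroid
        kv_dip = dk.t() @ vc               # key-value dipole, shape (d, dv)
        w = torch.exp(logits[:, c])        # base weight of this cluster per query
        numer += w[:, None] * (v_mono + q @ kv_dip)
        denom += w * (mask.sum() + q @ k_dip)
    return numer / denom[:, None]
```

Because each centroid is (approximately) the mean of its cluster's keys, the key-dipole term in the denominator is near zero; the dipole correction enters mainly through the key-value dipole in the numerator, which is where the directional variance within a cluster matters.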

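For the causal case, the abstract describes a hierarchical block decomposition that combines exact local computation with an approximated long-range part. The sketch below is a deliberately simplified reading of that idea: a single head, one level of hierarchy, and monopole-only summaries of strictly earlier blocks, with no clustering or dipole terms. The function name and the block size are hypothetical.

```python
# Illustrative sketch of causal attention with exact local blocks and
# monopole summaries of earlier blocks. Not the paper's exact algorithm.
import math
import torch

def blockwise_causal_monopole_attention(q, k, v, block_size=128):
    """Single-head causal attention. q, k: (N, d); v: (N, dv). Returns (N, dv)."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.empty(n, v.shape[-1], dtype=q.dtype, device=q.device)
    n_blocks = (n + block_size - 1) // block_size
    for b in range(n_blocks):
        s, e = b * block_size, min((b + 1) * block_size, n)
        qb = q[s:e] * scale

        # Exact local part: causally masked attention within the current block.
        local_logits = qb @ k[s:e].t()
        idx = torch.arange(e - s, device=q.device)
        causal = idx[:, None] >= idx[None, :]          # key index <= query index
        local_logits = local_logits.masked_fill(~causal, float("-inf"))

        # Approximate long-range part: each earlier block contributes one
        # pseudo-key (its key centroid, weighted by block length via a log-count
        # offset) and one pseudo-value (its mean value) -- a pure monopole summary.
        far_logits, far_values = [], []
        for pb in range(b):
            ps, pe = pb * block_size, (pb + 1) * block_size
            mu = k[ps:pe].mean(dim=0)
            far_logits.append(qb @ mu + math.log(pe - ps))
            far_values.append(v[ps:pe].mean(dim=0))

        if far_logits:
            logits = torch.cat([local_logits, torch.stack(far_logits, dim=1)], dim=1)
            values = torch.cat([v[s:e], torch.stack(far_values, dim=0)], dim=0)
        else:
            logits, values = local_logits, v[s:e]
        out[s:e] = torch.softmax(logits, dim=1) @ values
    return out
```

Folding each earlier block into a pseudo-key with a log-count offset keeps the approximation inside a single softmax, so the monopole weight n_c·exp(q·mu_c) competes directly with the exact local scores and the normalization stays consistent across the exact and approximated parts.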