Publication
Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
Jonas Knupp; Jan Hendrik Metzen; Jeremias Bohn; Georg Groh; Kristian Kersting
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2601.21582, Pages 1-13, arXiv, 2026.
Abstract
Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8× fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2× larger SOTA models with the same training tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11× larger expert selection diversity than SOTA MoEs.
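The abstract's core mechanism, attention along the depth dimension, can be illustrated with a short sketch. The code below is not the paper's implementation: every name (DreamerBlockSketch, run_recurrent), shape convention, and the use of standard multi-head attention are assumptions, and the sparse expert attention component is omitted. It only shows how a weight-shared block, applied recurrently, can let each token attend over its own hidden states from earlier depth steps rather than pass everything through a single constant-size hidden vector.

```python
# Minimal sketch of depth recurrence with sequence + depth attention.
# NOT the authors' architecture; all module/shape choices are assumptions.
import torch
import torch.nn as nn


class DreamerBlockSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Sequence attention: standard self-attention across token positions.
        self.seq_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Depth attention: attention across each token's history of depth states.
        self.depth_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, depth_history: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) current hidden states
        # depth_history: (batch * seq, steps_so_far, d_model) past states per token
        b, s, d = h.shape
        x = self.norm1(h)
        h = h + self.seq_attn(x, x, x)[0]
        # Each token queries its own stack of earlier-depth states, so later
        # steps can retrieve information without it all surviving in one vector.
        q = self.norm2(h).reshape(b * s, 1, d)
        h = h + self.depth_attn(q, depth_history, depth_history)[0].reshape(b, s, d)
        return h


def run_recurrent(block: DreamerBlockSketch, x: torch.Tensor, n_steps: int) -> torch.Tensor:
    # Apply the same block n_steps times (parameter sharing across depth),
    # appending each step's output to the per-token depth history.
    b, s, d = x.shape
    history = x.reshape(b * s, 1, d)
    h = x
    for _ in range(n_steps):
        h = block(h, history)
        history = torch.cat([history, h.reshape(b * s, 1, d)], dim=1)
    return h


if __name__ == "__main__":
    block = DreamerBlockSketch(d_model=64, n_heads=4)
    x = torch.randn(2, 10, 64)                 # (batch, seq, d_model)
    out = run_recurrent(block, x, n_steps=4)
    print(out.shape)                           # torch.Size([2, 10, 64])
```

Under this reading, increasing the number of recurrent steps grows the depth history that each token can attend over, which is one plausible way "attention along depth" could alleviate the constant hidden-size bottleneck the abstract describes.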
