Publication
Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
Jonas Knupp; Jan Hendrik Metzen; Jeremias Bohn; Georg Groh; Kristian Kersting
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2601.21582, Pages 1-13, arXiv, 2026.
Abstract
Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8× fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2× larger SOTA models with the same training tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11× larger expert selection diversity than SOTA MoEs.
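The abstract's core mechanism, attention along the depth dimension, can be illustrated with a short sketch. The code below is not the paper's implementation: every name (DreamerBlockSketch, run_recurrent), shape convention, and the use of standard multi-head attention are assumptions, and the sparse expert attention component is omitted. It only shows how a weight-shared block, applied recurrently, can let each token attend over its own hidden states from earlier depth steps rather than pass everything through a single constant-size hidden vector.

```python
# Minimal sketch of depth recurrence with sequence + depth attention.
# NOT the authors' architecture; all module/shape choices are assumptions.
import torch
import torch.nn as nn


class DreamerBlockSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Sequence attention: standard self-attention across token positions.
        self.seq_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Depth attention: attention across each token's history of depth states.
        self.depth_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, depth_history: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) current hidden states
        # depth_history: (batch * seq, steps_so_far, d_model) past states per token
        b, s, d = h.shape
        x = self.norm1(h)
        h = h + self.seq_attn(x, x, x)[0]
        # Each token queries its own stack of earlier-depth states, so later
        # steps can retrieve information without it all surviving in one vector.
        q = self.norm2(h).reshape(b * s, 1, d)
        h = h + self.depth_attn(q, depth_history, depth_history)[0].reshape(b, s, d)
        return h


def run_recurrent(block: DreamerBlockSketch, x: torch.Tensor, n_steps: int) -> torch.Tensor:
    # Apply the same block n_steps times (parameter sharing across depth),
    # appending each step's output to the per-token depth history.
    b, s, d = x.shape
    history = x.reshape(b * s, 1, d)
    h = x
    for _ in range(n_steps):
        h = block(h, history)
        history = torch.cat([history, h.reshape(b * s, 1, d)], dim=1)
    return h


if __name__ == "__main__":
    block = DreamerBlockSketch(d_model=64, n_heads=4)
    x = torch.randn(2, 10, 64)                 # (batch, seq, d_model)
    out = run_recurrent(block, x, n_steps=4)
    print(out.shape)                           # torch.Size([2, 10, 64])
```

Under this reading, increasing the number of recurrent steps grows the depth history that each token can attend over, which is one plausible way "attention along depth" could alleviate the constant hidden-size bottleneck the abstract describes.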
