Publication
XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning
Daniel Palenicek; Florian Vogt; Joe Watson; Ingmar Posner; Jan Peters
In: Computing Research Repository (CoRR), Vol. abs/2509.25174, Pages 1-24, arXiv, 2025.
Abstract
Sample efficiency is a central property of effective deep reinforcement learning algorithms. Recent work has improved this through added complexity, such as larger models, exotic network architectures, and more complex algorithms, which are typically motivated purely by empirical performance. We take a more principled approach by focusing on the optimization landscape of the critic network. Using the eigenspectrum and condition number of the critic's Hessian, we systematically investigate the impact of common architectural design decisions on training dynamics. Our analysis reveals that a novel combination of batch normalization (BN), weight normalization (WN), and a distributional cross-entropy (CE) loss produces condition numbers orders of magnitude smaller than baselines. This combination also naturally bounds gradient norms, a property critical for maintaining a stable effective learning rate under non-stationary targets and bootstrapping. Based on these insights, we introduce XQC: a well-motivated, sample-efficient deep actor-critic algorithm built upon soft actor-critic that embodies these optimization-aware principles. We achieve state-of-the-art sample efficiency across 55 proprioception and 15 vision-based continuous control tasks, all while using significantly fewer parameters than competing methods.
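
As an illustration of the kind of diagnostic the abstract refers to, the sketch below estimates the eigenspectrum and a condition-number proxy of a loss Hessian for a deliberately tiny critic. The network sizes, the toy regression data, and the numerical floor on eigenvalue magnitudes are illustrative assumptions, not the paper's setup; computing the full Hessian is only tractable at this toy scale.

import torch

torch.manual_seed(0)

# Toy regression data standing in for (state, action) -> bootstrapped target pairs.
x = torch.randn(256, 8)
y = torch.randn(256, 1)

# A tiny two-layer critic written with explicit tensors, so the loss is a direct
# function of one flat parameter vector and the full Hessian stays tractable.
shapes = [(16, 8), (16,), (1, 16), (1,)]
n_params = sum(torch.Size(s).numel() for s in shapes)

def unflatten(theta):
    parts, offset = [], 0
    for s in shapes:
        n = torch.Size(s).numel()
        parts.append(theta[offset:offset + n].reshape(s))
        offset += n
    return parts

def loss_fn(theta):
    w1, b1, w2, b2 = unflatten(theta)
    h = torch.relu(x @ w1.T + b1)
    pred = h @ w2.T + b2
    return ((pred - y) ** 2).mean()

theta0 = 0.1 * torch.randn(n_params)

# Full Hessian of the loss at theta0 (161 x 161 here), then its eigenspectrum.
H = torch.autograd.functional.hessian(loss_fn, theta0)
eigs = torch.linalg.eigvalsh(H)

# Condition-number proxy: largest eigenvalue magnitude over the smallest one
# above a small numerical floor (the floor is an arbitrary choice here).
mags = eigs.abs()
kappa = mags.max() / mags[mags > 1e-8].min()
print(f"eigenvalues in [{eigs.min():.3e}, {eigs.max():.3e}], condition number ~ {kappa:.2e}")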

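The following sketch shows one plausible way to combine batch normalization, weight normalization, and a distributional cross-entropy loss in a critic, assuming a recent PyTorch installation, a categorical value head over fixed bins, and a two-hot target projection. Layer sizes, the bin range, and the projection scheme are assumptions for illustration and may differ from XQC's actual design.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.parametrizations import weight_norm

class DistributionalCritic(nn.Module):
    """Q-network whose scalar value is represented as a categorical
    distribution over fixed bins (hypothetical sizes and bin range)."""

    def __init__(self, obs_dim, act_dim, hidden=256, n_bins=101, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.register_buffer("bins", torch.linspace(v_min, v_max, n_bins))
        layers, in_dim = [], obs_dim + act_dim
        for _ in range(2):
            layers += [
                weight_norm(nn.Linear(in_dim, hidden)),  # weight normalization
                nn.BatchNorm1d(hidden),                  # batch normalization
                nn.ReLU(),
            ]
            in_dim = hidden
        layers.append(weight_norm(nn.Linear(in_dim, n_bins)))  # logits over value bins
        self.net = nn.Sequential(*layers)

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # (batch, n_bins) logits

    def q_value(self, obs, act):
        probs = F.softmax(self.forward(obs, act), dim=-1)
        return (probs * self.bins).sum(-1)  # expected value under the categorical head

def two_hot(targets, bins):
    """Project scalar targets onto the two neighbouring bins (a common
    distributional-RL projection; the paper's exact scheme may differ)."""
    targets = targets.clamp(bins[0], bins[-1])
    idx = torch.searchsorted(bins, targets).clamp(1, len(bins) - 1)
    lo, hi = bins[idx - 1], bins[idx]
    w_hi = (targets - lo) / (hi - lo)
    dist = torch.zeros(targets.shape[0], len(bins))
    dist.scatter_(1, (idx - 1).unsqueeze(1), (1 - w_hi).unsqueeze(1))
    dist.scatter_(1, idx.unsqueeze(1), w_hi.unsqueeze(1))
    return dist

# Cross-entropy between predicted logits and the projected target distribution.
critic = DistributionalCritic(obs_dim=17, act_dim=6)
obs, act = torch.randn(32, 17), torch.randn(32, 6)
td_targets = torch.randn(32)  # placeholder for bootstrapped TD targets
logits = critic(obs, act)
loss = F.cross_entropy(logits, two_hot(td_targets, critic.bins))

Because the cross-entropy target is a bounded probability distribution over fixed bins, the gradient of the loss with respect to the logits stays bounded per sample, which lines up with the abstract's point about keeping the effective learning rate stable under non-stationary, bootstrapped targets.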