Publication
TD-GRPC: Temporal Difference Learning with Group Relative Policy Constraint for Humanoid Locomotion
Khang Nguyen; Khai Nguyen; An T. Le; Jan Peters; Manfred Huber; Ngo Anh Vien; Minh Nhat Vu
In: Computing Research Repository (CoRR), Vol. abs/2505.13549, Pages 1-8, arXiv, 2025.
Abstract
Robot learning in high-dimensional control settings, such as humanoid locomotion, poses persistent challenges for reinforcement learning (RL) algorithms due to unstable dynamics, complex contact interactions, and sensitivity to distributional shifts during training. Model-based methods such as Temporal-Difference Model Predictive Control (TD-MPC) have demonstrated promising results by combining short-horizon planning with value-based learning, enabling efficient solutions for basic locomotion tasks. However, these approaches remain ineffective at addressing the policy mismatch and instability introduced by off-policy updates. In this work, we therefore introduce Temporal-Difference Group Relative Policy Constraint (TD-GRPC), an extension of the TD-MPC framework that unifies Group Relative Policy Optimization (GRPO) with explicit Policy Constraints (PC). TD-GRPC applies a trust-region constraint in the latent policy space to maintain consistency between the planning priors and learned rollouts, while leveraging group-relative ranking to assess and preserve the physical feasibility of candidate trajectories. Unlike prior methods, TD-GRPC achieves robust motions without modifying the underlying planner, enabling flexible planning and policy learning. We validate our method on a suite of locomotion tasks ranging from basic walking to highly dynamic movements on the 26-DoF Unitree H1-2 humanoid robot. Simulation results show that TD-GRPC improves stability, policy robustness, and sampling efficiency when training complex humanoid control tasks.
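
For readers who want a concrete picture of the two ingredients named in the abstract, the sketch below combines a GRPO-style group-relative advantage (each sampled trajectory's return z-scored within its group, as commonly formulated for GRPO) with a KL trust-region penalty that keeps a Gaussian latent policy close to the planning prior. This is a minimal illustration under assumed forms: the function names, the diagonal-Gaussian parameterization, and the penalty weight `beta` are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: group-relative advantages plus a trust-region
# penalty in a latent policy space. Assumes diagonal-Gaussian policies;
# the paper's actual objective and parameterization may differ.
import numpy as np

def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style advantages: z-score each trajectory's return within its
    sampled group, so the update depends only on relative ranking."""
    returns = np.asarray(returns, dtype=np.float64)
    return (returns - returns.mean()) / (returns.std() + eps)

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, used here as the trust-region
    distance between the learned latent policy and the planning prior."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return np.sum(
        np.log(std_q / std_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
        - 0.5
    )

def td_grpc_style_objective(log_probs, returns,
                            mu_pi, std_pi, mu_prior, std_prior, beta=0.1):
    """Surrogate loss: a group-relative policy-gradient term plus a KL
    penalty keeping the latent policy near the planner's prior
    (hypothetical form of the explicit policy constraint)."""
    adv = group_relative_advantages(returns)
    pg_term = -np.mean(adv * np.asarray(log_probs))  # rank-based improvement
    trust_region = beta * gaussian_kl(mu_pi, std_pi, mu_prior, std_prior)
    return pg_term + trust_region

# Toy usage with a group of G = 4 sampled trajectories in an 8-D latent space
loss = td_grpc_style_objective(
    log_probs=[-1.2, -0.9, -1.5, -1.1],
    returns=[3.0, 5.0, 1.0, 4.0],
    mu_pi=np.zeros(8), std_pi=np.ones(8),
    mu_prior=np.zeros(8), std_prior=np.ones(8),
)
```

The z-scoring makes each trajectory's contribution depend only on its rank within the sampled group, while the KL term stands in for the explicit policy constraint that keeps learned rollouts consistent with the planning prior.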
