
Publication

TD-GRPC: Temporal Difference Learning with Group Relative Policy Constraint for Humanoid Locomotion

Khang Nguyen; Khai Nguyen; An T. Le; Jan Peters; Manfred Huber; Ngo Anh Vien; Minh Nhat Vu
In: Computing Research Repository (CoRR), Vol. abs/2505.13549, Pages 1-8, arXiv, 2025.

Abstract

Robot learning in high-dimensional control settings, such as humanoid locomotion, presents persistent challenges for reinforcement learning (RL) algorithms due to unstable dynamics, complex contact interactions, and sensitivity to distributional shifts during training. Model-based methods, e.g., Temporal-Difference Model Predictive Control (TD-MPC), have demonstrated promising results by combining short-horizon planning with value-based learning, enabling efficient solutions for basic locomotion tasks. However, these approaches remain ineffective in addressing policy mismatch and instability introduced by off-policy updates. Thus, in this work, we introduce Temporal-Difference Group Relative Policy Constraint (TD-GRPC), an extension of the TD-MPC framework that unifies Group Relative Policy Optimization (GRPO) with explicit Policy Constraints (PC). TD-GRPC applies a trust-region constraint in the latent policy space to maintain consistency between the planning priors and learned rollouts, while leveraging group-relative ranking to assess and preserve the physical feasibility of candidate trajectories. Unlike prior methods, TD-GRPC achieves robust motions without modifying the underlying planner, enabling flexible planning and policy learning. We validate our method across a locomotion task suite ranging from basic walking to highly dynamic movements on the 26-DoF Unitree H1-2 humanoid robot. In simulation, TD-GRPC demonstrates improved stability, policy robustness, and sampling efficiency when training for complex humanoid control tasks.
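
To illustrate the core idea described above, the sketch below shows, in a minimal NumPy-only form, how group-relative scoring of candidate rollouts can be combined with a trust-region-style penalty that keeps a learned policy close to a planning prior. This is not the authors' implementation: the function names (group_relative_advantages, constrained_policy_objective), the penalty coefficient kl_coef, and the Gaussian-sampled toy data are all illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(returns):
    """GRPO-style group-relative advantages: normalize each candidate
    rollout's return against the mean/std of its sampling group."""
    returns = np.asarray(returns, dtype=np.float64)
    return (returns - returns.mean()) / (returns.std() + 1e-8)

def constrained_policy_objective(logp_new, logp_prior, advantages, kl_coef=0.1):
    """Toy surrogate objective: advantage-weighted importance ratios minus a
    crude KL-style penalty toward the planning prior (a stand-in for a
    trust-region constraint in the latent policy space)."""
    logp_new = np.asarray(logp_new)
    logp_prior = np.asarray(logp_prior)
    ratio = np.exp(logp_new - logp_prior)              # importance ratios
    surrogate = (ratio * advantages).mean()            # group-relative term
    approx_kl = (logp_prior - logp_new).mean()         # KL(prior || new) estimate
    return surrogate - kl_coef * approx_kl

# Usage: 8 candidate latent rollouts sampled from the planner prior.
rng = np.random.default_rng(0)
returns = rng.normal(loc=5.0, scale=2.0, size=8)       # toy rollout returns
adv = group_relative_advantages(returns)
logp_new = rng.normal(-1.0, 0.1, size=8)               # learned policy log-probs
logp_prior = rng.normal(-1.1, 0.1, size=8)             # planner prior log-probs
print(constrained_policy_objective(logp_new, logp_prior, adv))
```

In this sketch the penalty term discourages the learned policy from drifting away from the planner's sampling distribution, while the group-relative advantages rank candidates only against their own group rather than an absolute baseline.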

Further Links