Publication
Embodied Arena: A Comprehensive, Unified, and Evolving Evaluation Platform for Embodied AI
Dominik Pfeiffer; Min Zhang; Pengyi Li; Yifu Yuan; Lingfeng Zhang; Yuecheng Liu; Peilong Han; Longxin Kou; Shaojin Ma; Jinbin Qiao; David Gamaliel Arcos Bravo; Yuening Wang; Xiao Hu; Zhanguang Zhang; Xianze Yao; Yutong Li; Han Zhao Zhang; Ying Wen; Ying-Cong Chen; Xiaodan Liang; Liang Lin; Robin Scheid; Haitham Bou-Ammar; He Wang; Huazhe Xu; Jiankang Deng; Shan Luo; Shuqiang Jiang; Wei Pan; Yang Gao; Stefanos Zafeiriou; Jan Peters; Yuzheng Zhuang; Yingxue Zhang; Yan Zheng; Hongyao Tang; Jianye Hao
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2509.15273, Pages 1-32, arXiv, 2025.
Abstract
Embodied AI has shown great promise in empowering AI models to perceive, interact with, and ultimately change
the physical world. Yet, in parallel with the rapid development of large foundation models, Embodied AI has
largely fallen behind. At the center of Embodied AI, three essential challenges have emerged and become
increasingly pressing: (1) the community lacks a systematic understanding of the core capabilities needed for
Embodied AI, leaving research without clear objectives; (2) despite various proposed benchmarks for Embodied AI,
there is no unified and standardized evaluation system, making cross-benchmark evaluation and comparison
infeasible; (3) unlike large language models (LLMs), which are powered by web-scale data, automated and scalable
acquisition methods for embodied data remain underdeveloped, posing a critical bottleneck for scaling the
evaluation and training of Embodied AI models. To overcome these three obstacles, this paper presents
Embodied Arena, a comprehensive, unified, and evolving evaluation platform and leaderboards for Embodied AI.
First, Embodied Arena is established upon a systematic embodied capability taxonomy spanning three levels
(i.e., perception, reasoning, task execution), seven core embodied capabilities, and 25 fine-grained dimensions.
The taxonomy consolidates and refines the partial categorizations of prior works, enabling unified evaluation
and offering systematic objectives for frontier research. Second, Embodied Arena closes the critical gap
in standardized evaluation by introducing a unified embodied evaluation system. The system is built upon a
unified evaluation infrastructure that supports flexible integration of advanced benchmarks and models, and
currently covers 22 diverse benchmarks across three domains (2D/3D Embodied Q&A, Navigation, and Task Planning)
and 30+ advanced models from 20+ institutes worldwide. Third, Embodied Arena is powered by a novel
LLM-driven automated generation pipeline that ensures the scalability of embodied evaluation data and allows
the platform to keep evolving in diversity and comprehensiveness. Through these three major components, Embodied
Arena addresses the three essential challenges in turn. Moreover, Embodied Arena provides professional
support for additional advanced models and embodied benchmarks to join, along with frequent maintenance and
updates. Through comprehensive evaluation of the growing model population on evolving evaluation data,
Embodied Arena publishes three types of leaderboards (i.e., Embodied Q&A, Embodied Navigation, and Embodied
Task Planning) with two orthogonal views (i.e., the benchmark view and the capability view), offering a real-time
overview of the embodied capabilities of advanced models. In particular, we present nine findings summarized
from the evaluation results on the Embodied Arena leaderboards, which help establish clear research directions
and pinpoint critical research problems, thereby driving progress in the field of Embodied AI.
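As a rough illustration of the three-level taxonomy described in the abstract (levels, core capabilities, fine-grained dimensions), the structure could be represented as a nested mapping. All capability and dimension names below are hypothetical placeholders, not the paper's actual taxonomy, which defines seven capabilities and 25 dimensions:

```python
# Minimal sketch of a three-level capability taxonomy:
# level -> core capability -> fine-grained dimensions.
# The entries here are illustrative placeholders only; the paper defines
# the actual seven capabilities and 25 dimensions.
taxonomy = {
    "perception": {
        "object_perception": ["attribute_recognition", "spatial_localization"],
    },
    "reasoning": {
        "spatial_reasoning": ["relative_position", "route_inference"],
    },
    "task_execution": {
        "task_planning": ["step_decomposition", "replanning"],
    },
}

def dimensions(tax):
    """Flatten the taxonomy into (level, capability, dimension) triples,
    e.g. for building a capability-view leaderboard index."""
    return [
        (level, cap, dim)
        for level, caps in tax.items()
        for cap, dims in caps.items()
        for dim in dims
    ]

print(len(dimensions(taxonomy)))  # 6 placeholder dimensions in this sketch
```

A capability-view leaderboard could then aggregate per-dimension scores upward along these triples, while a benchmark view keeps scores grouped by their source benchmark.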
