Publication

Findings of the WMT25 Shared Task on Automated Translation Evaluation Systems: Linguistic Diversity is Challenging and References Still Help

Alon Lavie; Greg Hanneman; Sweta Agrawal; Diptesh Kanojia; Chi-Kiu Lo; Vilém Zouhar; Frederic Blain; Chrysoula Zerva; Eleftherios Avramidis; Sourabh Deoghare; Archchana Sindhujan; Jiayi Wang; David Ifeoluwa Adelani; Brian Thompson; Tom Kocmi; Markus Freitag; Daniel Deutsch
In: Barry Haddow; Tom Kocmi; Philipp Koehn; Christof Monz (eds.). Proceedings of the Tenth Conference on Machine Translation. Conference on Machine Translation (WMT-25), located at EMNLP2025, November 8-9, Suzhou, China, Pages 436-483, ISBN 979-8-89176-341-8, Association for Computational Linguistics, 11/2025.

Abstract

The WMT25 Shared Task on Automated Translation Evaluation Systems evaluates metrics and quality estimation systems that assess the quality of language translation systems. This task unifies and consolidates the separate WMT shared tasks on Machine Translation Evaluation Metrics and Quality Estimation from previous years. Our primary goal is to encourage the development and assessment of new state-of-the-art translation quality evaluation systems. The shared task this year consisted of three subtasks: (1) segment-level quality score prediction, (2) span-level translation error annotation, and (3) quality-informed segment-level error correction. The evaluation data for the shared task were provided by the General MT shared task and were complemented by "challenge sets" from both the organizers and participants. Task 1 results indicate the strong performance of large LLMs at the system level, while reference-based baseline metrics outperform LLMs at the segment level. Task 2 results indicate that accurate error detection and balancing precision and recall are persistent challenges. Task 3 results show that minimal editing is challenging even when informed by quality indicators. Robustness across the broad diversity of languages remains a major challenge across all three subtasks.
