

Chantal Amrhein
January 29, 2024
Machine translation meta-evaluation through translation accuracy challenge sets
Disclaimer: This article was written in 2024 and describes the situation before Textshuttle’s merger with Supertext and the subsequent relaunch at supertext.com.
Recent machine translation (MT) metrics are validated by how well their scores correlate with human judgement. However, these results are often obtained by averaging predictions across large test sets, which gives no insight into the strengths and weaknesses of the metrics across different error types. Challenge sets are used to probe specific dimensions of metric behaviour, but very few such datasets exist, and they focus either on a limited number of phenomena or on a limited number of language pairs.
We introduce ACES, a contrastive challenge set spanning 146 language pairs, aimed at discovering whether metrics can identify 68 translation accuracy errors. These phenomena range from basic alterations at the word/character level to more intricate errors based on discourse and real-world knowledge. We conduct a large-scale study by benchmarking 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks on ACES. We compare the metrics' performance, assess how much they improve over successive campaigns, and measure their sensitivity to a range of linguistic phenomena. We also investigate claims that large language models (LLMs) are effective as MT evaluators, addressing the limitations of previous studies by providing a more holistic evaluation that covers a range of linguistic phenomena and language pairs and includes both low- and medium-resource languages.
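To make the contrastive setup concrete, the sketch below scores a metric on pairs of a good and an incorrect translation and counts how often the good one is ranked higher. It is only an illustration under stated assumptions: the `ContrastiveExample` fields, the Kendall-tau-like statistic, the toy `token_overlap` metric and the example sentences are all hypothetical and are not taken from the ACES data or code.

```python
from typing import Callable, Iterable, NamedTuple


class ContrastiveExample(NamedTuple):
    """One challenge-set item (hypothetical field names)."""
    source: str
    reference: str
    good: str        # acceptable translation
    incorrect: str   # translation containing an accuracy error


# Assumed metric signature: (source, hypothesis, reference) -> score
Metric = Callable[[str, str, str], float]


def contrastive_tau(metric: Metric, examples: Iterable[ContrastiveExample]) -> float:
    """Kendall-tau-like score in [-1, 1]: (concordant - discordant) / total.

    A pair is concordant when the metric scores the good translation
    strictly higher than the incorrect one; ties count as discordant.
    """
    concordant = discordant = 0
    for ex in examples:
        good_score = metric(ex.source, ex.good, ex.reference)
        bad_score = metric(ex.source, ex.incorrect, ex.reference)
        if good_score > bad_score:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        raise ValueError("no examples to evaluate")
    return (concordant - discordant) / total


def token_overlap(source: str, hypothesis: str, reference: str) -> float:
    """Toy surface-overlap metric; note that it ignores the source."""
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    return len(hyp & ref) / len(ref) if ref else 0.0


# Invented examples for illustration only.
EXAMPLES = [
    ContrastiveExample(
        source="Der Zug fährt um 9 Uhr ab.",
        reference="The train leaves at 9 o'clock.",
        good="The train departs at nine.",
        incorrect="The train leaves at 8 o'clock.",  # wrong number, high overlap
    ),
    ContrastiveExample(
        source="Sie hat zwei Katzen.",
        reference="She has two cats.",
        good="She has two cats.",
        incorrect="He owns three dogs.",
    ),
]

if __name__ == "__main__":
    print(contrastive_tau(token_overlap, EXAMPLES))
```

In this toy run the overlap metric ranks the second pair correctly but prefers the incorrect translation in the first pair, because the wrong number changes only one token while the valid paraphrase changes several. Computing such a score separately per phenomenon, rather than averaging over everything, is what lets a challenge set expose this kind of weakness.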
Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods fail to demonstrate reliable performance. Our analyses indicate that most metrics ignore the source sentence, tend to prefer surface-level overlap, and end up inheriting properties of their base models that are not always beneficial. To further encourage detailed evaluation beyond single scores, we expand ACES to include error span annotations, denoted as SPAN-ACES, and use this dataset to evaluate span-based error metrics, showing that these metrics also need considerable improvement.
Finally, we provide a set of recommendations for building better MT metrics, including focusing on error labels instead of scores, ensembling, designing strategies to explicitly focus on the source sentence, focusing on semantic content rather than relying on lexical overlap, and choosing the right base model for representations.


