

Florian Schottmann
November 28, 2023
A benchmark for evaluating machine translation metrics for dialects without standard orthography
Disclaimer: This article was written in 2023 and describes the situation before Textshuttle’s merger with Supertext and the subsequent relaunch at supertext.com.
To make progress in natural language processing, it is important to be aware of the limitations of the evaluation metrics used. In this work, we evaluate how robust metrics are for non-standardised dialects, i.e. language varieties that do not have a standard orthography. To investigate this, we collect a dataset of human translations and human judgments of automatic machine translations from English into two Swiss German dialects. We further create a challenge set for dialect variation and benchmark existing metrics' performance. Our results show that existing metrics cannot reliably evaluate Swiss German text generation outputs, especially on the segment level. We propose initial design adaptations that increase robustness in the face of non-standardised dialects, although there remains much room for further improvement. The dataset, code and models are available on GitHub.
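To give a toy illustration of why surface-overlap metrics struggle with non-standardised spelling, the sketch below implements a simplified character n-gram F-score in the spirit of chrF. This is not the paper's actual evaluation setup, and the spelling pair is only illustrative: two plausible Swiss German spellings of the same word ("öppis" vs. "öpis") receive a score well below 1.0, even though both are valid.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string, ignoring spaces."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_fscore(hyp, ref, n=3, beta=2.0):
    """Simplified character n-gram F-score (chrF-like, recall-weighted)."""
    h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
    if not h or not r:
        return 0.0
    overlap = sum((h & r).values())  # clipped n-gram matches
    prec = overlap / sum(h.values())
    rec = overlap / sum(r.values())
    if prec + rec == 0:
        return 0.0
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

# An exact match scores 1.0, but a mere spelling variant of the
# same dialect word is penalised heavily:
print(char_fscore("öppis", "öppis"))  # 1.0
print(char_fscore("öpis", "öppis"))   # well below 1.0
```

A metric that is robust to dialect variation would need to score such spelling variants close to identically, which is one way of framing the design adaptations explored in the paper.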