AI paper index
Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
One-line summary
An AI research paper on Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite.
Engineering notes
Engineering notes will be added by the aipentium editorial team.
Chinese explanation / 中文解读
中文解读待补充:本站会优先为大语言模型、生成式AI、ChatGPT相关技术、计算机视觉、深度学习等高价值论文补充中文说明。
Original abstract
Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.
Links and sources
Need this topic turned into a technical roadmap?
aipentium can prepare a custom AI literature review, code map, dataset map, and B2B technology assessment.
Request B2B AI research
Comments