Abstract
Despite recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed with n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages, built by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R), to create state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).