AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

  • 2024-04-11 18:38:09
  • Jiayi Wang, David Ifeoluwa Adelani, Sweta Agrawal, Marek Masiak, Ricardo Rei, Eleftheria Briakou, Marine Carpuat, Xuanli He, Sofia Bourhim, Andiswa Bukula, Muhidin Mohamed, Temitayo Olatoye, Tosin Adewumi, Hamam Mokayede, Christine Mwase, Wangui Kimotho, Foutse Yuehgoh, Anuoluwapo Aremu, Jessica Ojo, Shamsuddeen Hassan Muhammad, Salomey Osei, Abdul-Hakeem Omotayo, Chiamaka Chukwuneke, Perez Ogayo, Oumaima Hourrane, Salma El Anigri, Lolwethu Ndolela, Thabiso Mangwana, Shafie Abdi Mohamed, Ayinde Hassan, Oluwabusayo Olufunke Awoyomi, Lama Alkhaled, Sana Al-Azzawi, Naome A. Etori, Millicent Ochieng, Clemencia Siro, Samuel Njoroge, Eric Muchiri, Wangari Kimotho, Lyse Naomi Wamba Momo, Daud Abolade, Simbiat Ajao, Iyanuoluwa Shode, Ricky Macharm, Ruqayya Nasir Iro, Saheed S. Abdullahi, Stephen E

Abstract

Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
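
For readers unfamiliar with how a metric is scored against human judgments, the following is a minimal sketch of a Spearman-rank correlation between metric scores and human DA scores. It is an illustration only, not the authors' evaluation code: the function names `average_ranks` and `spearman` and the toy score lists are assumptions introduced here, and ties are handled with average ranks, which is one common convention.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: one metric score and one human DA score per translation.
metric_scores = [0.2, 0.5, 0.9, 0.4]
human_da_scores = [30.0, 60.0, 80.0, 70.0]
rho = spearman(metric_scores, human_da_scores)
```

A value of `rho` near 1 means the metric orders translations much as human raters do; the 0.441 reported in the abstract is a correlation of this kind, measured on the paper's own evaluation data.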

 
