Compass: Large Multilingual Language Model for South-east Asia

  • 2024-04-14 12:48:33
  • Sophia Maria

Abstract

Large language models have exhibited significant proficiency in languages endowed with extensive linguistic resources, such as English and Chinese. Nevertheless, their effectiveness notably diminishes when applied to languages characterized by limited linguistic resources, particularly within the Southeast Asian linguistic landscape, such as Indonesian. The scarcity of linguistic resources for these languages presents challenges associated with inadequate training, restricted vocabulary coverage, and challenging evaluation processes. In response to these exigencies, we have introduced CompassLLM, a large multilingual model specifically tailored for Southeast Asian languages, with the primary aim of supporting the developmental requirements of Shopee. Our methodology encompasses several key strategies. To progressively enhance multilingual proficiencies, we implemented a multi-stage pre-training strategy integrated with curriculum learning, gradually intensifying the focus on low-resource languages. Concurrently, to better accommodate human instructions in low-resource languages, we curated and generated a repository of high-quality multilingual human instructions, culminating in the CompassLLM-SFT model through supervised instruction fine-tuning. Finally, to reinforce the model's alignment with human preference behaviors, we have embraced the principle of Direct Preference Optimization (DPO) to obtain the CompassLLM-DPO model. Preliminary evaluation of the CompassLLM model yields promising results, with our model surpassing benchmark models like Vicuna-7b-v1.5, Sealion, Falcon and SeaLLM across diverse evaluation tasks, as verified through both automated and human-driven assessments. Notably, our model exhibits superior performance in Southeast Asian languages, such as Indonesian.

 
