Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

Abstract

The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary usage being in enriching large collections of data with metadata necessary for downstream research. We investigate the best way to ensure the existence of such encoder models for a set of very closely related languages (Croatian, Serbian, Bosnian and Montenegrin) by setting up a diverse benchmark for these languages and comparing the trained-from-scratch models with new models constructed via additional pretraining of existing multilingual models. We show that performance comparable to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models, even with a limited amount of computation. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.
