Language Imbalance Can Boost Cross-lingual Generalisation

Abstract

Multilinguality is crucial for extending recent advancements in language modelling to diverse linguistic communities. To maintain high performance while representing multiple languages, multilingual models ideally align representations, allowing what is learned in one language to generalise to others. Prior research has emphasised the importance of parallel data and shared vocabulary elements as key factors for such alignment. In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance. In controlled experiments on perfectly equivalent cloned languages, we observe that the existence of a predominant language during training boosts the performance of less frequent languages and leads to stronger alignment of model representations across languages. Furthermore, we find that this trend is amplified with scale: with large enough models or long enough training, we observe that bilingual training data with a 90/10 language split yields better performance on both languages than a balanced 50/50 split. Building on these insights, we design training schemes that can improve performance in all cloned languages, even without altering the training data. As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation there is not conclusive.
