Measuring Cross-lingual Transfer in Bytes

Abstract

Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for many languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors such as language contamination and syntactic similarity. An emerging line of research suggests that the representations learned by language models contain two components: a language-specific component and a language-agnostic component. The latter is responsible for transferring more universal knowledge. However, these properties have not been comprehensively explored across diverse target languages. To investigate this hypothesis, we conducted an experiment inspired by the work on Scaling Laws for Transfer. We measured the amount of data transferred from a source language to a target language and found that models initialized from diverse source languages perform similarly on a target language in a cross-lingual setting. This was surprising because the amount of data transferred to 10 diverse target languages, such as Spanish, Korean, and Finnish, was quite similar. We also found evidence that this transfer is not related to language contamination or language proximity, which strengthens the hypothesis that the model also relies on language-agnostic knowledge. Our experiments open up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining.
