Learning Mutually Informed Representations for Characters and Subwords

Abstract

Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them output useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models. Inspired by vision-language models, our model treats characters and subwords as separate modalities, and it generates mutually informed representations for both granularities as output. We evaluate our model on text classification, named entity recognition, POS tagging, and character-level sequence labeling (intraword code-switching). Notably, the entanglement model outperforms its backbone language models, particularly in the presence of noisy texts and low-resource languages. Furthermore, the entanglement model even outperforms larger pretrained models on all English sequence labeling tasks and classification tasks. We make our code publicly available.
