Can Perplexity Predict Fine-Tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali

Abstract

Recent language models use subwording mechanisms to handle Out-of-Vocabulary (OOV) words seen at test time, and their generation capacity is generally measured using perplexity, an intrinsic metric. It is known that increasing the subword granularity results in a decrease in perplexity. However, studies of how subwording affects the understanding capacity of language models have been few and limited to a handful of languages. To reduce this gap, we used 6 different tokenization schemes to pretrain relatively small language models in Nepali and used the learned representations to finetune on several downstream tasks. Although the byte-level BPE algorithm has been used in recent models such as GPT and RoBERTa, we show that on average it is sub-optimal in comparison to algorithms such as SentencePiece in finetuning performance for Nepali. Additionally, similar recent studies have focused on BERT-based language models. We, however, pretrain and finetune sequential transformer-based language models.
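
As a rough illustration of why per-token perplexity is sensitive to subword granularity, the sketch below assumes the standard definition of perplexity (the exponential of the negative mean log-likelihood per token). The numbers are hypothetical and are not taken from the paper's experiments; holding the total log-likelihood fixed across segmentations is a simplifying assumption made only to isolate the effect of the token count.

```python
import math


def perplexity(total_log_likelihood: float, num_tokens: int) -> float:
    """Per-token perplexity: exp of the negative mean log-likelihood (natural log)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(-total_log_likelihood / num_tokens)


# Hypothetical values: one sentence scored under two segmentations.
# Assumption: both models assign the same total log-likelihood to the sentence.
total_ll = -24.0
coarse_tokens = 6   # e.g. a coarser, word-like segmentation
fine_tokens = 12    # e.g. a finer subword segmentation

ppl_coarse = perplexity(total_ll, coarse_tokens)
ppl_fine = perplexity(total_ll, fine_tokens)

print(f"coarse: {ppl_coarse:.2f}, fine: {ppl_fine:.2f}")
```

Under these assumptions the finer segmentation reports the lower per-token perplexity even though the sentence is scored identically overall, which is one reason perplexity values from different tokenization schemes may not be directly comparable.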
