The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics

Abstract

Deploying large language models (LLMs) encounters challenges due to intensive computational and memory requirements. Our research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency. While such modifications have been proven effective in tasks like machine translation, tailoring them to LLMs demands specific modifications given the diverse nature of LLM applications. We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different LLM families and sizes. The methods are straightforward, interpretable, and easy to implement. It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed. Yet, we reveal the limitations of these methods in that they do not perform consistently well for each language with diminishing returns in larger models.
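
The two heuristics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`in_scripts`, `trim_by_script`, `trim_by_corpus`, `remap`), the toy vocabulary, and the choice to identify a character's script by the prefix of its Unicode name are all assumptions made for illustration. The sketch also assumes tokens are already decoded to plain strings; byte-level tokenizers would need a decoding step first.

```python
import unicodedata
from collections import Counter


def in_scripts(token, scripts):
    """Return True if every alphabetic character in `token` belongs to an allowed script.

    Assumption: the script is read off the prefix of the Unicode character
    name (e.g. "LATIN", "CYRILLIC"). Digits, punctuation and whitespace are
    treated as script-neutral and always allowed.
    """
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if not any(name.startswith(s) for s in scripts):
            return False
    return True


def trim_by_script(vocab, scripts):
    """Unicode-based script filtering: keep ids of tokens written in the allowed scripts.

    `vocab` maps token string -> token id.
    """
    return sorted(i for tok, i in vocab.items() if in_scripts(tok, scripts))


def trim_by_corpus(token_id_sequences, top_k):
    """Corpus-based selection: keep the `top_k` most frequent token ids
    observed in a tokenized corpus of the language of interest."""
    counts = Counter(i for seq in token_id_sequences for i in seq)
    return sorted(i for i, _ in counts.most_common(top_k))


def remap(kept_ids):
    """Map old token ids to contiguous new ids, as needed to slice embedding rows."""
    return {old: new for new, old in enumerate(kept_ids)}


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"hello": 0, "мир": 1, "42": 2, "世界": 3, "héllo": 4}
kept = trim_by_script(toy_vocab, ("LATIN",))
id_map = remap(kept)
```

In a sketch like this, the kept ids would then be used to select the matching rows of the embedding matrix and output projection, which is where the memory saving would come from.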
