SpaceByte: Towards Deleting Tokenization from Large Language Modeling

Abstract

Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased character-level modeling performance, and increased modeling complexity. To address these disadvantages without sacrificing performance, we propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling. SpaceByte consists of a byte-level Transformer model, but with extra larger transformer blocks inserted in the middle of the layers. We find that performance is significantly improved by applying these larger blocks only after certain bytes, such as space characters, which typically denote word boundaries. Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures.
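
As a rough illustration of the idea of applying the larger blocks only after certain bytes, the following Python sketch lists the byte positions after which such blocks would run. It assumes the simplest possible rule, namely that only the ASCII space byte counts as a boundary; the function name `boundary_positions` and its `boundary_bytes` parameter are hypothetical, and the exact rule used by SpaceByte is not given in the abstract.

```python
def boundary_positions(text: str, boundary_bytes: bytes = b" ") -> list[int]:
    """Return indices of bytes after which the larger blocks would be applied.

    Assumption: a byte is a boundary if it appears in `boundary_bytes`
    (by default only the space character). This is an illustrative rule,
    not necessarily the one used in the paper.
    """
    data = text.encode("utf-8")  # the model operates on raw bytes, not tokens
    return [i for i, b in enumerate(data) if b in boundary_bytes]


def boundary_fraction(text: str, boundary_bytes: bytes = b" ") -> float:
    """Fraction of byte positions that would trigger the larger blocks."""
    data = text.encode("utf-8")
    if not data:
        return 0.0
    return len(boundary_positions(text, boundary_bytes)) / len(data)
```

For the input `"the cat sat"`, `boundary_positions` returns `[3, 7]`, so under this assumed rule the larger blocks would run at 2 of the 11 byte positions while the smaller byte-level blocks run at every position.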
