Brainformers: Trading Simplicity for Efficiency

Abstract

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.
