Brainformers: Trading Simplicity for Efficiency

  • 2024-04-25 06:46:01
  • Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean

Abstract

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer substantially outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.

 
