IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators

  • 2024-04-08 15:02:41
  • Indraneil Paul, Goran Glavaš, Iryna Gurevych

Abstract

Code understanding and generation have fast become some of the most popular applications of language models (LMs). Nonetheless, research on multilingual aspects of Code-LMs (i.e., LMs for code generation) such as cross-lingual transfer between different programming languages, language-specific data augmentation, and post-hoc LM adaptation, alongside exploitation of data sources other than the original textual content, has been much sparser than for their natural language counterparts. In particular, most mainstream Code-LMs have been pre-trained on source code files alone. In this work, we investigate the prospect of leveraging readily available compiler intermediate representations (IR) - shared across programming languages - to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer. To this end, we first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files coupled with respective intermediate representations. Next, starting from various base Code-LMs (ranging in size from 1.1B to 7.3B parameters), we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to (1) learn the IR language and (2) align the IR constructs with respective constructs of various programming languages. Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics, including prompt robustness, multilingual code completion, code understanding, and instruction following.
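
To make the idea of a parallel source-IR training sample concrete, the following is a minimal sketch of how one source file and its intermediate representation might be joined into a single text sequence for continued causal language modelling. It is an illustration under stated assumptions, not the authors' pipeline: the `PairedSample` type, the `to_training_text` helper, and the `<|ir|>` separator are hypothetical names, and the IR is assumed to be LLVM-style text, which the abstract does not specify.

```python
from dataclasses import dataclass

# Hypothetical separator marking where the IR starts; the abstract does not
# say how source and IR are delimited in SLTrans.
SEP = "\n<|ir|>\n"


@dataclass
class PairedSample:
    """One self-contained source file coupled with its IR (illustrative type)."""
    language: str
    source: str
    ir: str


def to_training_text(sample: PairedSample, sep: str = SEP) -> str:
    """Concatenate source and IR into one sequence for causal LM training.

    Placing both in a single sequence means the model predicts IR tokens
    conditioned on the source, which is one simple way to encourage alignment.
    """
    return f"{sample.source.rstrip()}{sep}{sample.ir.rstrip()}\n"


# Assumed LLVM-style IR for a tiny C function, written by hand for illustration.
example = PairedSample(
    language="c",
    source="int add(int a, int b) { return a + b; }",
    ir=(
        "define i32 @add(i32 %a, i32 %b) {\n"
        "  %sum = add i32 %a, %b\n"
        "  ret i32 %sum\n"
        "}"
    ),
)

text = to_training_text(example)
```

Because the IR side has the same form regardless of the source language, samples from different languages that compile to similar IR would share a common target, which is the intuition behind using IR for cross-lingual transfer.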

 
