IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators

Abstract

Code understanding and generation have fast become some of the most popular applications of language models (LMs). Nonetheless, research on multilingual aspects of Code-LMs (i.e., LMs for code generation) such as cross-lingual transfer between different programming languages, language-specific data augmentation, and post-hoc LM adaptation, alongside exploitation of data sources other than the original textual content, has been much sparser than for their natural language counterparts. In particular, most mainstream Code-LMs have been pre-trained on source code files alone. In this work, we investigate the prospect of leveraging readily available compiler intermediate representations (IR) - shared across programming languages - to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer. To this end, we first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files coupled with respective intermediate representations. Next, starting from various base Code-LMs (ranging in size from 1.1B to 7.3B parameters), we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to (1) learn the IR language and (2) align the IR constructs with respective constructs of various programming languages. Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics, including prompt robustness, multilingual code completion, code understanding, and instruction following.
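
As a rough illustration of how a source file and its IR might be joined into a single sequence for continued causal language modelling, consider the minimal sketch below. It is not the procedure used to build SLTrans: the separator token `SEP`, the `Pair` container, the `to_training_text` helper, the random ordering of the two halves, and the hand-written LLVM-style IR snippet are all assumptions made for this example.

```python
import random
from dataclasses import dataclass

# Hypothetical separator token; the abstract does not say how pairs are delimited.
SEP = "<|ir_sep|>"


@dataclass
class Pair:
    language: str  # programming language of the source file, e.g. "c"
    source: str    # self-contained source code
    ir: str        # intermediate representation compiled from that source


def to_training_text(pair: Pair, rng: random.Random) -> str:
    """Join source and IR into one text sequence for causal LM training.

    The order of the two halves is picked at random (an assumption of this
    sketch) so that a model would see both source-then-IR and IR-then-source.
    """
    if rng.random() < 0.5:
        first, second = pair.source, pair.ir
    else:
        first, second = pair.ir, pair.source
    return f"{first}\n{SEP}\n{second}"


# Illustrative pair; the IR text is hand-written for the example.
pair = Pair(
    language="c",
    source="int add(int a, int b) { return a + b; }",
    ir=(
        "define i32 @add(i32 %0, i32 %1) {\n"
        "  %3 = add nsw i32 %1, %0\n"
        "  ret i32 %3\n"
        "}"
    ),
)

text = to_training_text(pair, random.Random(0))
print(text)
```

Training a causal LM on sequences of this shape would expose it to the IR itself and to the correspondence between IR constructs and source constructs, which are the two objectives named in the abstract.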
