Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT

Abstract

Background: Software Vulnerability (SV) prediction in emerging languages is increasingly important to ensure software security in modern systems. However, these languages usually have limited SV data for developing high-performing prediction models. Aims: We conduct an empirical study to evaluate the impact of SV data scarcity in emerging languages on the state-of-the-art SV prediction model and investigate potential solutions to enhance its performance. Method: We train and test the state-of-the-art model based on CodeBERT, with and without data sampling techniques, for function-level and line-level SV prediction in three low-resource languages: Kotlin, Swift, and Rust. We also assess the effectiveness of ChatGPT for low-resource SV prediction given its recent success in other domains. Results: Compared to the original work in C/C++ with large data, CodeBERT's performance on function-level and line-level SV prediction declines significantly in low-resource languages, indicating the negative impact of data scarcity. Regarding remediation, data sampling techniques fail to improve CodeBERT, whereas ChatGPT shows promising results, enhancing predictive performance by up to 34.4% at the function level and up to 53.5% at the line level. Conclusion: We have highlighted the challenge and made the first promising step for low-resource SV prediction, paving the way for future research in this direction.
