Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot

Abstract

Many multilingual communities, including numerous in Africa, frequently engage in code-switching during conversations. This behaviour stresses the need for natural language processing technologies adept at processing code-switched text. However, data scarcity, particularly in African languages, poses a significant challenge, as many are low-resourced and under-represented. In this study, we prompted GPT 3.5 to generate Afrikaans--English and Yoruba--English code-switched sentences, enhancing diversity using topic-keyword pairs, linguistic guidelines, and few-shot examples. Our findings indicate that the quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans-English success rate. There is therefore a notable opportunity to refine prompting guidelines to yield sentences suitable for the fine-tuning of language models. We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT and propose leveraging this technology to mitigate data scarcity in low-resourced languages, underscoring the essential role of native speakers in this process.
