Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

Abstract

A song is a combination of singing voice and accompaniment. However, existing works address singing voice synthesis and music generation independently, and little attention has been paid to song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representations for controllable V2A synthesis. To alleviate data scarcity for our research, we build a Chinese song dataset mined from a music website. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.
