Abstract
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures that are not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and use them to build training and evaluation datasets. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.