RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

Abstract

In this paper, we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. Learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that do not have a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from training data, we compute the residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach further dramatically reduces training time, stabilizes training, and produces high-quality enhanced images without artifacts, in both supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.
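
The residual vector described above can be illustrated with a minimal NumPy sketch. It assumes the CLIP image embeddings have already been computed and stored as arrays of shape (number of images, embedding dimension). The function names `l2_normalize`, `residual_vector`, and `residual_score` are hypothetical, the L2 normalization before averaging is an assumption, and the cosine-based score is only one plausible way to measure alignment with the residual direction; the abstract does not specify the guidance loss, so this should not be read as the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis` (assumed preprocessing step)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def residual_vector(well_lit_emb, backlit_emb):
    """Difference between the mean well-lit and mean backlit embeddings."""
    mean_well = l2_normalize(well_lit_emb).mean(axis=0)
    mean_back = l2_normalize(backlit_emb).mean(axis=0)
    return mean_well - mean_back


def residual_score(image_emb, residual):
    """Cosine similarity between one image embedding and the residual vector.

    Illustrative only: higher values mean the embedding points more in the
    "well-lit minus backlit" direction.
    """
    return float(l2_normalize(image_emb) @ l2_normalize(residual))


# Synthetic stand-ins for CLIP embeddings (real ones would come from an image encoder).
rng = np.random.default_rng(0)
direction = np.zeros(8)
direction[0] = 1.0
well_lit = 0.1 * rng.normal(size=(16, 8)) + direction
backlit = 0.1 * rng.normal(size=(16, 8)) - direction

residual = residual_vector(well_lit, backlit)
score_well = residual_score(well_lit[0], residual)
score_back = residual_score(backlit[0], residual)
```

In this toy setup, `score_well` exceeds `score_back`, which matches the intuition of pushing a backlit image towards the space of well-lit images. Because the residual is a single fixed vector, it can be computed once before training the enhancement network.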
