AAPL: Adding Attributes to Prompt Learning for Vision-Language Models

Abstract

Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism called "Adding Attributes to Prompt Learning" (AAPL), we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performance compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
