Training-Free Unsupervised Prompt for Vision-Language Models

Abstract

Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to compute the distance between each test image and each cached sample; this distance serves as the weight of the corresponding cached label when generating similarity-based prediction probabilities. In this way, TFUP achieves surprisingly strong performance, even surpassing training-based methods on multiple classification datasets. Building on TFUP, we propose a training-based approach (TFUP-T) to further boost adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset.
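
The cache-based inference described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact form of the MSM, so the product of a feature-level cosine similarity and a semantic-level (class-probability) cosine similarity, the exponential sharpening `exp(-beta * (1 - sim))`, and the hyperparameters `alpha`, `beta`, and `scale` are all assumptions chosen for illustration. The function names are hypothetical as well. The second helper computes the entropy of the batch-averaged prediction, which is the quantity a marginal distribution entropy loss is built on; how it is weighted or signed in the TFUP-T objective is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cache_residual_logits(test_feats, text_feats, cache_feats, cache_labels,
                          alpha=1.0, beta=5.0, scale=100.0):
    """Zero-shot logits plus a residual term from a feature cache.

    test_feats:   (N, D) image features of test samples
    text_feats:   (C, D) text features, one per class prompt
    cache_feats:  (M, D) image features of the selected cached samples
    cache_labels: (M,)   integer (pseudo-)labels of the cached samples
    alpha, beta, scale are illustrative hyperparameters (assumptions).
    """
    test_feats = l2_normalize(test_feats)
    text_feats = l2_normalize(text_feats)
    cache_feats = l2_normalize(cache_feats)

    # Inherent zero-shot prediction of the VLM: image-text cosine similarity.
    zero_shot = scale * test_feats @ text_feats.T                 # (N, C)

    # Feature-level similarity between test and cached image features.
    feat_sim = test_feats @ cache_feats.T                         # (N, M)

    # Semantic-level similarity: cosine similarity between the class
    # probability vectors of test and cached samples (assumed form).
    test_prob = softmax(zero_shot)
    cache_prob = softmax(scale * cache_feats @ text_feats.T)
    sem_sim = l2_normalize(test_prob) @ l2_normalize(cache_prob).T  # (N, M)

    # Combine both levels and sharpen (assumed combination rule).
    affinity = np.exp(-beta * (1.0 - feat_sim * sem_sim))         # (N, M)

    # Each cached sample votes for its label, weighted by its affinity.
    one_hot = np.eye(text_feats.shape[0])[cache_labels]           # (M, C)
    cache_logits = affinity @ one_hot                             # (N, C)

    # Residual connection: zero-shot logits stay intact, cache term is added.
    return zero_shot + alpha * cache_logits

def marginal_entropy(probs, eps=1e-12):
    """Entropy of the batch-averaged (marginal) class distribution."""
    mean_prob = probs.mean(axis=0)
    return -np.sum(mean_prob * np.log(mean_prob + eps))
```

Setting `alpha=0` recovers the plain zero-shot prediction, which is one way to read the claim that the inherent representation is preserved and only augmented by the cache term.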
