Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Abstract

This paper investigates the performance of the Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regards to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset with lower quality. We also examine how model performance varies with different dataset sizes, suggesting that smaller ViT models are better suited for smaller datasets, while larger models perform better on larger datasets with fixed compute. Additionally, we provide guidance on when to choose a CNN-based architecture or a ViT-based architecture for CLIP training. We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training strategy depends on the available compute resource. Our analysis reveals that CLIP+Data Augmentation can achieve comparable performance to CLIP using only half of the training data. This work provides practical insights into how to effectively train and deploy CLIP models, making them more accessible and affordable for practical use in various applications.
