DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

Abstract

Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNNs), ViT's simple architecture has no informative inductive bias (e.g., locality). Due to this, ViT requires a large amount of data for pre-training. Various data-efficient approaches (e.g., DeiT) have been proposed to train ViT effectively on balanced datasets. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distilling from a CNN via a distillation (DIST) token, using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local, CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to learning low-rank, generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the distillation (DIST) token becomes an expert on the tail classes, and the classifier (CLS) token becomes an expert on the head classes. The experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.
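
The two-token idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the weighting scheme, the loss mixing, or the form of distillation, so the sketch assumes inverse-frequency class weights, an equal 0.5/0.5 mix, and hard-label distillation (the DIST output is trained on the teacher's argmax prediction). All function names here are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, target):
    """Cross-entropy of one sample against an integer class index."""
    return -log_softmax(logits)[target]


def inverse_frequency_weights(class_counts):
    """Assumed weighting: proportional to 1/count, normalized to mean 1."""
    inv = [1.0 / c for c in class_counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]


def two_token_loss(cls_logits, dist_logits, label, teacher_logits, class_weights):
    """Combine a CLS loss on the true label with a re-weighted DIST loss.

    The DIST term uses the teacher's hard prediction as its target and is
    scaled by that class's weight, so tail-class targets contribute more.
    The 0.5/0.5 mix is an assumption made for this sketch.
    """
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)
    loss_dist = class_weights[teacher_label] * cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist


# Toy long-tailed setup: class 0 is the head, class 2 is the tail.
weights = inverse_frequency_weights([5000, 500, 50])
loss = two_token_loss(
    cls_logits=[2.0, 0.5, -1.0],
    dist_logits=[0.1, 0.2, 1.5],
    label=0,
    teacher_logits=[0.3, 0.1, 2.2],
    class_weights=weights,
)
print(weights, loss)
```

In this sketch the CLS output is supervised only by the ground-truth label while the DIST output follows the teacher, which is one way the two tokens could specialize on different parts of the class distribution.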
