Abstract
Zero-shot learning has been extensively investigated in the broader field of visual recognition, attracting significant interest recently. However, the current work on zero-shot learning in document image classification remains scarce. The existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in the visual recognition domain. We provide a comprehensive document image classification analysis in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings to address this gap. Our methodology and evaluation align with the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel 'content module' designed to leverage any generic document-related textual information. The discriminative features extracted by this module are aligned with CLIP's text and image features using a novel 'coupled-contrastive' loss. Our module improves CLIP's ZSL top-1 accuracy by 6.7% and GZSL harmonic mean by 24% on the RVL-CDIP dataset. Our module is lightweight and adds only 3.3% more parameters to CLIP. Our work sets the direction for future research in zero-shot document classification.
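
For readers unfamiliar with the GZSL harmonic mean mentioned above, the following is a minimal sketch of how it is commonly computed in the visual recognition literature: the harmonic mean of the top-1 accuracy on seen classes and on unseen classes. The function name `harmonic_mean` and its parameter names are hypothetical and are not taken from this work; the exact evaluation protocol used here is assumed to follow this standard definition.

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    Both inputs are assumed to be accuracies in [0, 1]. This follows the
    common GZSL definition H = 2 * s * u / (s + u); it is an illustrative
    helper, not code from the paper.
    """
    total = acc_seen + acc_unseen
    if total == 0.0:
        # Both accuracies are zero, so the harmonic mean is defined as zero.
        return 0.0
    return 2.0 * acc_seen * acc_unseen / total


# Illustrative inputs only (not results reported in this work).
balanced = harmonic_mean(0.5, 0.5)
skewed = harmonic_mean(0.8, 0.2)
```

The harmonic mean penalizes a model that does well on seen classes but poorly on unseen ones, which is why it is preferred over a plain average in the generalized setting.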