CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning

Abstract

Zero-shot learning (ZSL) enables the recognition of novel classes by leveraging semantic knowledge transfer from known to unknown categories. This knowledge, typically encapsulated in attribute descriptions, aids in identifying class-specific visual features, thus facilitating visual-semantic alignment and improving ZSL performance. However, real-world challenges such as distribution imbalances and attribute co-occurrence among instances often hinder the discernment of local variances in images, a problem exacerbated by the scarcity of fine-grained, region-specific attribute annotations. Moreover, the variability in visual presentation within categories can also skew attribute-category associations. In response, we propose a bidirectional cross-modal ZSL approach, CREST. It begins by extracting representations for attribute and visual localization and employs Evidential Deep Learning (EDL) to measure underlying epistemic uncertainty, thereby enhancing the model's resilience against hard negatives. CREST incorporates dual learning pathways, focusing on both visual-category and attribute-category alignments, to ensure robust correlation between latent and observable spaces. Moreover, we introduce an uncertainty-informed cross-modal fusion technique to refine visual-attribute inference. Extensive experiments demonstrate our model's effectiveness and unique explainability across multiple datasets. Our code and data are available at: https://github.com/JethroJames/CREST.
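
To make the EDL step mentioned above more concrete, the following is a minimal sketch of how an evidential head is commonly turned into belief masses and a single uncertainty value, using the standard subjective-logic formulation of EDL. It is not taken from the CREST code base: the function names (softplus, edl_uncertainty), the softplus evidence activation, and the unit Dirichlet prior are illustrative assumptions, and the paper's actual formulation may differ.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable softplus, log(1 + exp(x)); keeps evidence non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def edl_uncertainty(logits):
    """Map raw class scores to (belief masses, uncertainty mass).

    Assumed (hypothetical) setup: evidence e_k = softplus(logit_k),
    Dirichlet parameters alpha_k = e_k + 1, strength S = sum(alpha),
    belief b_k = e_k / S, and uncertainty u = K / S.
    """
    num_classes = len(logits)
    evidence = [softplus(z) for z in logits]
    alpha = [e + 1.0 for e in evidence]   # unit prior on every class
    strength = sum(alpha)                 # total Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength  # large when little evidence is collected
    return belief, uncertainty


if __name__ == "__main__":
    # One class with strong evidence versus no evidence for any class.
    print(edl_uncertainty([10.0, -10.0, -10.0]))
    print(edl_uncertainty([-10.0, -10.0, -10.0]))
```

Under these assumptions the belief masses and the uncertainty always sum to one, so an input that yields little evidence for every class receives an uncertainty close to one. This is the kind of signal an uncertainty-informed fusion step could use to down-weight an unreliable modality.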
