Abstract
Large-scale vision-language models have demonstrated impressive skill in handling tasks that involve both vision and language. Nevertheless, these models frequently suffer from a significant problem: they generate inaccurate information, a phenomenon known as hallucination. In this study, we concentrate on a specific type of hallucination, number hallucination, in which a model incorrectly identifies the number of certain objects in an image. We perform quantitative evaluations of number hallucination, showing it to be critical in major open-source large vision-language models. Furthermore, we use two related tasks to conduct an in-depth analysis of number hallucination, revealing severe inner and outer inconsistency among all tasks. Based on this examination, we devise a training approach aimed at improving consistency to reduce number hallucination, which leads to an 8% improvement in performance over direct finetuning methods. Our code and dataset will be released to the community.