Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Abstract

Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, and the robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: https://mbzuai-oryx.github.io/CVRR-Evaluation-Suite/.
