Driver Activity Classification Using Generalizable Representations from Vision-Language Models

Abstract

Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.
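
The late-fusion step described above can be illustrated with a minimal sketch. It is not the SRLF-Net implementation: the abstract does not state the fusion operator, the number of camera views, the embedding size, or the number of classes, so the sketch assumes concatenation followed by a single linear layer and softmax, and all sizes and names (`late_fusion_predict`, `NUM_VIEWS`, `EMBED_DIM`, `NUM_CLASSES`) are hypothetical. Random vectors stand in for the outputs of the pretrained vision-language encoder.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_predict(view_embeddings, weights, bias):
    """Fuse per-view embeddings and return class probabilities.

    Assumption: fusion is plain concatenation followed by one linear layer;
    the source only says the embeddings "are fused".
    """
    # Concatenate the embeddings of all synchronized views into one vector.
    fused = [x for embedding in view_embeddings for x in embedding]
    # One linear layer: a row of weights and a bias term per class.
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical sizes, chosen only to keep the example small.
NUM_VIEWS, EMBED_DIM, NUM_CLASSES = 3, 8, 4
rng = random.Random(0)

# Stand-ins for encoder outputs: one embedding per camera view for a single
# time step. A real pipeline would obtain these from the pretrained encoder.
view_embeddings = [
    [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(NUM_VIEWS)
]

# Untrained fusion-head parameters; in practice these would be learned.
weights = [
    [rng.gauss(0.0, 0.1) for _ in range(NUM_VIEWS * EMBED_DIM)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

probs = late_fusion_predict(view_embeddings, weights, bias)
print(len(probs), round(sum(probs), 6))
```

In this sketch only the fusion head has parameters; the encoder is treated as a fixed feature extractor, which is one plausible reading of "pretrained" but is not confirmed by the abstract.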
