Text-Based Reasoning About Vector Graphics

Abstract

While large multimodal models excel in broad vision-language benchmarks, they often struggle with tasks requiring precise perception of low-level visual details, such as comparing line lengths or solving simple mazes. In particular, this failure mode persists in question-answering tasks about vector graphics -- images composed purely of 2D objects and shapes. To address this challenge, we propose the Visually Descriptive Language Model (VDLM), which performs text-based reasoning about vector graphics. VDLM leverages Scalable Vector Graphics (SVG) for a more precise visual description and first uses an off-the-shelf raster-to-SVG algorithm for encoding. Since existing language models cannot understand raw SVGs in a zero-shot setting, VDLM then bridges SVG with pretrained language models through a newly introduced intermediate symbolic representation, Primal Visual Description (PVD), comprising primitive attributes (e.g., shape, position, measurement) with their corresponding predicted values. PVD is task-agnostic and represents visual primitives that are universal across all vector graphics. It can be learned with procedurally generated (SVG, PVD) pairs and also enables the direct use of LLMs for generalization to complex reasoning tasks. By casting an image to a text-based representation, we can leverage the power of language models to learn alignment from SVG to visual primitives and generalize to unseen question-answering tasks. Empirical results show that VDLM achieves stronger zero-shot performance compared to state-of-the-art LMMs, such as GPT-4V, in various low-level multimodal perception and reasoning tasks on vector graphics. We additionally present extensive analyses on VDLM's performance, demonstrating that our framework offers better interpretability due to its disentangled perception and reasoning processes. Project page: https://mikewangwzhl.github.io/VDLM/
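
To make the idea of text-based reasoning over visual primitives concrete, the following is a minimal sketch and not the VDLM implementation. It assumes a hypothetical PVD-like record format (the field names `type` and `vertices` are illustrative, not the schema defined by the paper), serializes the records to text, and builds a prompt that a language model could answer. The helper names `segment_length` and `build_prompt` are likewise assumptions made for illustration, and no model is actually called.

```python
import json
import math

# Hypothetical PVD-like records. The field names are illustrative
# assumptions, not the schema used by VDLM.
primitives = [
    {"type": "line_segment", "vertices": [[0, 0], [3, 4]]},
    {"type": "line_segment", "vertices": [[10, 10], [10, 16]]},
]


def segment_length(primitive):
    """Euclidean length of a two-vertex line segment record."""
    (x1, y1), (x2, y2) = primitive["vertices"]
    return math.hypot(x2 - x1, y2 - y1)


def build_prompt(primitives, question):
    """Serialize the primitives to JSON text and attach a question."""
    description = json.dumps(primitives)
    return f"Visual description:\n{description}\n\nQuestion: {question}"


# Measurements are recoverable from the textual description alone.
lengths = [segment_length(p) for p in primitives]

# The prompt is what a text-only language model would receive.
prompt = build_prompt(primitives, "Which line segment is longer?")
print(prompt)
```

Because the coordinates are explicit in the text, a question such as comparing line lengths reduces to arithmetic over the description rather than perception of pixels, which is the kind of disentangling of perception from reasoning that the abstract describes.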
