Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

Abstract

Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal, since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a flexible transformer-based model for general-purpose ordering of image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple video settings covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves the state of the art on standard benchmarks for ordering a set of images.
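
As a rough illustration of the ordering proxy task described in the abstract, the sketch below builds one self-supervised example from a temporally ordered sequence: the frames are shuffled, and the original temporal positions serve as the training target. The names make_ordering_example and restore_order are hypothetical and not taken from the paper, frames are stood in for by strings, and the transformer model and its attribution maps are not shown.

```python
import random


def make_ordering_example(frames, rng):
    """Shuffle a temporally ordered list of frames (hypothetical helper).

    Returns (shuffled_frames, target), where target[i] is the original
    temporal position of shuffled_frames[i]. No manual labels are needed:
    the supervision comes from time alone.
    """
    target = list(range(len(frames)))
    rng.shuffle(target)
    shuffled = [frames[t] for t in target]
    return shuffled, target


def restore_order(shuffled, predicted_positions):
    """Sort shuffled frames by their predicted temporal positions."""
    paired = sorted(zip(predicted_positions, shuffled), key=lambda p: p[0])
    return [frame for _, frame in paired]


# Placeholder frames; real inputs would be images from one video.
frames = ["frame_0", "frame_1", "frame_2", "frame_3"]
shuffled, target = make_ordering_example(frames, random.Random(0))

# A perfect predictor would output `target`, which recovers the sequence.
assert restore_order(shuffled, target) == frames
```

Under this setup, a model can only predict the target positions from cues that change monotonically with time, which is why cyclic or stochastic changes carry no usable signal for the task.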
