Abstract
We propose a hybrid framework for consistently producing high-quality object tracks by combining an automated object tracker with minimal human input. The key idea is to tailor a module for each dataset that decides when an object tracker is failing, so that a human can be brought in to re-localize the object for continued tracking. Our approach leverages self-supervised learning on unlabeled videos to learn a tailored representation for a target object, which is then used to actively monitor the tracked region and decide when the tracker fails. Since labeled data is not needed, our approach can be applied to novel object categories. Experiments on three datasets demonstrate that our method outperforms existing approaches, especially for small, fast-moving, or occluded objects.