Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking

Abstract

Data augmentation techniques apply transformations to existing texts to generate additional data. The transformations may produce low-quality texts, where the meaning of the text is changed and the text may even be mangled beyond human comprehension. Analyzing the synthetically generated texts and their corresponding labels is slow and demanding. To winnow out texts with incorrect labels, we develop INSPECTOR, a human-in-the-loop data inspection technique. INSPECTOR combines the strengths of provenance tracking techniques with assistive labeling. INSPECTOR allows users to group related texts by their transformation provenance, i.e., the transformations applied to the original text, or feature provenance, the linguistic features of the original text. For assistive labeling, INSPECTOR computes metrics that approximate data quality, and allows users to compare the corresponding label of each text against the predictions of a large language model. In a user study, INSPECTOR increases the number of texts with correct labels identified by 3X on a sentiment analysis task and by 4X on a hate speech detection task. The participants found grouping the synthetically generated texts by their common transformation to be the most useful technique. Surprisingly, grouping texts by common linguistic features was perceived to be unhelpful. Contrary to prior work, our study finds that no single technique obviates the need for human inspection effort. This validates the design of INSPECTOR which combines both analysis of data provenance and assistive labeling to reduce human inspection effort.
