DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer

Abstract

As OCR technology develops and its application fields expand, text recognition in complex scenes has become a key challenge. Factors such as multiple fonts, mixed scenes and complex layouts seriously reduce the recognition accuracy of traditional OCR models. Although OCR models based on deep learning have performed well in specific fields or on similar datasets in recent years, their generalization ability and robustness remain a major challenge in complex environments with multiple scenes. Furthermore, training an OCR model from scratch or fine-tuning all parameters is very demanding on computing resources and inference time, which limits the flexibility of its application. In response to these challenges, this study focuses on a fundamental aspect of mixed text recognition: effectively fine-tuning a pre-trained base OCR model so that it performs well across various downstream tasks. To this end, we propose a parameter-efficient mixed text recognition method based on a pre-trained OCR Transformer, namely DLoRA-TrOCR. This method embeds DoRA into the image encoder and LoRA into the internal structure of the text decoder, enabling parameter-efficient fine-tuning for downstream tasks. Experimental results show that, compared with similar parameter adjustment methods, our model DLoRA-TrOCR has the smallest number of parameters and performs better. It achieves state-of-the-art performance on complex scene datasets involving simultaneous recognition of mixed handwritten, printed and street view texts.
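To make the two adapter types named above concrete, the following is a minimal sketch of how LoRA and DoRA each reparameterize a single frozen weight matrix. It follows the general published formulations of LoRA (a scaled low-rank update) and DoRA (a learned per-column magnitude applied to a normalized, LoRA-updated direction), not the specific DLoRA-TrOCR implementation. The function names, the rank, the scaling factor `alpha` and the matrix sizes are illustrative assumptions; the abstract does not state them, nor which layers receive adapters.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_norms(w):
    """L2 norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def lora_weight(w0, b, a, alpha, rank):
    """LoRA: W' = W0 + (alpha / rank) * B A, with W0 frozen and B, A trainable."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[x + scale * d for x, d in zip(rw, rd)] for rw, rd in zip(w0, delta)]


def dora_weight(w0, b, a, magnitude, alpha, rank):
    """DoRA: W' = m * V / ||V||_c, where V is the LoRA-updated weight and
    m is a trainable per-column magnitude vector."""
    v = lora_weight(w0, b, a, alpha, rank)
    norms = column_norms(v)
    return [[magnitude[j] * row[j] / norms[j] for j in range(len(row))] for row in v]


def trainable_params(d_out, d_in, rank, use_dora):
    """Adapter parameters for one d_out x d_in layer (hypothetical helper)."""
    n = rank * (d_in + d_out)                # entries of A and B
    return n + d_in if use_dora else n       # DoRA adds one magnitude per column


# Illustrative sizes only (assumed, not taken from the paper).
random.seed(0)
d_out, d_in, rank, alpha = 4, 3, 2, 2.0
w0 = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]     # B starts at zero, so the update is zero
m = column_norms(w0)                         # magnitude starts at the norms of W0

w_lora = lora_weight(w0, b, a, alpha, rank)
w_dora = dora_weight(w0, b, a, m, alpha, rank)
```

With `B` initialized to zero and `m` initialized to the column norms of the frozen weight, both adapted matrices equal the pre-trained weight before training starts, so fine-tuning begins from the behavior of the base model. Only `A`, `B` and (for DoRA) `m` would be updated, which is where the parameter savings over full fine-tuning come from.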
