Spanish TrOCR: leveraging transfer learning for language adaptation
Optical Character Recognition (OCR) technology has revolutionized how we interact with text in digital images. From digitizing printed documents to extracting text from doctors' handwriting, OCR systems have become indispensable tools across a wide range of applications.
At Qantev, we automatically process many types of documents in multiple languages. However, most of the datasets available in the literature are in English, and existing synthetic data generation methods do not consider the specific problems posed by Visual Rich Documents (VRDs).
In this blog post, we explain how to create a synthetic dataset in Spanish that takes into account the elements the model will face when dealing with VRDs. We then fine-tune TrOCR [1] on this dataset and evaluate it on the Spanish subset of the XFUND dataset. You can read more about it in our paper: https://arxiv.org/abs/2407.06950
The Spanish TrOCR models are available on Hugging Face [3]: https://huggingface.co/qantev
The dataset generation method is available on GitHub [4]: https://github.com/v-laurent/VRD-image-text-generator
Synthetic VRD dataset in Spanish:
To train an OCR system, we need a dataset composed of image-text pairs. The available methods to generate this kind of dataset, like trdg [5], are not suited for Visual Rich Documents because their data augmentation techniques do not take into account common artifacts present in these kinds of documents.
In VRDs, we may encounter artifacts such as text written inside boxes and horizontal or vertical lines crossing the text. Therefore, in addition to traditional OCR data augmentations such as random noise, rotation and Gaussian blurring, we also include VRD-specific data augmentation techniques in our synthetic image-text dataset generation method.
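As an illustration, the sketch below shows how such box and line artifacts could be drawn on top of a rendered text line with Pillow. This is not the exact code from our generator [4]; the function name and probabilities are our own for this example.

```python
import random
from PIL import Image, ImageDraw

def add_vrd_artifacts(img: Image.Image) -> Image.Image:
    """Illustrative VRD-style augmentation: draw a surrounding box
    and/or horizontal and vertical lines over a rendered text image."""
    img = img.convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size

    if random.random() < 0.5:  # text written inside a box
        draw.rectangle([2, 2, w - 3, h - 3], outline=(0, 0, 0), width=2)
    if random.random() < 0.3:  # horizontal line crossing the text
        y = random.randint(0, h - 1)
        draw.line([(0, y), (w, y)], fill=(0, 0, 0), width=1)
    if random.random() < 0.3:  # vertical line, e.g. a table separator
        x = random.randint(0, w - 1)
        draw.line([(x, 0), (x, h)], fill=(0, 0, 0), width=1)
    return img
```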
Another artifact that we observed in real-life VRD OCR applications is the presence of text coming from the lines above or below, caused by errors propagated from the text detection algorithm. We observed that, especially on handwritten text, part of the text from the lines above and/or below sometimes remains visible after cropping the detected text. Therefore, we also include this artifact in our dataset so the OCR model can learn how to deal with it.
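A minimal sketch of how this neighbor-line bleed could be simulated (again illustrative, with assumed helper names rather than the exact code from [4]): paste thin strips of the lines above and below onto a slightly taller canvas, mimicking a loose crop from the detection step.

```python
from PIL import Image

def simulate_line_bleed(line_img: Image.Image,
                        above_img: Image.Image,
                        below_img: Image.Image,
                        bleed: int = 6) -> Image.Image:
    """Illustrative augmentation: keep a few pixels of the neighboring
    text lines visible above and below the target line, mimicking the
    loose crops produced by an imperfect text detection step.
    Assumes both neighbor images are at least `bleed` pixels tall."""
    w, h = line_img.size
    canvas = Image.new("RGB", (w, h + 2 * bleed), "white")
    # bottom strip of the line above leaks into the top of the crop
    above = above_img.convert("RGB").resize((w, above_img.height))
    canvas.paste(above.crop((0, above.height - bleed, w, above.height)), (0, 0))
    # the target line itself
    canvas.paste(line_img.convert("RGB"), (0, bleed))
    # top strip of the line below leaks into the bottom of the crop
    below = below_img.convert("RGB").resize((w, below_img.height))
    canvas.paste(below.crop((0, 0, w, bleed)), (0, h + bleed))
    return canvas
```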
Fine-tuning TrOCR in Spanish:
TrOCR, introduced by Li et al. [1], is a very popular end-to-end Transformer OCR model that uses an image Transformer as the encoder and a text Transformer as the decoder. Relying fully on the Transformer architecture makes the model flexible both in the size of the architecture and in the initialization of its weights from pre-trained checkpoints.
The paper proposes three variants of the model: small (62M parameters), base (334M parameters) and large (558M parameters). This range of sizes lets us strike a balance between resource efficiency and parameter richness, and thus the model's capability to capture language nuances and image details. The English pre-trained checkpoints are all available on Hugging Face [6].
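As a reference point, the English Stage-1 checkpoints can be loaded with the transformers library as shown in this short sketch; the checkpoint names are the ones listed on the Hugging Face hub [6].

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# English Stage-1 checkpoints; "small", "base" and "large"
# trade off speed and memory against accuracy.
checkpoint = "microsoft/trocr-base-stage1"  # or trocr-small-stage1 / trocr-large-stage1

processor = TrOCRProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```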
To fine-tune TrOCR, we initialized the model from the English Stage-1 checkpoints. We generated a dataset of 2M images and trained the model for 2 epochs on a single A100 80GB GPU. The batch size and learning rate for every model, along with a more detailed explanation of the training procedure, can be found in our paper [2].
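The training itself is standard sequence-to-sequence fine-tuning of a VisionEncoderDecoderModel. Below is a simplified sketch of the loop; train_dataset is assumed to yield pixel_values/labels pairs built with the TrOCRProcessor, and the batch size and learning rate shown are illustrative, not the values from our paper [2].

```python
import torch
from torch.utils.data import DataLoader
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

# Generation config needed when fine-tuning a VisionEncoderDecoderModel
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.eos_token_id = processor.tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # illustrative value

# `train_dataset` is assumed to yield dicts with "pixel_values" (from
# processor(image).pixel_values) and "labels" (tokenized target text,
# with padding tokens replaced by -100 so they are ignored in the loss).
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

for epoch in range(2):
    for batch in loader:
        outputs = model(pixel_values=batch["pixel_values"].to(device),
                        labels=batch["labels"].to(device))
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```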
Results
To benchmark our model, we compared it against EasyOCR in Spanish [7] and the Microsoft Azure OCR API [8]. EasyOCR is a popular open-source OCR library that supports more than 80 languages. Microsoft Azure OCR is known for its performance and supports more than 100 languages for printed text.
To evaluate the results, we use the XFUND Spanish dataset [9]. XFUND is a multilingual form understanding benchmark that contains annotated printed forms in 7 different languages. We do not further fine-tune the model on XFUND; we evaluate it out of the box, as we believe a good OCR model should perform well on datasets from other domains.
We use two metrics to compare the different models' performance: Character Error Rate (CER) and Word Error Rate (WER). For a more complete description of these metrics, see our paper [2].
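As a quick illustration, both metrics can be computed with the open-source jiwer package (our evaluation script may differ; the strings below are made-up examples):

```python
import jiwer  # pip install jiwer

references = ["Nombre del asegurado", "Fecha de nacimiento"]
predictions = ["Nombre del asegurada", "Fecha de nacimiento"]

# CER: character-level edit distance / number of reference characters
# WER: word-level edit distance / number of reference words
print("CER:", jiwer.cer(references, predictions))
print("WER:", jiwer.wer(references, predictions))
```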
We can see that all three versions of our model show a considerable improvement over EasyOCR, making ours the best open-source Spanish OCR model available at the moment. As expected, Azure showed the best performance among all the tested models.
Conclusion
In this blog post, we presented a recipe to train a TrOCR model in Spanish while taking into account artifacts present in Visual Rich Documents. The training recipe and all the trained models are available as open source.
It is important to note that these models only work on printed, single-line text. They also only handle horizontal text: if you have vertical text, you should rotate the image before passing it to the model.
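For completeness, here is a minimal inference sketch with one of our released checkpoints; the model identifier qantev/trocr-base-spanish is assumed from our Hugging Face page [3], and the small and large variants can be used the same way.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Assumed model id from https://huggingface.co/qantev
model_id = "qantev/trocr-base-spanish"
processor = TrOCRProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

image = Image.open("cropped_text_line.png").convert("RGB")
# image = image.rotate(90, expand=True)  # rotate first if the text is vertical

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```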
For a more detailed explanation of this study, check our arXiv paper [2]: https://arxiv.org/abs/2407.06950
References:
[1] https://arxiv.org/pdf/2109.10282
[2] https://arxiv.org/abs/2407.06950
[3] https://huggingface.co/qantev
[4] https://github.com/v-laurent/VRD-image-text-generator
[5] https://github.com/Belval/TextRecognitionDataGenerator
[6] https://huggingface.co/models?sort=trending&search=microsoft%2Ftrocr
[7] https://github.com/JaidedAI/EasyOCR
[8] https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/overview-ocr
[9] https://aclanthology.org/2022.findings-acl.253.pdf