How to extract the right information from claims and medical documents
This is the second article of the OCR series where we explain how the Document Intelligence Pipeline at Qantev works. The goal of our pipeline is to automate the process of extracting information from the documents received by our insurance clients.
In the first article of this series [1], we described our OCR algorithm “MoustacheOCR”, which is able to read a scanned document in different languages. However, OCR is only the first step in our pipeline: once we know the text and its position inside the document, we still need to extract the relevant information from it. In this article, we will discuss our Information Extraction Pipeline, which retrieves specific pieces of information from a scanned document already processed by our OCR.
How to extract information from a pre-approval document?
After applying the OCR to the scanned document, we need to retrieve the relevant information from it. The medical pre-approval documents that we process contain information such as the patient name, patient address, patient phone number, patient policy number, doctor name, medical diagnosis, proposed treatment…
When you visit a hospital, the doctor fills your information in a medical form that is later sent to the insurance company. Have you ever wondered how the insurer knows your name?
Usually, an operator will review the document looking for a field called something like “Patient Name” next to an empty space where your name will probably be handwritten.
We mimic this approach by first applying MoustacheOCR to recognize all the text in the form, then identifying the “Patient Name” keyword and extracting the text field right after it. Repeating this for every keyword in the document gives us all the information that we need, and the job is done!
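As a rough illustration, the sketch below implements that keyword-anchored rule on top of generic OCR output. The token dictionary format and the `same_line` heuristic are assumptions made for illustration, not our actual post-processing code.

```python
# Minimal sketch of the keyword-anchored rule described above. Each OCR
# token is assumed to be a dict with a transcription and bounding box
# coordinates (x0, y0, x1, y1); the keyword below is purely illustrative.

def same_line(a, b, tol=10):
    """Two boxes are on the same line if their vertical centers are close."""
    return abs((a["y0"] + a["y1"]) / 2 - (b["y0"] + b["y1"]) / 2) <= tol

def extract_by_keyword(tokens, keyword):
    """Return the text of the closest token to the right of a keyword token."""
    for tok in tokens:
        if keyword.lower() in tok["text"].lower():
            # Candidates: tokens on the same line, starting to the right of the keyword.
            candidates = [t for t in tokens if same_line(tok, t) and t["x0"] > tok["x1"]]
            if candidates:
                return min(candidates, key=lambda t: t["x0"])["text"]
    return None

# Example with two fake OCR tokens.
tokens = [
    {"text": "Patient Name:", "x0": 20,  "y0": 100, "x1": 140, "y1": 120},
    {"text": "John Smith",    "x0": 150, "y0": 101, "x1": 240, "y1": 121},
]
print(extract_by_keyword(tokens, "Patient Name"))  # -> "John Smith"
```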
There is only one problem!
The previous scenario would only work in a world where all the medical forms received by health insurers are standardized. In real life we are far from that: documents come in all shapes. How many forms have you filled in where the answer had to go below the question, or where all the questions came one after the other and the answers only started after the last one? Because of this, there is no simple rule to extract the information that we want.
In the early days at Qantev, we opted for a template-based approach, where we developed dozens of rules to cover all of these scenarios. However, every time we encountered a new template or a new hospital, we had to update the rules. Even worse, for each new client, we had to create new rules from scratch to address their templates.
Therefore, at Qantev, we developed an end-to-end Deep Learning algorithm that solves the problem in a more generic way: a single algorithm that can be trained and/or adapted for each client and works across different templates. We approach it as a Named Entity Recognition (NER) problem [2]. NER is a classical NLP task that assigns an entity label to each group of words. In our case, the entities are the pieces of information that we want to extract, like “patient name”, “patient address”, “patient phone number”, “doctor name”…
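To make the NER framing concrete, here is a toy example of the kind of token-level labels such a model produces. The BIO tagging scheme and the label names are illustrative assumptions, not our exact label set.

```python
# Toy illustration of the NER framing: each OCR token gets one entity label.
tokens = ["Patient", "Name", ":", "John", "Smith", "Tel", ":", "555-0199"]
labels = ["O", "O", "O", "B-PATIENT_NAME", "I-PATIENT_NAME",
          "O", "O", "B-PATIENT_PHONE"]

# Entities are recovered by grouping consecutive B-/I- tags of the same type.
def group_entities(tokens, labels):
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            current = (lab[2:], [tok])
            entities.append(current)
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)
        else:
            current = None
    return [(name, " ".join(parts)) for name, parts in entities]

print(group_entities(tokens, labels))
# [('PATIENT_NAME', 'John Smith'), ('PATIENT_PHONE', '555-0199')]
```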
After applying this algorithm, we realized that it had some trouble distinguishing between entities that represent similar information, like the patient name and doctor name, or the patient phone number and the doctor phone number. After seeing that using only the text didn’t work, we went back to the drawing board with a simple question: How does a human analyse a document and retrieve the information?
We humans understand information and take actions using different modalities. You can think of modalities as the different senses through which we experience things: sight, reading, touch, hearing… Now consider our problem: you have a document written in English and you want to retrieve the patient’s name. How do you do it?
First, we need to see the document and understand its structure. Second, we need to read parts of the document to understand its content. Third, we need to locate each piece of information so we can understand the relationships between them.
In the first approach, where we used just the text as input data, the algorithm doesn’t know the relationships between the words because it doesn’t have their positions. If we add the positions alongside the text, we still run into the problem of different templates, since the positional relationships change from one template to another. Therefore, we also need to include the visual modality, so that the algorithm has access to the document’s structure when making its decision.
In the end, we still approach the problem as a Named Entity Recognition, the only difference is that our algorithm uses all of these different modalities to identify the entities, solving the problem of similar entities like “patient name” and “doctor name”, leading to much better results overall.
The model
Now let’s go a little bit deeper into the model’s architecture. As previously mentioned, we need three different modalities in the same model: visual, spatial and textual. So how do we feed all of them into a single model?
Traditional models like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) or even standard Multi-Layer Perceptrons (MLPs) are designed to work with specific types of data. However, more recent architectures like Transformers can handle different modalities. A good example is Data2Vec [3], which applies the same Transformer architecture and self-supervised learning objective to text, audio and images, achieving state-of-the-art or competitive performance in all three modalities.
Thus, we also chose a Transformer architecture for our multimodal problem. We encode each modality into an embedding space, concatenate these embeddings and feed them to a Transformer model. The output of the Transformer is the entity label of each token.
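The sketch below shows one way such a multimodal token classifier can be wired up in PyTorch. The embedding sizes, the linear projections used for the bounding boxes and the visual features, and the number of layers are illustrative assumptions, not our production architecture.

```python
import torch
import torch.nn as nn

class MultimodalNER(nn.Module):
    """Sketch of a multimodal token classifier: textual, spatial and visual
    features are embedded, concatenated per token and fed to a Transformer
    encoder; a linear head predicts one entity label per token."""

    def __init__(self, vocab_size, num_labels, d_model=256, visual_dim=512):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)   # textual modality
        self.box_proj = nn.Linear(4, d_model)                # spatial modality: (x0, y0, x1, y1)
        self.vis_proj = nn.Linear(visual_dim, d_model)       # visual modality: e.g. crop features
        self.fuse = nn.Linear(3 * d_model, d_model)          # merge the concatenated embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(d_model, num_labels)     # entity logits per token

    def forward(self, token_ids, boxes, visual_feats):
        # token_ids: (B, T)   boxes: (B, T, 4)   visual_feats: (B, T, visual_dim)
        x = torch.cat([self.text_emb(token_ids),
                       self.box_proj(boxes),
                       self.vis_proj(visual_feats)], dim=-1)
        x = self.encoder(self.fuse(x))
        return self.classifier(x)                            # (B, T, num_labels)

# Example forward pass with random inputs (batch of 2 documents, 50 tokens each).
model = MultimodalNER(vocab_size=30000, num_labels=9)
logits = model(torch.randint(0, 30000, (2, 50)), torch.rand(2, 50, 4), torch.rand(2, 50, 512))
```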
Here is an example of how our algorithm works in practice. We first feed the document to our OCR, which outputs the location and transcription of all the text in the document. Each location and transcription pair is grouped into a token. We then feed the image, the location and the text of each token into our multimodal Transformer architecture, which outputs the entity of each token. Finally, we fill a CSV table with the retrieved information.
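Continuing the toy example from the NER sketch above, turning the recovered entities into a CSV row boils down to flattening them into one record per document. The column names here are just those toy entity types, not our actual output schema.

```python
import csv

# Toy (entity_type, text) pairs, as produced by group_entities() in the NER sketch.
entities = [("PATIENT_NAME", "John Smith"), ("PATIENT_PHONE", "555-0199")]

# Flatten the entities into one record per document; repeated occurrences of the
# same field are simply appended.
row = {}
for field, value in entities:
    key = field.lower()
    row[key] = (row.get(key, "") + " " + value).strip()

with open("extracted.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=sorted(row))
    writer.writeheader()
    writer.writerow(row)
```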
Conclusion
In this article, we explained how Qantev’s information extraction pipeline works. Our approach aims to simulate the human way of thinking when extracting information from a document, using visual, spatial and textual modalities. A notable trait of our approach is that it is template agnostic, which allows us to easily deploy our solution to different insurers across the world while being able to handle different forms submitted by all the hospitals in their provider network.
References
[1] https://medium.com/p/c26da913933f
[2] https://en.wikipedia.org/wiki/Named-entity_recognition
[3] https://arxiv.org/pdf/2202.03555.pdf