Automated medical coding - Part 2
Introduction
Recap and Context
In the first post of this series, we explored the significance and main challenges of automated medical coding through the lens of supervised deep learning. More specifically, we detailed an approach considered state-of-the-art, PLM-ICD [1], which utilizes transformer encoders pre-trained on biomedical and clinical text [2] and fine-tuned for code classification.
Yet, our AI team at Qantev is acutely aware of the unique challenges presented by rare medical codes. Such codes pose significant obstacles for supervised learning models, even ones as advanced as PLM-ICD, primarily because they appear so infrequently in training datasets.
We value innovation and research, and we are always looking to bring the latest research advances into our products. This brings us to our review of a pivotal study by Boyle et al. [3], which poses an intriguing question: what are the potential benefits of using a generative AI model for medical code inference?
Despite its simplicity, this is not a naïve question: in November 2023, Microsoft published an article [4] comparing models built with special-purpose tuning (Google's Med-PaLM 2, BioGPT) against generalist foundation models (like GPT-4) used out-of-the-box on questions related to medicine. Until then, Google's Med-PaLM 2 held an impressive 86.5% score on the MedQA dataset [5], which contains questions in the style of the US Medical Licensing Examination. With the publication of this new article, the state of the art was pushed to 90%, achieved with GPT-4 combined with a set of innovative prompt engineering techniques called Medprompt.
The Power of Prompt Engineering
This result demonstrates that prompting innovation can unlock deeper specialist capabilities, and it shows that generalist foundation LLMs can top prior leading results on question-answering datasets. Bringing this recent discussion to the realm of medical code inference, we can view PLM-ICD as a special-purpose-tuned model (it depends on a dataset of clinical notes paired with medical codes for its training). This naturally leads us to ponder whether a generalist model could enhance our medical coding solutions.
The short answer to this question is “not necessarily.” However, with the right methodology and procedures, the approach we will review managed to surpass state-of-the-art performance across different metrics [3].
Similar to the methodology used by Microsoft, and without any task-specific training or fine-tuning, the authors relied on a single key strategy: prompt engineering, not just in the content of the prompt, but also through multiple calls that traverse the ontology of medical codes.
To make this discussion concrete, we will now walk through the findings of the paper.
Classic Clinical Coder Strategy
The initial approach was based on the hypothesis that LLMs could be exceptional ICD coders right out of the box. The primary strategy therefore consisted of providing the clinical note together with an instruction to assign the appropriate ICD-10 codes along with their exact descriptions. Throughout the subsequent analyses, three models were used: GPT-3.5, GPT-4, and Llama-2.
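To make this strategy concrete, here is a minimal sketch of what such a prompt might look like when sent to an OpenAI-style chat API; the note text, prompt wording, and model choice are illustrative assumptions, not the exact prompts used in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

clinical_note = (
    "Patient presents with a three-day history of headache, "
    "nasal congestion and sore throat. No fever reported."
)

# Illustrative "classic clinical coder" prompt: hand over the note and ask
# directly for ICD-10 codes together with their exact official descriptions.
prompt = (
    "You are a clinical coding assistant. Read the clinical note below and "
    "list the appropriate ICD-10 codes, one per line, in the format "
    "'<code>: <exact official description>'.\n\n"
    f"Clinical note:\n{clinical_note}"
)

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```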
Although the models displayed the intended behavior, they often produced codes that were inconsistent with their own descriptions, for example returning the code R51 for "Headache, unspecified", when the correct code is actually R51.9. This is intriguing, as it highlights a very direct limitation of these models: they struggle with the exact definitions of medical codes.
Furthermore, it is interesting to observe the results of each model and how they reflect each model's predispositions and knowledge when facing this task. For example, Llama-2 proved better at producing the correct descriptions, though it made mistakes in the codes. GPT-4, on the other hand, tended to identify the codes accurately while making errors in the descriptions.
The presumption is that GPT-4, having been exposed to more structured datasets linking medical codes directly with their respective descriptions, likely developed a stronger proficiency in correlating clinical texts with the correct codes. Conversely, Llama-2, possibly benefiting from a dataset enriched with more detailed and contextual descriptions, demonstrated greater effectiveness in identifying these descriptions accurately, despite occasionally linking them to incorrect codes.
In any case, the mere fact that a model cannot reliably associate codes with their accurate descriptions is not a good sign. A new approach to this problem needs to be considered, one in which we do not assume that the LLM is an expert in medical codes. Instead, we consider that the model is very effective at abstracting information and cross-referencing it against data it has access to. The problem is therefore reframed as information retrieval.
LLM-Guided Tree-Search of Medical Codes Strategy
In this strategy, instead of assuming that all the knowledge of medical coding is intrinsic to the model, we treat it merely as an excellent agent for cross-referencing information. To this end, we provide the model with a reference that can assist in finding the right code: the ICD ontology.
The International Classification of Diseases (ICD) ontology is structured as a hierarchical tree, where relationships between parent and child codes are defined by ‘is a’ semantics, indicating that one condition is a subtype or specific instance of another. For example, ‘Acute Nasopharyngitis’ is classified under ‘Upper Respiratory Infection’. Leveraging this structured hierarchy, one can utilize a large language model (LLM) to navigate this ontology effectively, aiming to pinpoint accurate ICD codes corresponding to clinical narratives.
The process starts at the ontology’s root, with the LLM guiding the exploration by selecting pertinent chapters or branches based on the input query. This selection is similar to navigating decision points, with the model evaluating which branches are relevant and should be pursued further. The procedure iterates recursively, traversing down the tree and refining the search scope at each level, based on the model’s recommendations.
In a hypothetical prompt-response exchange, this methodical traversal allows for a targeted search, culminating when all potential paths have been explored and the relevant ICD codes, such as 'Acute Nasopharyngitis', are identified and compiled into the final set of predictions. Through this strategy, the tree's hierarchical nature is exploited to enhance search efficiency and accuracy in identifying the most relevant diagnostic codes.
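As a rough illustration, the traversal can be sketched as a recursive procedure in which the LLM is asked, at every node, which child branches are relevant to the note. The toy ontology slice and the query_llm helper below are hypothetical placeholders standing in for the full ICD-10 hierarchy and an actual chat-completion call; they are not the authors' implementation.

```python
# Toy slice of the ICD hierarchy: each node maps to its child nodes.
# A real implementation would load the full ICD-10 ontology instead.
ICD_TREE = {
    "ROOT": ["Diseases of the respiratory system (J00-J99)"],
    "Diseases of the respiratory system (J00-J99)": [
        "Acute upper respiratory infections (J00-J06)",
        "Influenza and pneumonia (J09-J18)",
    ],
    "Acute upper respiratory infections (J00-J06)": [
        "J00 Acute nasopharyngitis [common cold]",
        "J02.9 Acute pharyngitis, unspecified",
    ],
}

def query_llm(note: str, candidates: list[str]) -> list[str]:
    """Hypothetical helper: ask the LLM which candidate branches are
    relevant to the clinical note and parse its answer into a list."""
    raise NotImplementedError("wrap your preferred chat-completion API here")

def search_codes(note: str, node: str = "ROOT") -> list[str]:
    children = ICD_TREE.get(node)
    if children is None:  # leaf reached: this is a billable code
        return [node]
    predictions = []
    # The LLM acts as the decision function at each level of the tree,
    # pruning branches that are not supported by the note.
    for child in query_llm(note, children):
        predictions.extend(search_codes(note, child))
    return predictions
```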
This strategy yields far more interesting outcomes than the previous approach, even surpassing some of the metrics achieved by PLM-ICD. Moreover, it is noteworthy that this improvement primarily occurred in the macro metrics, which do not take into account how frequently a medical code appears in the dataset. In other words, using out-of-the-box LLMs combined with the tree-search strategy leads to better classification of rare codes. A plausible explanation for this outcome is that PLM-ICD relies on supervised learning and therefore inherits the imbalance in how often each medical code appears in the data, whereas the out-of-the-box approach is not influenced by this imbalance.
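For readers less familiar with the distinction, the short snippet below (scikit-learn on toy multi-label data) shows why macro metrics are the ones that reward rare-code performance: every code contributes equally to the macro average, whereas the micro average is dominated by frequent codes.

```python
from sklearn.metrics import f1_score

# Toy multi-label indicator matrices: column 0 is a common code, column 1 a rare one.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]  # the single occurrence of the rare code is missed

print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # ~0.86, driven by the common code
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.50, the missed rare code counts fully
```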
Despite these results, one of the major drawbacks of this strategy is the high number of queries needed to explore the ICD ontology, which translates into a high computational cost for inferring the medical codes of each clinical note. Another negative point is that correct code prediction depends on the path taken through the ICD tree: if the model fails to select a relevant high-level category, it never gets the opportunity to predict the correct code further down.
Among the three tested models, the OpenAI models (GPT-3.5 and GPT-4) produced the best results. However, it is equally important to acknowledge the relative improvement of Llama-2 with this new technique, moving from a macro-F1 score of 0.006 to 0.144.
These improvements are significant and, although not perfect, suggest that this type of approach is a solid path along which promising developments can be made.
Conclusion
Technical Considerations
In light of this discussion, there is enormous potential to be unlocked in the out-of-the-box use of foundational models; however, a good understanding of the problem is necessary for the strategy to be well formulated. For a straightforward QA problem, techniques like Medprompt have proven useful not only in the field of medicine but also in electrical engineering, machine learning, philosophy, accounting, law, nursing, and psychology.
However, unlike a QA problem, medical code inference is differentiated by the very high quantity and granularity of existing codes. This fact, coupled with the hierarchical nature of the ICD codes, calls for a solution that is adapted and thoughtfully designed to address these particularities.
The authors also emphasize the fact that there is no consensus regarding best practices for prompting in different foundational models. They could not find a common prompt that both Llama-2 and GPT-3.5/4 would adhere to, which forced them to employ separate prompt templates for each model. This needs to be taken into consideration and might account for the poor performance of Llama-2, compared to OpenAI’s models.
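As a purely hypothetical illustration of why this matters in practice, the same coding instruction often has to be wrapped differently for different model families: Llama-2 chat models expect the [INST]/<<SYS>> markup in a single string, while OpenAI chat models take role-tagged messages.

```python
# Hypothetical illustration of model-specific prompt wrappers: the task text is
# shared, but each model family expects a different surrounding format.
TASK = "Assign the appropriate ICD-10 codes, with their exact descriptions, to the note below."

def gpt_messages(note: str) -> list[dict]:
    # OpenAI chat models take a list of role-tagged messages.
    return [
        {"role": "system", "content": "You are a clinical coding assistant."},
        {"role": "user", "content": f"{TASK}\n\n{note}"},
    ]

def llama2_prompt(note: str) -> str:
    # Llama-2 chat models expect the [INST] / <<SYS>> wrapping in a single string.
    return (
        "[INST] <<SYS>>\nYou are a clinical coding assistant.\n<</SYS>>\n\n"
        f"{TASK}\n\n{note} [/INST]"
    )
```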
Qantev's Endeavors
The insights gained from this recent study significantly influence our strategies and development. We are particularly excited about the possibilities of using foundational Large Language Models (LLMs) enhanced by sophisticated prompt engineering. Our efforts are focused not just on taking advantage of the raw power of models (like Mistral, Llama-2, Meditron, etc.) but also on refining their application through innovative techniques. By integrating Chain-of-Thought approaches and tailored Few-Shot-learning templates, we are developing methods to navigate the complex ICD ontology effectively in a single prompt. This not only conserves computational resources but also dramatically reduces the processing time.
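As a purely illustrative sketch (and not our production template), a single-prompt approach along these lines could combine a chain-of-thought instruction with a few-shot example that walks down the ICD hierarchy; every note, reasoning step, and code below is invented for demonstration.

```python
# Hypothetical single-prompt template combining few-shot guidance with
# chain-of-thought navigation of the ICD-10 hierarchy.
SINGLE_PROMPT_TEMPLATE = """You are an expert clinical coder.
Reason step by step through the ICD-10 hierarchy (chapter -> block -> category -> code)
before committing to the final billable codes.

Example
Note: Three-day history of runny nose and sore throat, afebrile.
Reasoning: Respiratory chapter (J00-J99) -> acute upper respiratory infections (J00-J06)
-> acute nasopharyngitis -> J00.
Codes: J00

Now code the following note in the same format.
Note: {note}
Reasoning:"""

def build_prompt(note: str) -> str:
    return SINGLE_PROMPT_TEMPLATE.format(note=note)
```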
Furthermore, we recognize the limitations of generalist LLMs in handling specialized, hierarchical information inherent in medical coding. To address this, our team is also pioneering work with Supervised Fine-Tuning (SFT) to better equip these models with the nuanced understanding necessary for medical diagnostics. This approach enhances the model’s ability to abstract and interlink ICD information directly related to clinical notes, offering a more refined, accurate tool for medical coding.
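On the data side, SFT typically amounts to turning clinical notes and their gold codes into instruction-style training pairs; the hypothetical formatting helper below sketches what such a record could look like (the exact schema depends on the fine-tuning framework).

```python
def to_sft_example(note: str, gold_codes: list[str]) -> dict:
    """Hypothetical formatting step: pair an instruction built from the clinical
    note with the gold ICD-10 codes as the target completion."""
    return {
        "prompt": (
            "Assign the appropriate ICD-10 codes to the following clinical note.\n"
            f"Note: {note}\nCodes:"
        ),
        "completion": " " + ", ".join(gold_codes),
    }

# Example record that a fine-tuning framework could consume:
print(to_sft_example("Three-day history of headache and nasal congestion.", ["J00", "R51.9"]))
```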
As we continue to explore the capabilities of foundational models, Qantev is committed to enhancing medical coding through AI innovation. We are not just following the trends — we are participating in their progression, ensuring our solutions lead to practical, efficient, and transformative results in health insurance operations.
Bibliography
[1] Huang, C. W., Tsai, S. C., & Chen, Y. N. (2022). PLM-ICD: Automatic ICD coding with pretrained language models. arXiv preprint arXiv:2207.05289.
[2] Lewis, P., Ott, M., Du, J., & Stoyanov, V. (2020). Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art. In Proceedings of the 3rd Clinical Natural Language Processing Workshop (pp. 146–157). Association for Computational Linguistics.
[3] Boyle, J. S., Kascenas, A., Lok, P., Liakata, M., & O’Neil, A. Q. (2023). Automated clinical coding using off-the-shelf large language models. arXiv preprint arXiv:2310.06552.
[4] Nori, H. et al. (2023). Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. arXiv preprint arXiv:2311.16452.
[5] Singhal, K. et al. (2023). Towards Expert-Level Medical Question Answering with Large Language Models. arXiv preprint arXiv:2305.09617.