In a new study, scientists at Beth Israel Deaconess Medical Center (BIDMC) compared a large language model’s clinical reasoning capabilities against those of human physicians. The investigators used the revised-IDEA (r-IDEA) score, a commonly used tool for assessing clinical reasoning.
The study gave a GPT-4-powered chatbot, 21 attending physicians, and 18 resident physicians 20 clinical cases to work through and establish diagnostic reasoning for. All three sets of answers were then evaluated using the r-IDEA score. The investigators found that the chatbot earned the highest r-IDEA scores, an impressive result with regard to diagnostic reasoning. However, the authors also noted that the chatbot was “just plain wrong” more often.
Stephanie Cabral, M.D., the lead author of the study, explained that “further studies are needed to determine how LLMs can best be integrated into clinical practice, but even now, they could be useful as a checkpoint, helping us make sure we don’t miss something.” In sum, the results showed sound reasoning by the chatbot but significant mistakes as well. This further bolsters the idea that these AI-powered systems are best suited (at least at their current maturity levels) as tools to augment a physician’s practice, rather than to replace a physician’s diagnostic capabilities.
As physician leaders and technologists alike often explain, this is because the practice of medicine is not based purely on the algorithmic output of rules, but rather on a deep sense of reasoning and clinical intuition, which is challenging for an LLM to replicate. Nevertheless, tools that provide diagnostic or clinical support can still be an incredibly powerful asset in the physician workflow. For example, if systems can reasonably provide a “first-pass” or initial diagnosis suggestion based on available data such as the patient history or existing records, physicians may be able to save a significant amount of time in their diagnostic process. Furthermore, if these tools can augment a physician’s workflow and improve their ability to process a large amount of clinical information from the medical record, there may be opportunities to increase efficiency.
Many organizations are taking advantage of these potential means of clinical augmentation. For example, AI-powered scribing technologies are leveraging natural language processing to help physicians complete clinical documentation more efficiently. Enterprise search tools are being integrated within organizations and with EMR systems to help physicians search large swaths of data, promote data interoperability, and glean quicker and deeper insights from existing patient data. Other systems may even help offer an initial diagnosis; for example, tools are emerging in the fields of radiology and dermatology that can suggest a potential diagnosis by analyzing an uploaded photo.
Nevertheless, there is still a lot of work that needs to be done in this arena. Simply put, although AI systems such as these are not ready for clinical diagnostics, there may still be an opportunity to leverage this technology to augment clinical workflows, especially while keeping a human in the loop to ensure safe, secure, and accurate processes.