Seven common mistakes made when evaluating healthcare predictive tools
Building and deploying AI predictive tools in healthcare isn’t easy. The data are messy and challenging from the start, and building models that can integrate, adapt, and analyze this type of data requires a deep understanding of the latest AI/ML strategies and the ability to employ them effectively. Recent studies and reporting have shown how hard it is to get this right, and how important it is to be transparent about what’s “under the hood” and how effective any predictive tool actually is.
What makes this even harder is that the industry is still learning how to evaluate these types of solutions. While many entities and groups (such as the FDA) are working diligently on guidelines and regulations for evaluating AI and predictive tools in healthcare, at the moment there’s no governing body defining the right way to conduct predictive tool evaluations, which leaves a gap in understanding what a solution should look like and how it should be measured.
As a result, many are making mistakes when evaluating AI and predictive solutions. These mistakes can lead health systems to choose predictive tools that aren’t effective or appropriate for their populations. As a long-time researcher in the field, I have seen these common mistakes firsthand, and I have also guided health systems on how to overcome them so they end up with a safe, robust, and reliable tool.
Here are the seven most common mistakes made when evaluating an AI/predictive healthcare tool, and how to overcome them to ensure the tool is effective:
- Only the workflow is evaluated, not the models: The models are just as important as the workflow. Look for high-performing models, with both high sensitivity and high precision, before implementing them within a workflow (a minimal sketch of these two metrics follows this list). Skipping model evaluation before implementation and assuming you can achieve efficacy by optimizing workflows alone is like not knowing whether a drug works and changing its label to try to increase its effectiveness.
- The models are evaluated, but with the wrong metrics: The models should be evaluated, but the metrics should be determined by the mechanism of action for each condition area. For example, in sepsis, lead time (the median time an alert fires before antibiotics are administered) is critical. But you also don’t want to alert on too many people, because low-quality alerts that are not actionable lead to provider burnout and over-treatment. The key criteria to look for in a sepsis tool are high sensitivity, significant lead time, and a low false alerting rate (a sketch of these condition-specific metrics also follows this list).
- Adoption isn’t measured on a granular level: Typically, end-user adoption isn’t measured at all. However, to sustain outcome improvements, a framework for measuring adoption at varying levels of granularity, and for improving it, is critical. Look to see whether the tool also comes with an infrastructure that continuously monitors use and provides strategies to increase adoption (a sketch of granular adoption measurement follows this list).
- The impact on outcomes isn’t measured correctly: Many studies rely on coded data to identify cases and measure outcome impact. These are not reliable, because coding is highly dependent on documentation practices, and a surveillance tool often changes documentation itself. In fact, a common flawed design is a pre/post study in which the post period uses a surveillance tool that dramatically increases the number of coded cases, in turn creating the perception that outcomes have improved because the adverse event rate (e.g., sepsis mortality among coded cases) has decreased. Look for rigorous studies of the tool that account for these issues (a worked example of this denominator effect follows this list).
- The ability to detect and tackle shifts isn’t identified: If a model doesn’t proactively tackle the issue of shifts and transportability, it is at risk of being “unsafe.” Strategies to reduce bias and adapt to dataset shift are critical, because practice patterns change frequently (see what happened at one hospital during Covid-19, for example). Look for evidence of high performance across diverse populations to see whether the solution is detecting and tuning appropriately for shifts (a minimal shift-monitoring sketch follows this list; read more about best practices for combating dataset shift in this recent New England Journal of Medicine article).
- “Apples to oranges” outcome studies are compared: A common mistake is to overlook what the standard of care was in the environment where the outcome studies were done. For example, a 10% improvement in outcomes at a high-reliability organization may be just as impressive as, or more impressive than, a similar improvement at an organization with historically poor outcomes. Understanding the populations in which the studies were done and the standard of care in those environments will help you understand how and why the tool worked (a brief numeric illustration follows this list).
- Assuming a team of informaticists can tune any model to success: Keeping models high-performing over time is a significant lift. A common mistake is to assume any model can be made to work in your environment with enough rules and configurations layered on top. A predictive AI tool should come with its own ability to tune, along with an understanding of when and how to tune. Starting with a rudimentary model is akin to being handed the names of molecules and being asked to create the right drug by mixing the ingredients correctly yourself.
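The sketches below illustrate several of the points above. First, a minimal example of evaluating the model itself, assuming you have the model’s risk scores and adjudicated ground-truth labels for a retrospective validation set; the threshold, labels, and scores are hypothetical placeholders, not the output of any particular vendor’s tool.

```python
# Minimal sketch: sensitivity (recall) and precision for a binary predictive
# model on a held-out validation set. The labels, scores, and threshold are
# hypothetical; a real evaluation would use retrospective, clinically
# adjudicated cases at the tool's recommended operating point.
import numpy as np

def sensitivity_precision(y_true, y_score, threshold=0.5):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, precision

# Hypothetical example: 1 = condition present, scores from the model under review
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.91, 0.12, 0.40, 0.77, 0.35, 0.08, 0.55, 0.20, 0.83, 0.30]
sens, prec = sensitivity_precision(y_true, y_score, threshold=0.5)
print(f"sensitivity={sens:.2f}, precision={prec:.2f}")  # both 0.75 here
```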
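Second, a sketch of the condition-specific metrics called out for sepsis: median lead time before antibiotic administration and the false alerting rate. The record structure, timestamps, and adjudication flags below are hypothetical.

```python
# Hypothetical alert records: (first_alert_time, antibiotics_time or None,
# clinically confirmed sepsis?). Real data would come from the EHR plus a
# chart-review or adjudication process.
from datetime import datetime
from statistics import median

alerted_patients = [
    (datetime(2023, 1, 1, 2, 0), datetime(2023, 1, 1, 8, 0),  True),
    (datetime(2023, 1, 2, 4, 0), datetime(2023, 1, 2, 6, 30), True),
    (datetime(2023, 1, 3, 1, 0), None,                        False),  # false alert
    (datetime(2023, 1, 4, 9, 0), datetime(2023, 1, 4, 15, 0), True),
]

# Lead time: how far in advance of antibiotics the alert fired (confirmed cases only)
lead_times_hours = [
    (abx - alert).total_seconds() / 3600
    for alert, abx, confirmed in alerted_patients
    if confirmed and abx is not None
]
false_alerts = sum(1 for _, _, confirmed in alerted_patients if not confirmed)

print(f"median lead time: {median(lead_times_hours):.1f} hours before antibiotics")
print(f"false alerting rate: {false_alerts / len(alerted_patients):.0%} of alerts")
```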
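Third, a sketch of what granular adoption measurement can look like: the same alert log rolled up system-wide, by unit, and by individual clinician. The column names are hypothetical, not a specific tool’s schema.

```python
# Hypothetical alert log; 'acknowledged' records whether the end user acted on
# the alert. Rolling the same data up at different levels of granularity shows
# where adoption is strong and where it needs attention.
import pandas as pd

alert_log = pd.DataFrame({
    "unit":         ["ICU", "ICU", "ED", "ED", "ED", "Med-Surg"],
    "clinician":    ["A",   "B",   "C",  "C",  "D",  "E"],
    "acknowledged": [1,     0,     1,    1,    0,    1],
})

print(alert_log["acknowledged"].mean())                                 # system-wide
print(alert_log.groupby("unit")["acknowledged"].mean())                 # by unit
print(alert_log.groupby(["unit", "clinician"])["acknowledged"].mean())  # by clinician
```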
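Fourth, a worked, entirely hypothetical example of the coded-data denominator effect: if a surveillance tool increases how many cases get coded, the coded mortality rate can fall even when the number of deaths does not change.

```python
# Hypothetical pre/post comparison with the same number of deaths in each period.
deaths = 50

pre_coded_cases = 250    # before the tool: mostly the sickest cases are coded
post_coded_cases = 400   # after the tool: more, milder cases get documented and coded

pre_rate = deaths / pre_coded_cases     # 20.0% coded mortality
post_rate = deaths / post_coded_cases   # 12.5% coded mortality

print(f"pre: {pre_rate:.1%}  post: {post_rate:.1%}  (same {deaths} deaths in both periods)")
```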
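Fifth, a minimal sketch of one common way to watch for dataset shift: comparing a model input’s recent distribution against a reference window using the population stability index (PSI). The feature, threshold, and simulated data are illustrative only; real monitoring would track many features, model scores, and outcome rates over time.

```python
# Population stability index (PSI) between a reference (training-era) sample
# and a recent sample of the same feature. A common rule of thumb treats
# PSI > 0.2 as a signal worth reviewing; the data below are simulated.
import numpy as np

def psi(reference, recent, bins=10):
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    rec_pct = np.histogram(recent, bins=edges)[0] / len(recent)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # guard against log(0)
    rec_pct = np.clip(rec_pct, 1e-6, None)
    return float(np.sum((rec_pct - ref_pct) * np.log(rec_pct / ref_pct)))

rng = np.random.default_rng(0)
reference_lactate = rng.normal(2.0, 0.8, 5000)  # distribution when the model was built
recent_lactate = rng.normal(2.6, 1.0, 1000)     # practice patterns have drifted

print(f"PSI = {psi(reference_lactate, recent_lactate):.2f}")
```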
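Finally, a brief hypothetical illustration of the “apples to oranges” problem: the same 10% relative improvement translates into very different absolute changes depending on the baseline standard of care, which is why the baseline has to be understood before studies are compared.

```python
# Hypothetical baselines; each site reports a 10% relative reduction.
sites = {
    "high-reliability org":           0.06,  # 6% baseline sepsis mortality
    "historically poor-outcomes org": 0.15,  # 15% baseline sepsis mortality
}

for name, baseline in sites.items():
    improved = baseline * 0.90
    print(f"{name}: {baseline:.1%} -> {improved:.1%} "
          f"({baseline - improved:.1%} absolute reduction)")
```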
When dealing with predictive AI tools in the healthcare space, the stakes could not be higher. Predictive solutions need to be monitored and evaluated to ensure effectiveness; otherwise, it’s likely the tools will have no impact or, worse, will harm patients. Understanding the common mistakes made, as well as the best practices for evaluation, will help health systems identify solutions that are safe, robust, and reliable, and ultimately help physicians and care team members deliver safer, higher quality care.
Learn more about Bayesian Health’s research-first mentality, recent evaluations and outcome studies here.