The Language Problem at the Heart of AI Evaluation in Healthcare
There is a question I ask the health leaders I work with quite often: "What do you mean by that word?"
One of two things happens. Either they mean something different than I assumed, which changes everything about how we proceed, or they say, "we don't actually have a definition." Both are useful answers. At least now we know.
That question, I have come to believe, gets at one of the most persistent problems in AI evaluation in healthcare today: people using the same word but meaning different things.
I run into this challenge with the word 'evaluation' itself. To a clinician, evaluation may mean: does this tool change how I make decisions at the bedside, and does it improve outcomes for my patients? To a developer or manufacturer: does the tool perform as specified against the validation dataset? To a regulator: does this product meet the technical standards required for market approval? To a program evaluator like me, it means something broader: to what extent does this AI-enabled initiative achieve its intended results, for whom, how, and under which circumstances?
My colleagues around the world, working in different fields, point to other 'problematic' terms. Take 'human-in-the-loop': how many humans, in how many loops? Which humans, in which loops? Or 'compliance': legal compliance or regulatory compliance?
None of these definitions is wrong. But when each stakeholder assumes the others share their definition, we have a problem. And it is precisely what makes multi-perspective AI evaluation in healthcare both difficult and necessary.
The answer is not a universal glossary handed down by a single authority. It is a sustained, multi-sector practice of asking, rigorously and repeatedly: "What do you mean by that word?"
A different kind of dictionary: built through debate
Earlier this week I posted on LinkedIn about the work we are doing in our Working Group on AI Evaluation and Validation in Healthcare at HealthAI — The Global Agency for Responsible AI in Health. We set out to do something deliberately unusual: build a dictionary of AI evaluation terms through structured debate. If we reach consensus, that's great, and if we don't, that is still valuable. At least now we know what the term means to others.
Our approach is this: each term is examined through five distinct sector lenses, specifically manufacturer and developer, regulator and policymaker, clinical and operational user, buyer and procurement professional, and program evaluation specialist. A term may end up with one definition, several, or one definition with important documented nuances, depending on what the debate surfaces.
Our working group has 20 members leading diverse initiatives across 15 countries and 9 time zones. Each month, we debate a single term and document what we learn. We want to produce definitions that are useful when the room includes a regulator from Germany, a clinician from Kenya, a developer from Canada, and a program evaluator working across all three contexts simultaneously. Note that I said useful, not perfect.

The program evaluation lens
I co-chair the working group and bring a program evaluation lens to these discussions. Evaluation as a discipline has established frameworks and hard-won lessons about evaluating complex systems. Bringing that knowledge into dialogue with AI development, clinical implementation, and regulatory process is the kind of cross-sector work that responsible AI in health requires. Failing to connect these dialogues would be a missed opportunity.
What this means for leaders evaluating AI-enabled solutions in healthcare
Here are my suggestions for leaders trying to evaluate AI-enabled solutions:
- Start by mapping your definitions before you design a framework. Do not assume that "validation" means the same thing to your developer partner as it does to your clinical team. Ask explicitly. Document the answers. If you hear disagreement or confusion, that is telling you something important about where implementation risks are likely to emerge. Spend enough time talking about the nuances.
- Bring a program evaluation specialist in early, not at the reporting stage. Treating AI evaluation as a purely technical exercise, the domain of data scientists and developers, produces evaluations that are technically sound but strategically incomplete. Program evaluation adds the frameworks, the equity lens, and the systems perspective needed to fill that gap. Program evaluators are also very good at asking, "What do you mean by that question?"
Follow this series for monthly reflections on what the working group is learning, one term at a time.
Views expressed are my own.
Roxana Salehi, PhD, is a strategic evaluation consultant and the founder of Vitus Consulting. She co-chairs the Working Group on AI Evaluation and Validation in Healthcare at HealthAI — The Global Agency for Responsible AI in Health, and is the creator of the RECAP Framework for program theory construction. Her work spans AI-enabled health initiative evaluation, complex program evaluation, and strategic advisory for health sector leaders across Canada and internationally.