AI EVALUATION IN HEALTHCARE: A Decision-Analytic Tool for Selecting Meaningful Performance Metrics
DRAFT: V. 01. June 2026
Purpose: This tool helps healthcare teams working with AI select performance metrics that are meaningful for their context. It does not prescribe metrics; it guides teams through a structured selection process using three questions. The examples provided are illustrative and should be used alongside expert judgment and adapted for your specific project.
What is a performance metric?
A performance metric is a quantitative or qualitative measure used to assess AI in a healthcare context. Metrics can relate to activities, processes, products (including services), or systems[1]. In AI evaluation in healthcare, what counts as a valid and meaningful performance metric depends on three factors: the evaluand, the unit of analysis, and the evaluation lens. This tool walks you through each one.
How to use this tool
Answer the three questions below. Your answers together determine which metrics are meaningful for your specific project.
STEP 1: START HERE
What is your evaluand? What is the ‘thing’ you are evaluating?
→ AI model
· Sensitivity and specificity metrics[2]
· Explainability metrics (e.g., heatmaps, feature importance scores)[3]
· Traceability metrics (e.g., logging of AI inputs/outputs, audit trail)[4]
· Bias and fairness metrics (e.g., statistical parity, equalised odds)[5]
· Other relevant metrics depending on the stage of the AI lifecycle (design, development, evaluation, or post-deployment)
→ Implementation
· Cost
· Fidelity[6]
· Usability and acceptability (e.g., System Usability Scale score, time to clinician competency)[7]
· Other implementation-specific metrics based on taxonomies of implementation science constructs
→ Impact (meaningful benefit to people or organizations)
· For patients (e.g., improved health outcomes)
· For clinicians (e.g., minutes saved per encounter/day, confidence regarding clinical decision-making)
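As an illustration of the AI model metrics listed in Step 1, the sketch below computes sensitivity, specificity, and two simple fairness gaps from binary predictions. It is a minimal example under stated assumptions, not a recommended implementation: the data are invented, the function names (`rates`, `rates_by_group`, `gap`) are hypothetical, and reporting fairness as the largest difference between groups is only one of several possible conventions.

```python
from collections import defaultdict

def rates(y_true, y_pred):
    """Basic rates for binary labels, where 1 = positive and 0 = negative.
    Assumes at least one actual positive and one actual negative case."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "false_positive_rate": fp / (fp + tn),
        "positive_prediction_rate": (tp + fp) / len(pairs),
    }

def rates_by_group(y_true, y_pred, groups):
    """Compute the same rates separately for each group label."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: rates(ts, ps) for g, (ts, ps) in buckets.items()}

def gap(per_group, key):
    """Largest difference in one rate between any two groups (0 = parity)."""
    values = [r[key] for r in per_group.values()]
    return max(values) - min(values)

# Invented toy data: 12 cases split across two hypothetical groups.
y_true = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
groups = ["A"] * 6 + ["B"] * 6

overall = rates(y_true, y_pred)
per_group = rates_by_group(y_true, y_pred, groups)

print("Sensitivity:", round(overall["sensitivity"], 3))
print("Specificity:", round(overall["specificity"], 3))
# Statistical parity compares positive prediction rates across groups.
print("Statistical parity gap:", round(gap(per_group, "positive_prediction_rate"), 3))
# Equalised odds compares both true positive and false positive rates.
print("Equalised odds gaps (TPR, FPR):",
      round(gap(per_group, "sensitivity"), 3),
      round(gap(per_group, "false_positive_rate"), 3))
```

A gap of 0 would indicate identical rates across groups. What size of gap is acceptable is a judgment for the project team and is not something the code can decide.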
STEP 2
What is the unit of analysis for your evaluand?
The unit of analysis determines the level at which metrics are collected, aggregated, and reported. An AI system assessed at the level of a single project will call for different metrics than the same system assessed across a multi-country partnership. These are some potential units:
· AI model
· An individual AI project
· Program
· Alliance, network, or international partnership
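To show why the unit of analysis matters, the sketch below reports the same sensitivity data at two levels: per project, and aggregated across a program. The project names and counts are invented, and the two aggregation rules shown (pooling all cases versus averaging project-level values) are assumptions chosen for illustration rather than a method this tool prescribes.

```python
# Invented counts of true positives (tp) and false negatives (fn)
# for three hypothetical projects within one program.
projects = {
    "clinic_a": {"tp": 45, "fn": 5},
    "clinic_b": {"tp": 8, "fn": 2},
    "clinic_c": {"tp": 120, "fn": 80},
}

def sensitivity(tp, fn):
    """Proportion of actual positive cases correctly identified."""
    return tp / (tp + fn)

# Unit of analysis: individual AI project.
per_project = {name: sensitivity(**c) for name, c in projects.items()}

# Unit of analysis: program, pooling every case (large sites dominate).
total_tp = sum(c["tp"] for c in projects.values())
total_fn = sum(c["fn"] for c in projects.values())
pooled = sensitivity(total_tp, total_fn)

# Unit of analysis: program, averaging projects (each site counts equally).
macro_average = sum(per_project.values()) / len(per_project)

for name, value in per_project.items():
    print(f"{name}: {value:.2f}")
print(f"Program (pooled cases): {pooled:.2f}")
print(f"Program (average of projects): {macro_average:.2f}")
```

With these invented numbers the pooled figure is pulled towards the largest site, while the project average treats all sites equally, so the reported program-level value depends on an aggregation choice that should be stated explicitly.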
STEP 3
What is your evaluation lens?
There are many types of evaluation beyond randomized controlled trials, and the lens you choose determines which metrics you need. The items below are not just ‘nice-to-have’ domains to be aware of; they are meaningful metrics that you can operationalize, track, and hold yourself and your project accountable for. For example:
→ Theory-based Evaluation → contextual indicators tracked over time, such as staff attitudes before AI introduction (baseline), existing challenges with the electronic health records into which your AI tools are being integrated, team assumptions about the project, and the regulatory environment[8]
→ Feminist Evaluation → % of budget for patient engagement; childcare costs covered (Y/N)
→ Made in Africa Evaluation → % of salary spent in-country; number and composition of local team
→ Indigenous Evaluation → Indigenous data sovereignty principles in place (Y/N)[9]
→ Blue Marble Evaluation → energy consumption per AI task; use of model pruning, quantisation, edge computing (Y/N)[10]
References
• Hamid A, Nesma EID. ISO/IEC 22989:2022 Artificial intelligence: Concepts and terminology: Analytical study. The International Journal of Informatics, Media and Communication Technology. 2025;7(2):227–336.
• Lekadir K, Frangi AF, et al.; FUTURE-AI Consortium. FUTURE-AI: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ. 2025;388:e081554. https://doi.org/10.1136/bmj-2024-081554
• Proctor E, Silmere H, Raghavan R, et al. Outcomes for implementation research: conceptual distinctions, measurement challenges, and research agenda. Administration and Policy in Mental Health. 2011;38(2):65–76.
• Segur-Ferrer J, Moltó-Puigmartí C, Pastells-Peiró R, Vivanco-Hidalgo RM. Health Technology Assessment Framework: Adaptation for Digital Health Technology Assessment. Madrid: Ministry of Health; Barcelona: AQuAS, 2023.
• Vaca S, Patton MQ. The Periodic Table of Evaluation, v2.0. January 2025. https://www.researchgate.net/publication/383211246_The_Periodic_Table_of_Evaluation
• Medical Research Council / National Institute for Health Research. A framework for developing and evaluating complex interventions. 2021.
[1] See Hamid, A., & Nesma, E. I. D. (2025). ISO/IEC 22989: 2022 Artificial intelligence: Concepts and terminology
[2] Sensitivity: the proportion of actual positive cases that the AI model correctly identifies as positive. Specificity: the proportion of actual negative cases that it correctly identifies as negative.
[3] Explainability metrics assess whether the AI model's reasoning is visible and meaningful to users; e.g., a heatmap highlights which part of an image the model weighted most heavily in its decision.
[4] Traceability: the ability to document and audit the complete trajectory of AI inputs, outputs, and decisions across the system's lifecycle (see FUTURE-AI, Lekadir et al., 2025).
[5] Statistical parity: equal positive prediction rates across groups. Equalised odds: equal true positive and false positive rates across groups (see Lekadir et al., 2025; Fairlearn: https://fairlearn.org/).
[6] Fidelity: the degree to which the AI was used as intended in real-world practice, e.g., the percentage of eligible encounters where the AI tool was actually applied as designed.
[7] System Usability Scale (SUS): a validated 10-item questionnaire that produces a score from 0–100 reflecting users' perception of ease of use (Brooke, 1996).
[8] Theory-based evaluation maps the assumed causal pathway between activities, outputs, and outcomes, and tracks whether contextual conditions and assumptions hold over time (MRC/NIHR Framework).
[9] Indigenous data sovereignty refers to the right of Indigenous peoples to govern the collection, ownership, and application of data about their communities. See OCAP® principles (First Nations Information Governance Centre, Canada).
[10] Blue Marble Evaluation focuses on global-scale systems and environmental interdependence. See Patton MQ. Blue Marble Evaluation. 2020.
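Two of the implementation metrics defined in the notes above, fidelity[6] and the System Usability Scale score[7], reduce to simple arithmetic. The sketch below is a minimal illustration, assuming the standard SUS scoring rule (odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5). The function names and all input values are hypothetical.

```python
def sus_score(responses):
    """Score one completed SUS questionnaire.

    `responses` holds the 10 item ratings in order, each from 1 to 5.
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly 10 responses, each from 1 to 5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += rating - 1   # odd items are positively worded
        else:
            total += 5 - rating   # even items are negatively worded
    return total * 2.5            # rescale the 0-40 sum to 0-100

def fidelity_percent(applied_as_designed, eligible_encounters):
    """Percentage of eligible encounters where the tool was used as designed."""
    if eligible_encounters <= 0:
        raise ValueError("eligible_encounters must be positive")
    return 100 * applied_as_designed / eligible_encounters

# Invented example: one clinician's questionnaire and one month of encounters.
one_clinician = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print("SUS score:", sus_score(one_clinician))
print("Fidelity (%):", fidelity_percent(340, 400))
```

In practice SUS scores are usually averaged across respondents, and what counts as an "eligible encounter" or as use "as designed" has to be defined by the project team before fidelity can be measured.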