AI EVALUATION IN HEALTHCARE: A Decision-Analytic Tool for Selecting Meaningful Performance Metrics

01/06/2026
Roxana Salehi, PhD

DRAFT: V. 01. June 2026

Purpose: This tool helps healthcare teams working with AI select performance metrics that are meaningful for their context. It does not prescribe metrics; it guides teams through a structured selection process using three questions. The examples provided are illustrative and should be used alongside expert judgment and adapted for your specific project.

 

What is a performance metric?

A performance metric is a quantitative or qualitative measure used to assess AI in a healthcare context. Metrics can relate to activities, processes, products (including services), or systems[1]. In AI evaluation in healthcare, what counts as a valid and meaningful performance metric depends on three factors: the evaluand, the unit of analysis, and the evaluation lens. This tool walks you through each one.

How to use this tool

Answer the three questions below. Your answers together determine which metrics are meaningful for your specific project.

STEP 1 — START HERE

What is your evaluand?

What is the ‘thing’ you are evaluating?

  AI model 

·       Sensitivity and specificity metrics[2]

·       Explainability metrics (e.g., heatmaps, feature importance scores)[3]

·       Traceability metrics (e.g., logging of AI inputs/outputs, audit trail)[4]

·       Bias and fairness metrics (e.g., statistical parity, equalised odds)[5]

·       Other relevant metrics, depending on the stage of the AI lifecycle (design, development, evaluation, or post-deployment)
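To make the model-level metrics above concrete, the following is a minimal sketch of how sensitivity, specificity, statistical parity, and equalised odds could be computed from binary predictions. The labels, predictions, group names, and function names are hypothetical and exist only for illustration; the sketch assumes binary outcomes (1 = condition present) and that every group contains at least one positive and one negative case.

```python
from collections import defaultdict


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 means 'condition present'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp)


def group_rates(y_true, y_pred, groups):
    """Per-group positive prediction rate, true positive rate, and false positive rate."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    rates = {}
    for g, (ts, ps) in by_group.items():
        tp, tn, fp, fn = confusion_counts(ts, ps)
        rates[g] = {
            "positive_rate": (tp + fp) / len(ts),  # compared for statistical parity
            "tpr": tp / (tp + fn),                 # compared for equalised odds
            "fpr": fp / (fp + tn),                 # compared for equalised odds
        }
    return rates


def max_gap(rates, key):
    """Largest between-group difference for one rate (0 means the groups are equal)."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


# Hypothetical data: 12 patients in two illustrative groups, "A" and "B".
y_true = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
groups = ["A"] * 6 + ["B"] * 6

overall_sens, overall_spec = sensitivity_specificity(y_true, y_pred)
rates = group_rates(y_true, y_pred, groups)
parity_gap = max_gap(rates, "positive_rate")
tpr_gap = max_gap(rates, "tpr")
fpr_gap = max_gap(rates, "fpr")

print(round(overall_sens, 3), round(overall_spec, 3))  # 0.6 0.714
print(round(parity_gap, 3))                            # 0.167
print(round(tpr_gap, 3), round(fpr_gap, 3))            # 0.167 0.083
```

A gap of 0 on the positive prediction rate would correspond to statistical parity, and gaps of 0 on both the true positive and false positive rates would correspond to equalised odds. What size of gap is acceptable is a judgment for the project team, not something this sketch decides.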

  Implementation  

·       Cost

·       Fidelity[6]

·       Usability and acceptability (e.g., System Usability Scale score, time to clinician competency)[7]

·       Other implementation-specific metrics based on taxonomies of implementation science constructs
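As an illustration of two of the implementation metrics above, here is a minimal sketch that computes a fidelity percentage and a System Usability Scale (SUS) score. The function names and all numbers are hypothetical. The fidelity calculation assumes fidelity is operationalized as the share of eligible encounters in which the tool was applied as designed; the SUS calculation follows the standard scoring rule (odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5).

```python
def fidelity_rate(applied_as_designed, eligible_encounters):
    """Percentage of eligible encounters in which the AI tool was applied as designed."""
    if eligible_encounters <= 0:
        raise ValueError("eligible_encounters must be positive")
    return 100.0 * applied_as_designed / eligible_encounters


def sus_score(responses):
    """Score one respondent's 10 SUS answers (each 1-5) on the 0-100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly 10 responses, each an integer from 1 to 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5


# Hypothetical figures for illustration only.
fidelity = fidelity_rate(applied_as_designed=168, eligible_encounters=240)
usability = sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])

print(fidelity)   # 70.0
print(usability)  # 75.0
```

In practice, SUS scores would typically be averaged across respondents, and the definition of an "eligible encounter" would need to be agreed by the team before data collection begins.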

  Impact   → Meaningful benefit to people or organizations

·       For patients (e.g., improved health outcomes)

·       For clinicians (e.g., minutes saved per encounter/day, confidence regarding clinical decision-making)

 

STEP 2

What is the unit of analysis for your evaluand?

The unit of analysis determines the level at which metrics are collected, aggregated, and reported. An AI system assessed at the level of a single project will call for different metrics than the same system assessed across a multi-country partnership. Potential units of analysis include:

·       AI Model

·       An individual AI Project  

·       Program

·       Alliance, Network or International Partnership 
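The effect of the unit of analysis on aggregation and reporting can be sketched in a few lines. The example below takes one hypothetical metric (minutes saved per encounter) and reports it at two levels, project and program. The record fields, project names, and values are assumptions made for illustration, and a simple mean is only one of several possible aggregation choices.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical encounter-level records; field names are illustrative only.
encounters = [
    {"program": "P1", "project": "clinic_a", "minutes_saved": 4.0},
    {"program": "P1", "project": "clinic_a", "minutes_saved": 6.0},
    {"program": "P1", "project": "clinic_b", "minutes_saved": 2.0},
    {"program": "P2", "project": "clinic_c", "minutes_saved": 8.0},
]


def mean_by(records, level, field):
    """Average `field` within each value of `level` (the chosen unit of analysis)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[level]].append(record[field])
    return {unit: mean(values) for unit, values in buckets.items()}


by_project = mean_by(encounters, "project", "minutes_saved")
by_program = mean_by(encounters, "program", "minutes_saved")

print(by_project)  # {'clinic_a': 5.0, 'clinic_b': 2.0, 'clinic_c': 8.0}
print(by_program)  # {'P1': 4.0, 'P2': 8.0}
```

The program-level figure for P1 hides the difference between its two projects, which is one reason the unit of analysis should be chosen deliberately before metrics are reported.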

 

 

STEP 3

What is your evaluation lens?

Evaluation takes many forms beyond randomized controlled trials, and each evaluation lens calls for different metrics. The lenses below are not just ‘nice-to-have’ domains to be aware of: each one yields meaningful metrics that you can operationalize, track, and hold yourself and your project accountable for. For example:

 

  Theory-based Evaluation   → contextual indicators tracked over time, such as staff attitudes before AI introduction (baseline), existing challenges with the electronic health record systems into which your AI tools are being integrated, team assumptions about the project, and the regulatory environment[8]

  Feminist Evaluation  → % budget for patient engagement; childcare costs covered (Y/N)

  Made in Africa Evaluation   → % salary spent in-country; number and composition of local team

  Indigenous Evaluation   → Indigenous data sovereignty principles in place (Y/N)[9]

  Blue Marble Evaluation   → Energy consumption per AI task; use of model pruning, quantisation, edge computing (Y/N)[10]
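Several of the lens-specific metrics above are simple ratios, so they can be operationalized with very little code. The sketch below shows one way to compute a percentage-of-budget indicator, a percentage-of-salary indicator, and energy consumption per AI task. All function names and figures are hypothetical, and the choice of watt-hours per task as the energy unit is an assumption.

```python
def percent_of(part, whole):
    """Return `part` as a percentage of `whole`."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def energy_per_task_wh(total_energy_kwh, task_count):
    """Average energy per AI task in watt-hours, from total kilowatt-hours."""
    if task_count <= 0:
        raise ValueError("task_count must be positive")
    return 1000.0 * total_energy_kwh / task_count


# Hypothetical project figures for illustration only.
patient_engagement_pct = percent_of(part=15_000, whole=300_000)   # Feminist Evaluation
in_country_salary_pct = percent_of(part=120_000, whole=160_000)   # Made in Africa Evaluation
wh_per_task = energy_per_task_wh(total_energy_kwh=12, task_count=48_000)  # Blue Marble Evaluation

print(patient_engagement_pct)  # 5.0
print(in_country_salary_pct)   # 75.0
print(wh_per_task)             # 0.25
```

Yes/no indicators such as "childcare costs covered" or "Indigenous data sovereignty principles in place" need no calculation, but they can be recorded alongside these ratios so that all lens metrics are tracked in one place.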

 

References

     Hamid, A., & Nesma, E. I. D. (2025). ISO/IEC 22989: 2022 Artificial intelligence: Concepts and terminology: Analytical study. The International Journal of Informatics, Media and Communication Technology, 7(2), 227-336.

     Lekadir K, Frangi AF, et al.; FUTURE-AI Consortium. FUTURE-AI: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ. 2025;388:e081554. https://doi.org/10.1136/bmj-2024-081554

     Proctor E, Silmere H, Raghavan R, et al. Outcomes for implementation research: conceptual distinctions, measurement challenges, and research agenda. Administration and Policy in Mental Health. 2011;38(2):65–76.

     Segur-Ferrer J, Moltó-Puigmartí C, Pastells-Peiró R, Vivanco-Hidalgo RM. Health Technology Assessment Framework: Adaptation for Digital Health Technology Assessment. Madrid: Ministry of Health; Barcelona: AQuAS, 2023.

     Vaca S, Patton MQ. The Periodic Table of Evaluation, v2.0. January 2025. https://www.researchgate.net/publication/383211246_The_Periodic_Table_of_Evaluation

     Medical Research Council / National Institute for Health Research. A framework for developing and evaluating complex interventions. 2021.

 



[1] See Hamid, A., & Nesma, E. I. D. (2025). ISO/IEC 22989: 2022 Artificial intelligence: Concepts and terminology.

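[2] Sensitivity: the proportion of actual positive cases (e.g., patients who have the condition) that the AI model correctly identifies as positive. Specificity: the proportion of actual negative cases (e.g., patients who do not have the condition) that the AI model correctly identifies as negative.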

[3] Explainability metrics assess whether the AI model's reasoning is visible and meaningful to users; for example, a heatmap highlights which part of an image the model weighted most heavily in its decision.

[4] Traceability: the ability to document and audit the complete trajectory of AI inputs, outputs, and decisions across the system's lifecycle (see FUTURE-AI, Lekadir et al., 2025).

[5] Statistical parity: equal positive prediction rates across groups. Equalised odds: equal true positive and false positive rates across groups (see Lekadir et al., 2025; Fairlearn: https://fairlearn.org/).

[6] Fidelity: the degree to which the AI was used as intended in real-world practice, e.g., the percentage of eligible encounters where the AI tool was actually applied as designed.

[7] System Usability Scale (SUS): a validated 10-item questionnaire that produces a score from 0–100 reflecting users' perception of ease of use (Brooke, 1996).

[8] Theory-based evaluation maps the assumed causal pathway between activities, outputs, and outcomes, and tracks whether contextual conditions and assumptions hold over time (MRC/NIHR Framework).

[9] Indigenous data sovereignty refers to the right of Indigenous peoples to govern the collection, ownership, and application of data about their communities. See OCAP® principles (First Nations Information Governance Centre, Canada).

[10] Blue Marble Evaluation focuses on global-scale systems and environmental interdependence. See Patton MQ. Blue Marble Evaluation. 2020.