
Artificial intelligence (AI) holds great promise in various aspects of healthcare. Already, ChatGPT is able to produce clinical letters indistinguishable from those written by human doctors, while Google has developed an AI agent capable of clinical history-taking and diagnostic reasoning.
The use of such general medical AI (GMAI) has been studied extensively at the Psychometrics Centre at Cambridge Judge Business School, which has developed a new model for how policymakers and regulators can better evaluate the performance of different AI models in the medical field.
“GMAI can help assess health and support patient care,” says David Stillwell, Professor of Computational Social Science at Cambridge Judge. “But to use it safely, we need stronger evaluation methods to show where the technology is reliable and where it is not – and we still have a long way to go.”
Why benchmark-based medical exams fall short in evaluating AI performance
Current GMAI evaluations rely heavily on benchmarks – typically questions from medical licensing exams – and usually compare the GMAI’s score against the passing score for aspiring human doctors. But such a comparison “is unable to inform the types of errors GMAI makes, identify their weaknesses, or provide insight into GMAI’s performance on tasks not within the benchmark assessment”, finds the research by David and his co-authors, who include Luning Sun, Research Associate at the Psychometrics Centre at Cambridge Judge, and José Hernández-Orallo, Senior Research Fellow at the Leverhulme Centre for the Future of Intelligence at the University of Cambridge.

For example, GPT-4 was able to achieve a passing score on the Japanese national medical licensing examinations. “However, this seemingly promising result was coupled with the finding that large language models (LLMs) sometimes endorsed prohibited choices (such as euthanasia, which is strictly illegal in Japan) that should be strictly avoided in clinical practice,” says the research by David and his co-authors. “If one overlooks the types of errors in this case, the implementation of LLMs could lead to serious medical malpractice.”
For a human doctor, doing well on a medical exam is a good predictor of performing well in a range of medical tasks that are not in the exam. But for GMAI, that is not necessarily the case. Despite passing the exam with flying colours, GMAI still makes unexpected mistakes that a human doctor with an equivalent score would never make. That’s a problem, because we need to be able to trust GMAI in the same way that we trust doctors.
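To make this concrete, here is a minimal sketch in Python of the problem the researchers describe: two candidates with identical pass rates whose error profiles differ in exactly the way that matters. The numbers and error labels are invented for illustration and are not taken from the study.

```python
# Hypothetical illustration: two candidates with the same pass rate can
# differ sharply in the kinds of errors they make. All figures are invented.
from collections import Counter

# Each answer is tagged with an error type; "ok" means correct.
# "prohibited" marks answers a human doctor would never give
# (e.g. endorsing a treatment option that is illegal in that jurisdiction).
human_doctor = ["ok"] * 88 + ["knowledge_gap"] * 12
gmai = ["ok"] * 88 + ["knowledge_gap"] * 7 + ["prohibited"] * 5

for name, answers in [("human doctor", human_doctor), ("GMAI", gmai)]:
    counts = Counter(answers)
    score = counts["ok"] / len(answers)
    print(f"{name}: score={score:.0%}, error breakdown={dict(counts)}")

# Both candidates score 88% and "pass" the exam, but only the error
# breakdown reveals the GMAI's 5 prohibited answers – exactly what a
# single pass/fail comparison hides.
```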
“Who knows what the next patient will bring in: it may be something that no one has ever seen before – as both disease patterns and treatment trends change over time – and these generalised benchmarks may offer no predictive help in evaluating whether an AI doctor may deal with such a new situation effectively,” says David.
Assessing medical skills with AI: a psychometric approach to knowledge and behaviour evaluation
The research at the Psychometrics Centre therefore proposes a new methodology to improve the quality of GMAI evaluation, drawing on modern psychometric techniques. Put simply, the research recommends that we identify the psychological constructs that underlie being a successful medical professional, and then evaluate those constructs in GMAI.
The Psychometrics Centre researchers draw on previous research at the Leverhulme Centre which found that 3 factors – reasoning, comprehension and core language modelling – account for 82% of the variance in LLM performance across 27 cognitive tasks in HELM (Holistic Evaluation of Language Models), a benchmark developed at Stanford University to improve the transparency of language models.
By conceptually grouping the 27 tasks in this way, the Leverhulme research was able to articulate specific strengths and weaknesses of each LLM, and it helped address real-world research challenges such as data drift and distribution shifts that can distort predictive modelling through changes in statistical properties.
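As a rough illustration of this kind of dimensionality analysis, the sketch below decomposes a hypothetical models-by-tasks score matrix, using PCA as a simple stand-in for the factor analysis in the Leverhulme study. The random placeholder data will not reproduce the 82% finding; the sketch only shows the shape of the method.

```python
# A minimal sketch of the kind of analysis behind the "3 factors explain
# 82% of variance" finding: decompose a models x tasks score matrix.
# Random stand-in data; the actual HELM scores and the exact method used
# in the study may differ.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_models, n_tasks = 30, 27  # e.g. 30 LLMs scored on the 27 HELM tasks
scores = rng.random((n_models, n_tasks))  # placeholder for real scores

pca = PCA(n_components=3)
pca.fit(scores)
explained = pca.explained_variance_ratio_.sum()
print(f"Variance explained by 3 components: {explained:.0%}")

# On real benchmark data, inspecting which tasks load on which component
# is what lets researchers label the factors (e.g. reasoning,
# comprehension, core language modelling) and read off a given model's
# specific strengths and weaknesses.
```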
Alternative assessment formats to improve AI-based medical evaluation
Drawing on this research and psychometric principles, David and his co-authors suggest several assessment formats that are not limited to benchmarks – including practical, observational and situational assessment.
“Unlike benchmarks that tend to use a fixed set of static tasks, these alternative formats are more flexible in terms of what tasks are presented and how they are presented. Hence, they are more appropriate for constructs that are not covered by current benchmarks,” says the research.
For example, the research focuses on empathy in interactions with patients as an important competency that benchmarks alone may not properly evaluate. In everyday life, people may resort to gallows humour to defuse a difficult situation, and it can be a fitting way to lift someone’s spirits against a sad backdrop. In a healthcare setting, however, such a joke is clearly inappropriate. This is where situational assessment can be very useful: empathy may be impossible to measure in a medical knowledge exam, but a GMAI chatbot can be asked to conduct a conversation with a person who has just received tragic news, and its responses can then be judged. “Notably, since no standard answers are provided, these alternative assessment formats are less susceptible to data contamination,” says the research.
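A situational assessment of this kind might be harnessed roughly as follows. This sketch assumes hypothetical `ask_model` and `ask_rater` callables – neither is an API from the research – and shows the key structural difference from a benchmark: the reply is scored against a rubric rather than an answer key.

```python
# A minimal sketch of a situational assessment. `ask_model` and
# `ask_rater` are hypothetical callables (e.g. wrappers around a chat
# API and a human or validated rater) – not interfaces from the paper.

SCENARIO = (
    "You are speaking with a patient who has just been told their "
    "cancer has returned. The patient says: 'I don't know how to tell "
    "my children.' Respond as their clinician."
)

RUBRIC = {
    "acknowledges_emotion": "Does the reply name or validate the patient's feelings?",
    "avoids_humour": "Does the reply avoid jokes or flippant remarks?",
    "offers_support": "Does the reply offer concrete next steps or support?",
}

def situational_assessment(ask_model, ask_rater):
    """Run one open-ended scenario and score the reply criterion by criterion."""
    reply = ask_model(SCENARIO)
    scores = {}
    for criterion, question in RUBRIC.items():
        # Each rubric item is judged yes/no against the actual reply.
        # There is no standard answer key to leak into training data,
        # which is what makes the format resistant to contamination.
        scores[criterion] = ask_rater(f"{question}\n\nReply: {reply}")
    return reply, scores
```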
Balancing AI innovation with patient safety: the role of human oversight in medical AI
As GMAI is increasingly integrated into routine healthcare services, human oversight is expected to diminish – or at least become more selective and high level – as society grows more confident in the ability of LLMs to arrive at an acceptable result at lower cost. Yet David and his colleagues urge caution and collaboration across professions as these decisions are reached.
Such robust evaluation of GMAI “necessitates joint efforts of researchers and practitioners from computer science, medicine, as well as psychometrics and collaborations with health care institutions”, they conclude, as this will allow us “to determine where the AI systems are reliable and where they may need more assistance, preferably at a case-by-case level that takes into account the stakes at risk” to ensure a balance between innovation and safeguarding of patient well-being.
Featured academics
David Stillwell
Professor of Computational Social Science
Luning Sun
Research Associate
Featured research
Sun, L., Gibbons, C., Hernández-Orallo, J., Wang, X., Jiang, L., Stillwell, D., Luo, F. and Xie, X. (2025) “Beyond benchmarks: evaluating generalist medical artificial intelligence with psychometrics.” Journal of Medical Internet Research.
Burnell, R., Hao, H., Conway, A.R.A. and Hernández-Orallo, J. (2023) “Revealing the structure of language model capabilities.” arXiv.