
Artificial intelligence (AI) holds great promise in various aspects of healthcare. Already, ChatGPT is able to produce clinical letters indistinguishable from those written by human doctors, while Google has developed an AI agent capable of clinical history-taking and diagnostic reasoning.
The use of such general medical AI (GMAI) has been studied extensively at the Psychometrics Centre at Cambridge Judge Business School, which has developed a new model for how policymakers and regulators can better evaluate the performance of different AI models in the medical field.
“GMAI can help assess health and support patient care,” says David Stillwell, Professor of Computational Social Science at Cambridge Judge. “But to use it safely, we need stronger evaluation methods to show where the technology is reliable and where it is not – and we still have a long way to go.”
Why benchmark-based medical exams fall short in evaluating AI performance
Current GMAI evaluations rely heavily on benchmarks – typically questions from medical licensing exams – and usually compare the GMAI’s score against the passing score for aspiring human doctors. But such a comparison “is unable to inform the types of errors GMAI makes, identify their weaknesses, or provide insight into GMAI’s performance on tasks not within the benchmark assessment”, finds the research by David and his co-authors, who include Luning Sun, Research Associate at the Psychometrics Centre at Cambridge Judge, and José Hernández-Orallo, Senior Research Fellow at the Leverhulme Centre for the Future of Intelligence at the University of Cambridge.

For example, GPT-4 was able to achieve a passing score on the Japanese national medical licensing examinations. “However, this seemingly promising result was coupled with the finding that large language models (LLMs) sometimes endorsed prohibited choices (such as euthanasia, which is strictly illegal in Japan) that should be strictly avoided in clinical practice,” says the research by David and his co-authors. “If one overlooks the types of errors in this case, the implementation of LLMs could lead to serious medical malpractice.”
For a human doctor, doing well on a medical exam is a good predictor of performing well in a range of medical tasks that are not in the exam. But for GMAI, that is not necessarily the case. Despite passing the exam with flying colours, GMAI still makes unexpected mistakes that a human doctor with an equivalent score would never make. That’s a problem, because we need to be able to trust GMAI in the same way that we trust doctors.
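To make this concrete, here is a minimal sketch in Python of the problem the researchers describe: two candidates with identical pass rates whose error profiles differ in exactly the way that matters. The numbers and error labels are invented for illustration and are not taken from the study.

```python
# Hypothetical illustration: two candidates with the same pass rate can
# differ sharply in the kinds of errors they make. All figures are invented.
from collections import Counter

# Each answer is tagged with an error type; "ok" means correct.
# "prohibited" marks answers a human doctor would never give
# (e.g. endorsing a treatment option that is illegal in that jurisdiction).
human_doctor = ["ok"] * 88 + ["knowledge_gap"] * 12
gmai = ["ok"] * 88 + ["knowledge_gap"] * 7 + ["prohibited"] * 5

for name, answers in [("human doctor", human_doctor), ("GMAI", gmai)]:
    counts = Counter(answers)
    score = counts["ok"] / len(answers)
    print(f"{name}: score={score:.0%}, error breakdown={dict(counts)}")

# Both candidates score 88% and "pass" the exam, but only the error
# breakdown reveals the GMAI's 5 prohibited answers – exactly what a
# single pass/fail comparison hides.
```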
“Who knows what the next patient will bring in: it may be something that no one has ever seen before – as both disease patterns and treatment trends change over time – and these generalised benchmarks may offer no predictive help in evaluating whether an AI doctor may deal with such a new situation effectively,” says David.
Assessing medical skills with AI: a psychometric approach to knowledge and behaviour evaluation
The research at the Psychometrics Centre therefore proposes a new methodology to improve the quality of GMAI evaluation, drawing on modern psychometric techniques. Put simply, the research recommends that we identify the psychological constructs that underlie being a successful medical professional, and then evaluate those constructs in GMAI.
The Psychometrics Centre researchers draw on previous research at the Leverhulme Centre which found that 3 factors – reasoning, comprehension and core language modelling – account for 82% of the variance in LLM performance across 27 cognitive tasks in HELM (Holistic Evaluation of Language Models), a benchmark developed at Stanford University to improve the transparency of language models.
By conceptually grouping the 27 tasks in this way, the Leverhulme research was able to articulate specific strengths and weaknesses of each LLM, and it helped address real-world research challenges such as data drift and distribution shifts that can distort predictive modelling through changes in statistical properties.
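As a rough illustration of this kind of dimensionality analysis, the sketch below decomposes a hypothetical models-by-tasks score matrix, using PCA as a simple stand-in for the factor analysis in the Leverhulme study. The random placeholder data will not reproduce the 82% finding; the sketch only shows the shape of the method.

```python
# A minimal sketch of the kind of analysis behind the "3 factors explain
# 82% of variance" finding: decompose a models x tasks score matrix.
# Random stand-in data; the actual HELM scores and the exact method used
# in the study may differ.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_models, n_tasks = 30, 27  # e.g. 30 LLMs scored on the 27 HELM tasks
scores = rng.random((n_models, n_tasks))  # placeholder for real scores

pca = PCA(n_components=3)
pca.fit(scores)
explained = pca.explained_variance_ratio_.sum()
print(f"Variance explained by 3 components: {explained:.0%}")

# On real benchmark data, inspecting which tasks load on which component
# is what lets researchers label the factors (e.g. reasoning,
# comprehension, core language modelling) and read off a given model's
# specific strengths and weaknesses.
```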
Alternative assessment formats to improve AI-based medical evaluation
Drawing on this research and psychometric principles, David and his co-authors suggest several assessment formats that are not limited to benchmarks – including practical, observational and situational assessment.
“Unlike benchmarks that tend to use a fixed set of static tasks, these alternative formats are more flexible in terms of what tasks are presented and how they are presented. Hence, they are more appropriate for constructs that are not covered by current benchmarks,” says the research.
For example, the research focuses on empathy in interactions with patients as an important competency that benchmarks alone may not properly evaluate. In everyday life, people may resort to gallows humour to defuse a difficult situation, and it can be a fitting way to lift someone’s spirits against a sad backdrop. In a healthcare setting, however, such a joke is clearly inappropriate. This is where situational assessment can be very useful: empathy may be impossible to measure in a medical knowledge exam, but a GMAI chatbot can be asked to conduct a conversation with a person who has just received tragic news, and its responses can then be judged. “Notably, since no standard answers are provided, these alternative assessment formats are less susceptible to data contamination,” says the research.
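A situational assessment of this kind might be harnessed roughly as follows. This sketch assumes hypothetical `ask_model` and `ask_rater` callables – neither is an API from the research – and shows the key structural difference from a benchmark: the reply is scored against a rubric rather than an answer key.

```python
# A minimal sketch of a situational assessment. `ask_model` and
# `ask_rater` are hypothetical callables (e.g. wrappers around a chat
# API and a human or validated rater) – not interfaces from the paper.

SCENARIO = (
    "You are speaking with a patient who has just been told their "
    "cancer has returned. The patient says: 'I don't know how to tell "
    "my children.' Respond as their clinician."
)

RUBRIC = {
    "acknowledges_emotion": "Does the reply name or validate the patient's feelings?",
    "avoids_humour": "Does the reply avoid jokes or flippant remarks?",
    "offers_support": "Does the reply offer concrete next steps or support?",
}

def situational_assessment(ask_model, ask_rater):
    """Run one open-ended scenario and score the reply criterion by criterion."""
    reply = ask_model(SCENARIO)
    scores = {}
    for criterion, question in RUBRIC.items():
        # Each rubric item is judged yes/no against the actual reply.
        # There is no standard answer key to leak into training data,
        # which is what makes the format resistant to contamination.
        scores[criterion] = ask_rater(f"{question}\n\nReply: {reply}")
    return reply, scores
```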
Balancing AI innovation with patient safety: the role of human oversight in medical AI
As GMAI is increasingly integrated into routine healthcare services, human oversight is expected to diminish – or at least become more selective and high level – as society grows more confident in the ability of LLMs to arrive at an acceptable result at lower cost. Yet David and his colleagues urge caution and collaboration across professions as these decisions are reached.
Such robust evaluation of GMAI “necessitates joint efforts of researchers and practitioners from computer science, medicine, as well as psychometrics and collaborations with health care institutions”, they conclude, as this will allow us “to determine where the AI systems are reliable and where they may need more assistance, preferably at a case-by-case level that takes into account the stakes at risk” to ensure a balance between innovation and safeguarding of patient well-being.
Featured academics
David Stillwell
Professor of Computational Social Science
Luning Sun
Research Associate
Featured research
Sun, L., Gibbons, C., Hernández-Orallo, J., Wang, X., Jiang, L., Stillwell, D., Luo, F. and Xie, X. (2025) “Beyond benchmarks: evaluating generalist medical artificial intelligence with psychometrics.” Journal of Medical Internet Research.
Burnell, R., Hao, H., Conway, A.R.A. and Hernández-Orallo, J. (2023) “Revealing the structure of language model capabilities.” arXiv.