The study published yesterday (1 April) in the journal Nature also shows that AI can reason – but only to a point – resolving a long-running debate: contradictory claims about AI reasoning simply reflect tests that demand very different levels of the same abilities. The study uses the newly formulated ability profile – a fingerprint showing where an AI model is strong and where it breaks down – to explain why a model fails and to predict whether it will succeed, even on tasks it has never encountered before.
“To date, AI evaluation is not meeting the demands of a fast-changing and increasingly diverse AI ecosystem. Understanding and anticipating performance has become an urgent requirement for a swath of general-purpose AI systems,” says the study, adding that the new methodology is comprehensive and scalable in a way that addresses drawbacks in conventional AI evaluation, including a lack of explanatory and predictive power.
Co-authors of the new study include David Stillwell, Professor of Computational Social Science and Academic Director of the Psychometrics Centre at Cambridge Judge Business School, and Luning Sun, Research Associate at the Psychometrics Centre.


New methodology predicts how AI will perform on unseen tasks
The key to the new research is a methodology called ADeLe (AI Evaluation with Demand Levels), which goes beyond measuring aggregate accuracy on a set of benchmarks: it extracts a set of broad capability dimensions that profile both benchmarks and large language models (LLMs), enabling predictions that transfer to unseen tasks.
The new system organises the vast space of cognitive tasks that LLMs face into just 18 key dimensions – including attention, reasoning and how unusual the task is – and then scores any real-world task on each dimension according to how much it demands of that capability. By running a model through enough of these demand-scored tasks, its ability profile emerges.
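To make the mechanics concrete, here is a minimal Python sketch of the idea rather than the authors' implementation: the three dimension names, the 0-5 level scale and the logistic link are illustrative assumptions, standing in for the 18 dimensions and the fitted curves described in the study.

```python
import math

# A minimal sketch of the ADeLe idea, not the authors' implementation.
# The dimension names, the 0-5 level scale and the logistic link are
# illustrative assumptions; the paper scores 18 demand dimensions and
# derives ability profiles from demand-scored tasks.

DIMENSIONS = ["attention", "reasoning", "unusualness"]  # 3 of the 18, for brevity

def p_success(ability: dict, demands: dict, slope: float = 1.0) -> float:
    """Predicted probability that the model solves a task, assuming
    (as an illustration) that the hardest dimension dominates: success
    hinges on the smallest gap between ability level and demand level."""
    worst_gap = min(ability[d] - demands[d] for d in DIMENSIONS)
    return 1.0 / (1.0 + math.exp(-slope * worst_gap))  # logistic link

# Hypothetical ability profile (levels 0-5) estimated from demand-scored tasks
model_profile = {"attention": 3.2, "reasoning": 2.5, "unusualness": 1.8}

# An unseen task, scored on how much it demands of each capability
task_demands = {"attention": 2.0, "reasoning": 4.0, "unusualness": 1.0}

print(f"Predicted success: {p_success(model_profile, task_demands):.2f}")
```

The design point is that once a task's demands and a model's abilities are expressed on the same scales, predicting performance on an unseen task reduces to comparing the two profiles, dimension by dimension.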
Key findings greatly advance understanding of LLMs
Using ADeLe, the research team evaluated numerous AI benchmarks and uncovered four key findings:
1. Current AI benchmarks do not measure what they claim to measure, often testing abilities they were not designed to test.
2. AI models show distinct patterns of strengths and weaknesses across capabilities, depending on their size, reasoning methodology and model family.
3. The new ADeLe system accurately explains and predicts whether an AI system will succeed or fail on a specific new task.
4. Conflicting research on whether AI models are capable of reasoning is partially right on both sides – the two camps are simply talking about different difficulty levels. Some current AI benchmarks require only basic problem-solving, while others need advanced logic, abstraction and deep domain knowledge (see the sketch after this list).
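The fourth finding can be illustrated with a short, hedged sketch in the same vein: the reasoning-ability value and the logistic curve below are assumed for illustration, not taken from the study.

```python
import math

# Illustrative only: how one ability profile can make both camps in the
# reasoning debate look right. The numbers are assumed, not from the study.

def p_success(ability: float, demand: float, slope: float = 1.5) -> float:
    """Logistic curve: success probability rises as ability exceeds demand."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - demand)))

reasoning_ability = 2.5  # hypothetical level on a 0-5 reasoning scale

for demand in range(1, 6):
    p = p_success(reasoning_ability, demand)
    verdict = "looks like it can reason" if p > 0.5 else "looks like it cannot"
    print(f"demand level {demand}: p(success) = {p:.2f} -> {verdict}")
```

Under these assumed numbers, a benchmark built from level-1 and level-2 items would conclude the model reasons, while one built from level-4 items would conclude the opposite – even though both are measuring the same underlying ability profile.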
Crucial steps to make AI evaluation fit for purpose
The study concludes:
“The methodology we have presented and illustrated is comprehensive, scalable and standardising, addressing many of the issues of conventional AI evaluation practice: a lack of explanatory and predictive power, as well as saturation and overfitting to specific populations of benchmarks and AI systems respectively.
“With the pace and penetration of general-purpose AI, a rigorous, scalable and pipelined evaluation had been urgently demanded by researchers, companies, third-party evaluators, policy-makers and regulators.
“In a moment where AI evaluation is at the crux of research and regulations, and the science of evaluation had not yet digested the pace of general-purpose AI, our work takes crucial steps to make AI evaluation fit for purpose.”
The study, titled General Scales Unlock AI Evaluation with Explanatory and Predictive Power, was led by the University of Cambridge and co-authored by researchers from a range of institutions, including Microsoft, the Educational Testing Service in the US, and the Centre for Automation and Robotics in Spain, as well as researchers at Princeton, Carnegie Mellon, William & Mary, and the Universitat Politècnica de València in Spain. At the University of Cambridge, the authors include academics from the Leverhulme Centre for the Future of Intelligence, the Psychometrics Centre at Cambridge Judge Business School, and the Departments of Engineering, Psychology, and Theoretical and Applied Linguistics.
Featured research
Zhou, L., Pacchiardi, L., Martínez-Plumed, F., Sun, L., Stillwell, D., et al. (2026) “General scales unlock AI evaluation with explanatory and predictive power.” Nature, 652: 58-67 (DOI: 10.1038/s41586-026-10303-2)




