ADeLe v1.0: A battery for AI Evaluation with explanatory and predictive power

16,108 Annotated Instances
63 Tasks from 20 Benchmarks
18 Demand Dimensions

A collaborative community initiated by researchers at the Leverhulme Centre for the Future of Intelligence (Cambridge University), Microsoft Research, and the Center for Information Technology Policy (Princeton University).

To Manuel Cebrian, in memoriam.

The ADeLe (Annotated-Demand-Levels) battery includes 63 tasks from 20 benchmarks and was introduced in the original paper. This battery was annotated using 18 rubrics for Demand-Level-Annotation (DeLeAn v1.0) of general scales.

ADeLe, for the first time, enables researchers to infer the ability profiles of LLMs, comprehensively explaining what they can and cannot do. This makes it possible to understand and extrapolate benchmark results, and thus to anticipate when and where LLMs will perform reliably and safely at the instance level. At the same time, ADeLe can be extended by applying the DeLeAn rubrics to new benchmarks, which also clarifies what those benchmarks really measure. The figure below (from the original paper) shows this process.

Our methodology

From the original paper.

Demand Annotation of Benchmarks on General Scales

The Considered Dimensions

The DeLeAn rubrics consider 7 broad capabilities from Tolan et al. (2021), grounded in cognitive science and applicable to LLMs, and split some of them into subdimensions (leading to 11). They also include domain ‘knowledge’ (with 5 subdimensions) and 2 ‘extraneous’ dimensions, Atypicality and Volume, which account for elements that make the task more challenging independently of primordial or knowledge demands. An additional dimension, Unguessability, is computed algorithmically from the number of choices instead of using a rubric.

Dimension (Broad) | Dimension (Specific) | Description of Demands
AS Attention and Scan | AS Attention and Scan | Focus on or locate specific elements within a given stream of information or environment in the whole process of solving a task.
CE Comprehension and Expression | CEc Verbal Comprehension | Understand text, stories or the semantic content of other representations of ideas in different formats or modalities.
CE Comprehension and Expression | CEe Verbal Expression | Generate and articulate ideas, stories, or semantic content in different formats or modalities.
CL Conceptualisation, Learning and Abstraction | CL Conceptualisation, Learning and Abstraction | Build new concepts, engage in inductive and analogical reasoning, map relationships between domains, and generate abstractions from concrete examples.
MC Metacognition and Critical Thinking | MCr Identifying Relevant Information | Recognise what information helps solve the task or does not, and how this recognition process unfolds as they work toward the solution.
MC Metacognition and Critical Thinking | MCt Critical Thinking Processes | Monitor or regulate multiple thought processes to answer the question effectively, ranging from simple recall to high-level critical thinking.
MC Metacognition and Critical Thinking | MCu Calibrating Knowns and Unknowns | Recognise the boundaries of one's knowledge and confidently identify what one knows they know, knows they don't know, or is uncertain about.
MS Mind Modelling and Social Cognition | MS Mind Modelling and Social Cognition | Model the minds of other agents or reason about how the beliefs, desires, intentions, and emotions of multiple other agents might interact to determine future behaviours.
QL Quantitative and Logical Reasoning | QLl Logical Reasoning | Match and apply rules, procedures, algorithms or systematic steps to premises to solve problems, derive conclusions and make decisions.
QL Quantitative and Logical Reasoning | QLq Quantitative Reasoning | Work with and reason about quantities, numbers, and numerical relationships.
SN Spatial Reasoning and Navigation | SNs Spatio-physical Reasoning | Understand spatial relationships between objects and predict physical interactions.
KN Knowledge | KNa Knowledge of Applied Sciences | Knowledge or conceptual understanding in applied sciences (e.g., medicine, law, education, business, agriculture, engineering except IT).
KN Knowledge | KNc Customary Everyday Knowledge | Knowledge of information that most people in a given society typically acquire through daily life experiences, social interactions, and media.
KN Knowledge | KNf Knowledge of Formal Sciences | Knowledge or conceptual understanding in formal sciences (e.g., mathematics, logic, computer science, statistics).
KN Knowledge | KNn Knowledge of Natural Sciences | Knowledge or conceptual understanding in natural sciences (e.g., physics, chemistry, biology, astronomy, earth sciences, ecology).
KN Knowledge | KNs Knowledge of Social Sciences | Knowledge or conceptual understanding in social sciences and humanities (e.g., history, psychology, sociology, literature, art, philosophy).
AT Atypicality | AT Atypicality | How uncommon the task is or how unlikely it is that the instance has appeared in various sources (internet, textbooks, tests).
VO Volume | VO Volume | Proportional to the logarithm of the time a fully competent human needs to read and complete the task in ideal conditions, excluding interruptions.
UG Unguessability | UG Unguessability | The chance of error (percentage) of a task if following obvious cues or by random guess.
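
The Unguessability row can be read as a simple function of the number of answer options. The snippet below is a minimal sketch of that reading, not the authors' implementation: it assumes uniform random guessing, and both the function name `unguessability` and the treatment of open-ended questions (no options, so 100%) are assumptions made for illustration.

```python
from typing import Optional


def unguessability(num_choices: Optional[int]) -> float:
    """Chance of error (in percent) under uniform random guessing.

    Hypothetical helper: `num_choices` is the number of answer options,
    or None for an open-ended question (assumed unguessable here).
    """
    if num_choices is None:
        return 100.0
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 100.0 * (1.0 - 1.0 / num_choices)


print(unguessability(4))     # four-option multiple choice
print(unguessability(10))    # ten-option multiple choice
print(unguessability(None))  # open-ended question
```

Under these assumptions, a four-option question has a 75% chance of error by guessing, and the value approaches 100% as the number of options grows.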

The ADeLe Battery

The ADeLe battery is obtained by running the DeLeAn rubrics on 63 tasks from 20 benchmarks, shown below. Only a subset of instances from each task was included in the battery (see the original paper for details).

AGIEval · 4 benchmarks · 7 tasks · 1,931 instances
  Civil Service Examination · LogiQA-en · 408
  GRE & GMAT · AQuA-RAT · 203
  LSAT · LSAT-AR · 187
  LSAT · LSAT-LR · 470
  LSAT · LSAT-RC · 253
  SAT · SAT-En · 196
  SAT · SAT-Math · 214
ChemLLMBench · 1 benchmark · 5 tasks · 1,723 instances
  Molecule Captioning · 160
  Molecule Design · 295
  Name Prediction · 476
  Reaction Prediction · 412
  Retrosynthesis · 380
LiveBench · 4 benchmarks · 7 tasks · 291 instances
  Data Analysis · CTA · 33
  Language · Connections · 29
  Math · AMPS Hard · 69
  Math · Math Competition · 78
  Math · Olympiad · 26
  Reasoning · Spatial · 34
  Reasoning · Zebra Puzzle · 22
MMLU-Pro · 1 benchmark · 14 tasks · 5,431 instances
  Biology · 447
  Business · 410
  Chemistry · 368
  Computer Science · 345
  Economics · 428
  Engineering · 296
  Health · 411
  History · 304
  Law · 362
  Math · 425
  Other · 429
  Philosophy · 402
  Physics · 377
  Psychology · 427
MedCalcBench · 1 benchmark · 7 tasks · 556 instances
  Date · 27
  Diagnosis · 14
  Dosage · 20
  Lab · 180
  Physical · 214
  Risk · 84
  Severity · 17
OmniMath · 1 benchmark · 7 tasks · 1,664 instances
  Algebra · 337
  Applied Mathematics · 302
  Calculus · 30
  Discrete Mathematics · 314
  Geometry · 329
  Number Theory · 322
  Precalculus · 30
SciBench · 1 benchmark · 3 tasks · 355 instances
  Chemistry · 142
  Math · 105
  Physics · 108
TimeBench · 6 benchmarks · 10 tasks · 3,102 instances
  Date Arithmetic · 493
  MCTACO · 205
  MenatQA-Counterfactual · 130
  MenatQA-Order · 157
  MenatQA-Scope · 393
  TempReason-L2 · 318
  TempReason-L3 · 339
  TimeDial · 340
  TimeQA-explicit · 379
  TimeQA-implicit · 348
TruthQuest · 1 benchmark · 3 tasks · 1,055 instances
  E · 344
  I · 371
  S · 340
Total · 63 tasks from 20 benchmarks · 16,108 instances
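
As a quick consistency check, the per-task counts in the listing can be summed per source and overall. The sketch below uses only the task counts shown above; the dictionary layout and variable names are illustrative. The sums come to 63 tasks and 16,108 instances, matching the figure at the top of the page.

```python
# Per-task instance counts copied from the listing above, grouped by source.
counts = {
    "AGIEval": [408, 203, 187, 470, 253, 196, 214],
    "ChemLLMBench": [160, 295, 476, 412, 380],
    "LiveBench": [33, 29, 69, 78, 26, 34, 22],
    "MMLU-Pro": [447, 410, 368, 345, 428, 296, 411,
                 304, 362, 425, 429, 402, 377, 427],
    "MedCalcBench": [27, 14, 20, 180, 214, 84, 17],
    "OmniMath": [337, 302, 30, 314, 329, 322, 30],
    "SciBench": [142, 105, 108],
    "TimeBench": [493, 205, 130, 157, 393, 318, 339, 340, 379, 348],
    "TruthQuest": [344, 371, 340],
}

# Subtotal per source, number of tasks, and grand total of instances.
group_totals = {name: sum(values) for name, values in counts.items()}
num_tasks = sum(len(values) for values in counts.values())
total_instances = sum(group_totals.values())

for name, subtotal in group_totals.items():
    print(f"{name}: {len(counts[name])} tasks, {subtotal:,} instances")
print(f"Total: {num_tasks} tasks, {total_instances:,} instances")
```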

Demand Distribution: What Do the Benchmarks Really Test For?

The annotations obtained with the DeLeAn rubrics make it possible to identify which demands the benchmarks composing ADeLe load on. The following image (from the original paper) shows the overall distribution of demands on the ADeLe battery.

Demands on the overall collection ADeLe
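
A demand distribution of this kind can be tallied directly from per-instance annotations. The sketch below is illustrative only: the `annotations` list, its dictionary layout, and the integer demand levels are assumptions, not the battery's actual file format.

```python
from collections import Counter

# Hypothetical annotations: one dict per instance, mapping a dimension code
# to its annotated demand level (integer levels are an assumption).
annotations = [
    {"AS": 1, "QLq": 3, "KNf": 2},
    {"AS": 0, "QLq": 4, "KNf": 3},
    {"AS": 1, "QLq": 3, "KNf": 0},
    {"AS": 2, "QLq": 0, "KNf": 0},
]


def demand_distribution(rows):
    """Return {dimension: Counter(level -> number of instances)}."""
    dist = {}
    for row in rows:
        for dim, level in row.items():
            dist.setdefault(dim, Counter())[level] += 1
    return dist


dist = demand_distribution(annotations)
for dim in sorted(dist):
    levels = ", ".join(f"level {lvl}: {n}" for lvl, n in sorted(dist[dim].items()))
    print(f"{dim}: {levels}")
```

Running the same tally on one benchmark at a time would give a per-benchmark demand profile.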

The annotations obtained with the DeLeAn rubrics also make it possible to identify which demands are correlated with one another, which is important for understanding what benchmarks really measure.

Correlation across demands in the ADeLe collection

From the original paper.
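
For readers who want to reproduce a correlation view on their own annotations, a pairwise correlation matrix is enough. The page does not say which coefficient was used, so the sketch below assumes Pearson correlation; the `levels` data and dimension subset are invented for illustration.

```python
from math import sqrt

# Hypothetical demand levels for five instances on three dimensions.
levels = {
    "QLq": [0, 1, 3, 4, 5],
    "KNf": [0, 2, 3, 4, 4],
    "KNs": [4, 3, 1, 0, 0],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Full matrix of pairwise correlations between dimensions.
dims = list(levels)
matrix = {a: {b: pearson(levels[a], levels[b]) for b in dims} for a in dims}

for a in dims:
    print(a, " ".join(f"{matrix[a][b]:+.2f}" for b in dims))
```

In this toy data the two formal dimensions rise together while the social-science knowledge demand falls, so the first pair correlates positively and the others negatively.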

Profiling LLM Capabilities

By testing an LLM on the ADeLe battery, an ability “profile” can be extracted, representing an ability level for each considered dimension. The plot below shows the profiles of the LLMs considered in the original paper.

Profiles of the considered LLMs

How Are Profiles Obtained?

From the annotated ADeLe battery and the instance-level results of an LLM, the ability value for each dimension is obtained by:

  1. Plotting the success probability of the LLM at increasing demand levels (characteristic curve). A dominant strategy is used to remove confounders: at any demand level, only instances of ADeLe where all other dimensions have demand lower than that of the considered dimension are kept.
  2. Fitting a logistic curve to these points and defining the ability score as the x-value (demand level) at which the fit equals 0.5.
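
The two steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the code used in the original paper: the instance layout (`demands`, `success`), the helper names, the toy data, and the choice to fit the logistic curve to individual outcomes by Newton's method are all assumptions.

```python
from collections import defaultdict
from math import exp


def dominant_instances(instances, dim):
    """Keep instances where every other dimension has a strictly lower demand."""
    kept = []
    for inst in instances:
        level = inst["demands"][dim]
        if all(v < level for k, v in inst["demands"].items() if k != dim):
            kept.append(inst)
    return kept


def characteristic_curve(instances, dim):
    """Observed success rate at each demand level of `dim`."""
    by_level = defaultdict(list)
    for inst in instances:
        by_level[inst["demands"][dim]].append(1.0 if inst["success"] else 0.0)
    return {lvl: sum(v) / len(v) for lvl, v in sorted(by_level.items())}


def fit_logistic(xs, ys, iterations=25):
    """Fit p = 1 / (1 + exp(-(b0 + b1 * x))) by Newton's method (assumed fit)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # entries of the (negated) Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def ability(instances, dim):
    """Demand level at which the fitted logistic curve equals 0.5."""
    kept = dominant_instances(instances, dim)
    xs = [inst["demands"][dim] for inst in kept]
    ys = [1.0 if inst["success"] else 0.0 for inst in kept]
    b0, b1 = fit_logistic(xs, ys)
    return -b0 / b1


# Hypothetical results: successes out of 10 at each "QLq" demand level.
successes_per_level = {1: 9, 2: 8, 3: 5, 4: 2, 5: 1}
instances = [
    {"demands": {"QLq": level, "KNf": 0}, "success": i < wins}
    for level, wins in successes_per_level.items()
    for i in range(10)
]
# One instance dominated by another dimension; the filter drops it.
instances.append({"demands": {"QLq": 2, "KNf": 4}, "success": False})

kept = dominant_instances(instances, "QLq")
print(f"kept {len(kept)} of {len(instances)} instances")
print("characteristic curve:", characteristic_curve(kept, "QLq"))
print(f"QLq ability: {ability(instances, 'QLq'):.2f}")
```

The toy data is symmetric around demand level 3, so the fitted curve crosses 0.5 there. Real data needs failures and successes to overlap across levels, otherwise the logistic fit does not converge.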

See the image below (from the original paper) for a visualization.

Characteristic curve

How to Contribute

We encourage and welcome inputs from others in the research community. Here are some ways you can help out:

  • Try It Out: Use the battery or rubrics and share your instance-level results with us, including the instance_id, prompt, LLM response, and accuracy of the response for each instance.
  • Expand the ADeLe Battery: Apply the rubrics to your preferred or new benchmarks.
  • Enhance the DeLeAn Rubrics: Expand the rubrics to include other levels or dimensions.
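
For the instance-level results mentioned in the first item, one convenient way to package them is one JSON record per instance. The sketch below is only a suggestion: apart from `instance_id`, the field names, the JSON Lines layout, and the example values are assumptions, and the project may prefer a different format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InstanceResult:
    instance_id: str  # identifier of the ADeLe instance
    prompt: str       # prompt sent to the LLM
    response: str     # raw LLM response
    accuracy: float   # score of the response, e.g. 1.0 correct, 0.0 incorrect


def to_jsonl(results):
    """Serialise results as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in results)


# Made-up records for illustration; real instance_id values come from the battery.
results = [
    InstanceResult("example-0001", "What is 2 + 2?", "4", 1.0),
    InstanceResult("example-0002", "What is 3 * 3?", "6", 0.0),
]
print(to_jsonl(results))
```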

We are working on easier ways for you to contribute directly to this initiative. In the meantime, please get in touch at jh2135 AT cam.ac.uk if you're interested in joining the effort.

BibTeX

Please consider citing our work if you found it useful:

@article{zhou2026general,
      title={General Scales Unlock AI Evaluation with Explanatory and Predictive Power},
      author={Zhou, Lexin and Pacchiardi, Lorenzo and Mart{\'i}nez-Plumed, Fernando and Collins, Katherine M. and Moros-Daval, Yael and Zhang, Seraphina and Zhao, Qinlin and Huang, Yitian and Sun, Luning and Prunty, Jonathan E. and Li, Zongqian and S{\'a}nchez-Garc{\'i}a, Pablo and Chen, Kexin Jiang and Casares, Pablo A. M. and Zu, Jiyun and Burden, John and Mehrbakhsh, Behzad and Stillwell, David and Cebrian, Manuel and Wang, Jindong and Henderson, Peter and Wu, Sherry Tongshuang and Kyllonen, Patrick C. and Cheke, Lucy and Xie, Xing and Hern{\'a}ndez-Orallo, Jos{\'e}},
      journal={Nature},
      year={2026},
      doi={10.1038/s41586-026-10303-2},
      url={https://www.nature.com/articles/s41586-026-10303-2},
}

Acknowledgements

This research project has benefitted from the Microsoft Accelerate Foundation Models Research (AFMR) grant program.

We thank Álvaro D. Gómez Antón and Felix Marti-Perez for contributing additional experiments.