ADeLe v1.0: A battery for AI Evaluation with explanatory and predictive power

This is a collaborative community, initiated by researchers at the Leverhulme Centre for the Future of Intelligence from Cambridge University and the Center for Information Technology Policy from Princeton University, for the use and extension of ADeLe v1.0, a battery for AI evaluation with explanatory and predictive power, currently focusing on LLMs.

The ADeLe (Annotated-Demand-Levels) battery includes 63 tasks from 20 benchmarks and was introduced in the original paper. This battery was annotated using 18 rubrics for Demand-Level-Annotation (DeLeAn v1.0) of general scales.

ADeLe, for the first time, enables researchers to infer the ability profiles of LLMs, comprehensively explaining what they can and cannot do. This makes it possible to understand and extrapolate benchmark results, and thus to anticipate when and where LLMs will perform reliably and safely at the instance level. At the same time, ADeLe can be extended by applying the DeLeAn rubrics to new benchmarks, which reveals what those benchmarks really measure. The figure below (from the original paper) shows this process.

Demand annotation of benchmarks on general scales

The considered dimensions

The DeLeAn rubrics cover 7 broad capabilities from Tolan et al. (2021), grounded in cognitive science (such as the Cattell–Horn–Carroll theory) and applicable to LLMs, and split them into subdimensions (11 in total). They also include domain ‘knowledge’ (with 5 subdimensions) and 2 ‘extraneous’ dimensions, Atypicality and Volume, to account for elements that make a task more challenging independently of primordial or knowledge demands. An additional dimension, Unguessability, is computed algorithmically from the number of choices instead of using a rubric.

Code	Dimension (Broad)	Code	Dimension (Specific)	Description of Demands
AS	Attention and Scan	AS	Attention and Scan	Focus on or locate specific elements within a given stream of information or environment in the whole process of solving a task.
CE	Comprehension and Expression	CEc	Verbal Comprehension	Understand text, stories or the semantic content of other representations of ideas in different formats or modalities.
		CEe	Verbal Expression	Generate and articulate ideas, stories, or semantic content in different formats or modalities.
CL	Conceptualisation, Learning and Abstraction	CL	Conceptualisation, Learning and Abstraction	Build new concepts, engage in inductive and analogical reasoning, map relationships between domains, and generate abstractions from concrete examples.
MC	Metacognition and Critical Thinking	MCr	Identifying Relevant Information	Recognise what information helps solve the task or does not, and how this recognition process unfolds as they work toward the solution.
		MCt	Critical Thinking Processes	Monitor or regulate multiple thought processes to answer the question effectively, ranging from simple recall to high-level critical thinking.
		MCu	Calibrating Knowns and Unknowns	Recognise the boundaries of one's knowledge and confidently identify what one knows they know, knows they don't know, or is uncertain about.
MS	Mind Modelling and Social Cognition	MS	Mind Modelling and Social Cognition	Model the minds of other agents or reason about how the beliefs, desires, intentions, and emotions of multiple other agents might interact to determine future behaviours.
QL	Quantitative and Logical Reasoning	QLl	Logical Reasoning	Match and apply rules, procedures, algorithms or systematic steps to premises to solve problems, derive conclusions and make decisions.
		QLq	Quantitative Reasoning	Work with and reason about quantities, numbers, and numerical relationships.
SN	Spatial Reasoning and Navigation	SNs	Spatio-physical Reasoning	Understand spatial relationships between objects and predict physical interactions.
KN	Knowledge	KNa	Knowledge of Applied Sciences	Knowledge or conceptual understanding in applied sciences (e.g., medicine, law, education, business, agriculture, engineering except IT).
		KNc	Customary Everyday Knowledge	Knowledge of information that most people in a given society typically acquire through daily life experiences, social interactions, and media.
		KNf	Knowledge of Formal Sciences	Knowledge or conceptual understanding in formal sciences (e.g., mathematics, logic, computer science, statistics).
		KNn	Knowledge of Natural Sciences	Knowledge or conceptual understanding in natural sciences (e.g., physics, chemistry, biology, astronomy, earth sciences, ecology).
		KNs	Knowledge of Social Sciences	Knowledge or conceptual understanding in social sciences and humanities (e.g., history, psychology, sociology, literature, art, philosophy).
AT	Atypicality	AT	Atypicality	How uncommon the task is or how unlikely it is that the instance has appeared in various sources (internet, textbooks, tests).
VO	Volume	VO	Volume	Proportional to the logarithm of the time a fully competent human needs to read and complete the task in ideal conditions, excluding interruptions.
UG	Unguessability	UG	Unguessability	The chance of error (as a percentage) when answering a task by following obvious cues or by random guessing.
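
Since Unguessability is derived from the number of choices rather than from a rubric, a minimal sketch of the random-guess case may help. It assumes uniform guessing over the available choices and treats open-ended items as fully unguessable; the function name `unguessability` and the handling of open-ended items are illustrative assumptions, and the original paper may map this value onto levels differently or account for obvious cues.

```python
from typing import Optional


def unguessability(num_choices: Optional[int]) -> float:
    """Chance of error, in percent, under uniform random guessing.

    Assumption (not from the source): open-ended items, passed as None,
    are treated as fully unguessable (100% chance of error).
    """
    if num_choices is None:
        return 100.0
    if num_choices < 1:
        raise ValueError("num_choices must be a positive integer")
    # With n equally likely choices, a random guess is wrong with
    # probability 1 - 1/n.
    return 100.0 * (1.0 - 1.0 / num_choices)


four_way = unguessability(4)     # a four-option multiple-choice item
open_ended = unguessability(None)
```

Under these assumptions a four-option item gets 75 and a binary item gets 50, so items with fewer choices count as easier to guess.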

The rubrics

Below we show the rubrics for each dimension.

The ADeLe battery

The ADeLe battery is obtained by applying the DeLeAn rubrics to 63 tasks from 20 benchmarks, shown below. Only a subset of instances from each task was included in the battery (see the original paper for details).

Source	Benchmark	Task	#Instances
AGIEval	Civil Service Examination	LogiQA-en	408
	GRE & GMAT	AQuA-RAT	203
	LSAT	LSAT-AR	187
		LSAT-LR	470
		LSAT-RC	253
	SAT	SAT-En	196
	SAT	SAT-Math	214
ChemLLMBench	ChemLLMBench	Molecule Captioning	160
		Molecule Design	295
		Name Prediction	476
		Reaction Prediction	412
		Retrosynthesis	380
LiveBench	Data Analysis	CTA	33
	Language	Connections	29
	Math	AMPS Hard	69
		Math Competition	78
		Olympiad	26
	Reasoning	Spatial	34
	Reasoning	Zebra Puzzle	22
MMLU-Pro	MMLU-Pro	Biology	447
		Business	410
		Chemistry	368
		Computer Science	345
		Economics	428
		Engineering	296
		Health	411
		History	304
		Law	362
		Math	425
		Other	429
		Philosophy	402
		Physics	377
		Psychology	427
MedCalcBench	MedCalcBench	Date	27
		Diagnosis	14
		Dosage	20
		Lab	180
		Physical	214
		Risk	84
		Severity	17
OmniMath	OmniMath	Algebra	337
		Applied Mathematics	302
		Calculus	30
		Discrete Mathematics	314
		Geometry	329
		Number Theory	322
		Precalculus	30
SciBench	SciBench	Chemistry	142
		Math	105
		Physics	108
TimeBench	Date Arithmetic	Date Arithmetic	493
	MCTACO	MCTACO	205
	MenatQA	MenatQA-Counterfactual	130
		MenatQA-Order	157
		MenatQA-Scope	393
	TempReason	TempReason-L2	318
	TempReason	TempReason-L3	339
	TimeDial	TimeDial	340
	TimeQA	TimeQA-explicit	379
	TimeQA	TimeQA-implicit	348
TruthQuest	TruthQuest	E	344
		I	371
		S	340

Demand distribution of ADeLe: what do the benchmarks really test for?

The annotations obtained with the DeLeAn rubrics make it possible to identify which demands the benchmarks composing ADeLe load on. The following image (from the original paper) shows the overall distribution of demands across the ADeLe battery.

Demands for each of the benchmarks

Below, you can see the demand distribution for each of the 20 benchmarks in the ADeLe battery separately (from the original paper).

Correlation across demands in the ADeLe collection

The annotations obtained with the DeLeAn rubrics also make it possible to identify which demands are correlated with one another, which is important for understanding what benchmarks really measure. The plot below (from the original paper) shows the correlation values.
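
As a rough illustration of how such correlations could be computed from per-instance annotations, the sketch below calculates pairwise Pearson correlations between demand columns. The choice of Pearson correlation, the function names, and the toy demand levels are all assumptions made for illustration; the original paper may use a different coefficient (a rank correlation could suit ordinal levels better).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / sqrt(var_x * var_y)


def demand_correlations(annotations):
    """Pairwise correlations between all annotated dimensions.

    `annotations` is a list of dicts mapping dimension code -> demand level,
    one dict per instance (a hypothetical layout, not the dataset's format).
    """
    dims = sorted(annotations[0])
    columns = {d: [row[d] for row in annotations] for d in dims}
    return {(a, b): pearson(columns[a], columns[b]) for a in dims for b in dims}


# Made-up demand levels for four instances, for illustration only.
toy_annotations = [
    {"QLq": 0, "KNf": 0, "VO": 1},
    {"QLq": 2, "KNf": 1, "VO": 2},
    {"QLq": 3, "KNf": 3, "VO": 1},
    {"QLq": 5, "KNf": 4, "VO": 2},
]
corr = demand_correlations(toy_annotations)
```

In this toy data, quantitative reasoning and formal-science knowledge rise together, so `corr[("QLq", "KNf")]` is strongly positive; a high value of this kind would suggest that a benchmark cannot easily separate the two demands.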

Profiling LLM capabilities

By testing an LLM on the ADeLe battery, an ability "profile" can be extracted, with one ability level for each considered dimension. The plot below shows the profiles of the LLMs considered in the original paper.

How are profiles obtained?

From the annotated ADeLe battery and the instance-level results of an LLM, the ability value for each dimension is obtained in two steps:

Plot the success probability of the LLM at increasing demand levels (the characteristic curve). A dominant strategy is used to remove confounders: at any demand level, only instances of ADeLe where all other dimensions have a demand lower than that of the considered dimension are kept.
Fit a logistic function to this curve; the ability score is then defined as the x-value (demand level) at which the logistic fit equals 0.5.
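
The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data layout (a demand dict plus a 0/1 success flag per instance), the function names, the Newton-step fitting routine, and the small ridge term that keeps the fit stable are all choices made for this example, and the toy data is invented.

```python
import math


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def dominant_subset(instances, dim):
    """Keep instances where every other dimension has a strictly lower demand."""
    kept = []
    for demands, success in instances:
        level = demands[dim]
        if all(v < level for d, v in demands.items() if d != dim):
            kept.append((level, success))
    return kept


def fit_logistic(points, ridge=1e-3, iters=50):
    """Fit P(success) = sigmoid(a + b * level) with ridge-regularised Newton steps."""
    a = b = 0.0
    for _ in range(iters):
        grad_a, grad_b = -ridge * a, -ridge * b
        h_aa, h_ab, h_bb = ridge, 0.0, ridge
        for x, y in points:
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            grad_a += y - p
            grad_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Solve the 2x2 Newton system in closed form.
        a += (h_bb * grad_a - h_ab * grad_b) / det
        b += (h_aa * grad_b - h_ab * grad_a) / det
    return a, b


def ability(instances, dim):
    """Demand level at which the fitted success probability equals 0.5."""
    a, b = fit_logistic(dominant_subset(instances, dim))
    return -a / b  # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0


# Invented toy results: success rates 0.9, 0.7, 0.5, 0.3, 0.1 at levels 1..5.
toy = []
for level, n_success in zip(range(1, 6), [9, 7, 5, 3, 1]):
    for i in range(10):
        toy.append(({"QLq": level, "KNf": level - 1}, 1 if i < n_success else 0))
toy.append(({"QLq": 2, "KNf": 4}, 1))  # dropped: KNf demand dominates
toy.append(({"QLq": 3, "KNf": 3}, 0))  # dropped: tie, not strictly lower

qlq_points = dominant_subset(toy, "QLq")
qlq_ability = ability(toy, "QLq")
```

Because the toy success rates are symmetric around level 3, the fitted ability for `QLq` comes out very close to 3. Whether ties between dimensions should be kept or dropped is a modelling decision; this sketch follows the strict "lower than" wording above.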

See the image below (from the original paper) for a visualization.

Characteristic curves for all considered LLMs and demands

Below, you can see the characteristic curves for all considered demands and LLMs from the original paper.

How to contribute

We encourage and welcome inputs from others in the research community. Here are some ways you can help out:

Try It Out: Use the battery or rubrics and share your instance-level results with us, including the information about instance_id, prompt, LLM response, and accuracy of the response.
Expand the ADeLe Battery: Apply the rubrics to your preferred or new benchmarks.
Enhance the DeLeAn Rubrics: Expand the rubrics to include other levels or dimensions.
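
For the first option, one possible way to package instance-level results is one JSON object per line. The sketch below is only a suggestion: apart from `instance_id`, the field names, the JSON Lines layout, and the example records are assumptions rather than a format required by the maintainers.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InstanceResult:
    instance_id: str   # identifier of the ADeLe instance
    prompt: str        # exact prompt sent to the LLM
    response: str      # raw LLM response
    accuracy: float    # score of the response, e.g. 1.0 correct, 0.0 incorrect


def to_jsonl(results):
    """Serialise results as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in results)


# Invented example records, for illustration only.
results = [
    InstanceResult("example-0001", "What is 2 + 2?", "4", 1.0),
    InstanceResult("example-0002", "What is 7 * 6?", "36", 0.0),
]
jsonl_text = to_jsonl(results)
```

Keeping the raw prompt and response alongside the score lets others re-grade the outputs or re-run the ability analysis without querying the model again.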

We are working on easier ways for you to contribute directly to this initiative. In the meantime, if you’re interested in joining the effort, please get in touch at jh2135 AT cam.ac.uk.

BibTeX

Please consider citing the original work if you found it useful:

@misc{zhou2025generalscalesunlockai,
      title={General Scales Unlock AI Evaluation with Explanatory and Predictive Power},
      author={Lexin Zhou and Lorenzo Pacchiardi and Fernando Martínez-Plumed and Katherine M. Collins and Yael Moros-Daval and Seraphina Zhang and Qinlin Zhao and Yitian Huang and Luning Sun and Jonathan E. Prunty and Zongqian Li and Pablo Sánchez-García and Kexin Jiang Chen and Pablo A. M. Casares and Jiyun Zu and John Burden and Behzad Mehrbakhsh and David Stillwell and Manuel Cebrian and Jindong Wang and Peter Henderson and Sherry Tongshuang Wu and Patrick C. Kyllonen and Lucy Cheke and Xing Xie and José Hernández-Orallo},
      year={2025},
      eprint={2503.06378},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.06378},
}