Humanity’s Last Exam: New AI Benchmark Challenges Models with PhD-Level Knowledge
Researchers from the Center for AI Safety and Scale AI have launched "Humanity’s Last Exam" (HLE), a test designed to measure how close advanced artificial intelligence (AI) models are to human expert-level knowledge across more than 100 subjects. The framework for the exam was published in the journal Nature on January 28, 2025, following the benchmark’s launch earlier that month.
The HLE is intended to serve as a demanding benchmark of AI capabilities: in initial testing, AI models scored far below human expert levels.
Exam Structure and Development
The HLE comprises 2,500 questions spanning more than 100 subjects. Its development involved over 1,000 subject-matter experts from 500 institutions across 50 countries. The exam features both multiple-choice and short-answer questions, all designed to have unambiguous, verifiable solutions that cannot be readily found through internet search or recalled from pre-existing training data.
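The article does not spell out how answers are verified beyond requiring unambiguous, verifiable solutions. As a rough illustration only, the following sketch shows what a single exam item and an exact-match verifier could look like in Python; the schema, field names, and normalization are hypothetical, not the benchmark’s actual format.

```python
from dataclasses import dataclass

@dataclass
class HLEItem:
    """One exam item. All field names are illustrative, not the official schema."""
    subject: str   # e.g., "classics" or "physics"
    prompt: str    # the question text
    kind: str      # "multiple_choice" or "short_answer"
    answer: str    # the single unambiguous reference solution

def is_correct(item: HLEItem, response: str) -> bool:
    """Exact-match check after light normalization; a stand-in for
    whatever verification procedure the benchmark actually uses."""
    return response.strip().lower() == item.answer.strip().lower()
```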
To ensure the exam’s difficulty for AI models, strict submission criteria were enforced: questions had to be precise, solvable, and non-searchable. Each submission was first tested against AI models, and any question answered correctly was rejected. Of more than 70,000 attempted submissions, roughly 13,000 questions stumped the large language models (LLMs) tested. These underwent further expert vetting and were narrowed down to the final 2,500, which generally reflect PhD-level difficulty. Example questions include a trivia item on Greek mythology and a physics problem involving forces during motion.
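The screening step described above amounts to an adversarial filter: a question survives only if every tested model gets it wrong. Below is a minimal sketch of that logic, reusing the hypothetical HLEItem and is_correct helper from the previous example; it is a reconstruction of the process as described, not the authors’ actual pipeline.

```python
def adversarial_filter(submissions, models):
    """Keep only questions that no tested model answers correctly.

    `submissions` is an iterable of HLEItem records; `models` is a list of
    callables mapping a prompt string to a response string. This is a
    hypothetical reconstruction of the screening described in the text.
    """
    survivors = []
    for item in submissions:
        if not any(is_correct(item, model(item.prompt)) for model in models):
            survivors.append(item)  # still stumps every model tested
    # Per the article, roughly 13,000 of 70,000+ submissions survived this
    # stage before expert vetting narrowed the set to the final 2,500.
    return survivors
```

One consequence of this design is that difficulty is defined relative to the models available at submission time, which may help explain why scores on the benchmark can climb quickly as newer models are released.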
Initial Performance of AI Models
Initial testing of the HLE included OpenAI’s GPT-4o and o1 models, Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet, and DeepSeek R1. OpenAI’s o1 system achieved the highest initial score at 8.3%.
As of February 12, 2026, Google’s Gemini 3 Deep Think holds the highest recorded score, at 48.4%. Researchers had previously projected that AI models could exceed 50% accuracy on HLE by the end of 2025. In comparison, human experts typically achieve scores around 90% in their respective domains.
Distinction from Other Benchmarks
The creators of the HLE state that its broad subject coverage differentiates it from other common benchmarking tools. They note that existing tests, such as the Massive Multitask Language Understanding (MMLU) dataset, often concentrate on narrower subsets of expert-level knowledge, primarily in areas such as coding and mathematics.
The HLE also aims to address issues of memorization and searchability found in benchmarks such as François Chollet’s ARC-AGI suite.
Gemini 3 Deep Think scored 84.6% on the ARC-AGI-2 benchmark yet only 48.4% on the HLE, illustrating the different challenges the two tests present.
Implications for Artificial General Intelligence (AGI)
The authors of the study emphasize that achieving a high score on the HLE does not signify the arrival of artificial general intelligence (AGI). They state that while high accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, it would not, by itself, indicate autonomous research capabilities or true AGI.
Manuel Schottdorf, a neuroscientist who contributed to the HLE, noted that while machines must be able to solve such questions, that ability alone is not sufficient to conclude that they possess true intelligence.