AI Could Beat Every Human Expert Within a Year — Scientists Are Stunned
Something extraordinary is unfolding in the world of artificial intelligence — and even the scientists building these systems are startled by how fast things are moving. According to a widely discussed report from the Daily Mail, AI could be just one year away from surpassing the combined knowledge of every human expert on Earth. That is not a science fiction premise. It is a conclusion drawn from cold, hard benchmark data — and the researchers behind the test are both proud of what they built and quietly alarmed by what they are seeing.
What Is “Humanity’s Last Exam” and Why Does It Matter?
Every few years, the AI research community creates a new benchmark exam to test how smart its models have become. For a long time, the gold standard was a test called MMLU — the Massive Multitask Language Understanding exam. It covered 57 academic subjects and was considered rigorous and challenging. Then something uncomfortable happened. By late 2024, cutting-edge AI systems from companies like OpenAI, Anthropic, and Meta were scoring above 90% on it. The test had essentially become too easy. Researchers needed something harder — much harder.
The answer arrived in January 2025: a jaw-dropping benchmark called Humanity’s Last Exam (HLE). Created jointly by the Center for AI Safety and Scale AI, HLE is designed to be the final closed-ended academic benchmark ever needed for artificial intelligence. The name is not hyperbole — it is a genuine assertion. Once AI can ace this test, researchers believe there will be no harder structured exam left to give it.
The Elon Musk Conversation That Started It All
The story of how this test came to exist is fascinating in itself. Dan Hendrycks, a machine learning researcher and director of the Center for AI Safety, was discussing the state of AI benchmarks with Elon Musk. Musk was unimpressed. He reportedly looked at the existing test questions, dismissed them as merely undergraduate level, and asked for problems that only a world-class expert could solve. That single comment sparked a global effort. Hendrycks went to work, partnering with Scale AI and issuing a call to the world's most brilliant academic minds: send us your hardest questions.
The response was overwhelming. Nearly 1,000 subject-matter experts from over 500 institutions across 50 countries participated. Professors, PhD holders, prize-winning mathematicians, and specialist researchers submitted problems from the cutting edge of their own fields. The questions ranged from translating ancient Palmyrene inscriptions to identifying microanatomical structures in hummingbird bones to solving research-level theoretical physics problems. A $500,000 prize pool incentivized the very best submissions. The result was a 2,500-question exam that sits at the absolute frontier of human knowledge. But to understand why this benchmark matters, it also helps to understand what people really want from AI, and how far today's systems still are from meeting those deeper human expectations.
Built to Defeat AI — And It Worked (At First)
The construction of the exam was deliberately adversarial. Every submitted question went through a ruthless filtering process. First, it was tested against several of the most advanced AI models available. If any model could answer the question correctly, it was immediately removed from consideration. Only questions that stumped the machines, or caused them to perform no better than random guessing, made it to the next stage. From a pool of over 70,000 attempted questions, roughly 13,000 made it past the AI filter. After two rounds of expert human review, 2,500 questions formed the final public dataset.
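The filtering stage is described only at a high level, but the logic amounts to a simple rejection gate. Here is a minimal Python sketch of that logic, with hypothetical interfaces (the Question record and the model callables) standing in for the real grading harness:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Question:
    prompt: str
    verified_answer: str
    n_choices: int = 0  # 0 for exact-match (free-response) questions

def passes_ai_filter(q: Question,
                     models: List[Callable[[str], str]],
                     n_trials: int = 5) -> bool:
    """Keep a question only if every frontier model fails to beat chance."""
    # Chance accuracy is 1/n for an n-option multiple-choice question
    # and effectively zero for a free-response question.
    chance = 1.0 / q.n_choices if q.n_choices else 0.0
    for model in models:
        hits = sum(model(q.prompt) == q.verified_answer
                   for _ in range(n_trials))
        if hits / n_trials > chance:
            return False  # some model can answer it, so it is too easy
    return True

def filter_submissions(subs: List[Question],
                       models: List[Callable[[str], str]]) -> List[Question]:
    # Survivors of this gate go on to two rounds of expert human review.
    return [q for q in subs if passes_ai_filter(q, models)]
```

This is what makes the filter so brutal: a question survives only if every model fails it on every attempt, or at least does no better than blind guessing.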
And the early results were humbling for AI. According to data published by Neuroscience News, GPT-4o scored a mere 2.7%, Claude 3.5 Sonnet reached just 4.1%, and OpenAI’s then-flagship o1 model managed only 8%. Meanwhile, human graduate students scored close to 90% on the same questions. The gap was enormous — and reassuring. For a moment, it seemed like AI still had a very long way to go.
Then the Scores Started Climbing — Fast
What has happened since those early scores is what is causing researchers to sit up in genuine alarm. In February 2025, OpenAI launched an autonomous AI agent called “deep research,” powered by its o3 model. That system scored 26.6% on HLE, more than triple the 8% that o1 had managed only weeks earlier. AI progress on this benchmark was not gradual; it was a leap. According to Fortune magazine, researchers noted that Humanity’s Last Exam could be saturated within the next 12 months, adding that this would mean AI had effectively surpassed expert-level technical knowledge and reasoning capability.
By late 2025 and into 2026, top scores continued to rise. Google’s Gemini 3.1 Pro Preview reached 44.7% on the benchmark, while Zoom AI announced a new state-of-the-art score of 48.1% — just barely below the critical 50% threshold. That threshold carries enormous significance. Hendrycks has stated plainly that once models start scoring above 50%, it is safe to conclude that humans have met their match in terms of structured academic knowledge and reasoning.
What 50% Actually Means — And What It Doesn’t
Before anyone panics or celebrates prematurely, it is important to understand exactly what the 50% milestone would — and would not — signify. The creators of HLE are careful to draw a clear line. Passing this exam with high accuracy would demonstrate that an AI can handle expert-level closed-ended questions across a breathtaking range of disciplines. It would mean that the machine knows more facts, can solve more structured problems, and can reason through more complex scenarios than virtually any human alive.
But it would not mean AI has achieved Artificial General Intelligence (AGI). As the HLE research team wrote in their paper published in Nature, the exam tests structured academic problems rather than open-ended research or creative problem-solving abilities. In other words, acing the exam means the AI can answer test questions about science — not necessarily that it can do science. This distinction is explored in depth in our article on AI’s biggest weakness — where researchers reveal that machines still don’t truly learn on their own the way humans do, and that genuine autonomous adaptation remains a distant goal.
The Exam Is Deliberately “Google-Proof”
One of the most ingenious aspects of HLE is its design against shortcuts. Each question has one unambiguous, verifiable answer — but that answer cannot be retrieved through a quick internet search. The test is specifically built to be “Google-proof.” An AI cannot cheat its way to success by looking things up; it must actually understand the material and reason through the answer. Questions involving deliberate traps, misleading numerical patterns, and plausible-but-wrong answer choices are woven throughout. This means that any high score is genuine. There is no gaming the system.
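Because each answer is short, unambiguous, and verifiable, scoring can be fully automatic. A minimal sketch of what such an exact-match grader might look like follows; the normalization rules here are illustrative assumptions, not HLE's published harness:

```python
def normalize(answer: str) -> str:
    """Illustrative canonicalization: whitespace, case, trailing period."""
    return " ".join(answer.strip().rstrip(".").lower().split())

def grade(model_answer: str, verified_answer: str) -> bool:
    """Exact match against the single verified answer; no partial credit."""
    return normalize(model_answer) == normalize(verified_answer)

def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction correct over (model_answer, verified_answer) pairs."""
    return sum(grade(m, v) for m, v in pairs) / len(pairs)
```

The point of grading this strictly is that a score is either earned or it is not; a fluent-sounding but wrong answer picks up no partial credit.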
The exam covers mathematics (41% of questions), biology and medicine (11%), computer science and AI (10%), physics (9%), humanities and social sciences (9%), chemistry (7%), engineering (4%), and a range of other subjects. Around 14% of questions require the model to analyze both text and images, adding an extra layer of challenge. It is, by any measure, the most comprehensive and demanding intelligence test ever assembled.
Why Old Benchmarks Are No Longer Good Enough
The urgency behind creating HLE reflects a broader problem in AI research: benchmark saturation. When MMLU was introduced in 2020, top AI models scored around 40% on it. Four years later, frontier models were clearing 90%. The benchmark had stopped being useful as a measure of true capability. As Dan Hendrycks and his co-authors wrote in the HLE paper, benchmarks are not keeping pace in difficulty, with leading language models now achieving more than 90% accuracy on popular benchmarks, limiting the ability to meaningfully measure state-of-the-art capabilities. This is the academic arms race in action — every time researchers build a wall, AI finds a way over it faster than expected.
The Stanford Institute for Human-Centered Artificial Intelligence’s AI Index 2025 Annual Report even cited HLE specifically as one of the more challenging benchmarks created in direct response to the saturation of earlier tests. The scientific community is paying close attention. Policymakers are watching too: specific score thresholds on HLE are reportedly being discussed as potential triggers for increased AI oversight and governance measures.
What Experts and Policymakers Are Saying
The reaction from the scientific community has ranged from fascination to genuine concern. According to a 2025 Gallup poll cited by Psychology Today, 80% of American adults were in favor of the government maintaining safety rules for AI, even if it means slowing development. That sentiment reflects a growing awareness that the technology is advancing faster than institutions, laws, and societal norms can keep up with.
Hendrycks himself is not a passive observer. As director of the Center for AI Safety, he has co-authored papers on AI risks alongside former Google CEO Eric Schmidt and Scale AI’s CEO Alexandr Wang, arguing that the development of superintelligence is a matter of national security. He serves as a safety adviser to Elon Musk’s xAI and has been vocal about the need for careful governance as AI systems approach and eventually surpass human expertise in structured domains.
The Race to the Top of the Leaderboard
The HLE leaderboard has become one of the most closely watched scoreboards in all of technology. Companies are investing significant resources to push their models higher on it, recognizing that performance on this benchmark signals genuine capability advancement. Zoom AI’s record 48.1% score, which edged past Google’s previous best, used an innovative “explore-verify-federate” agentic strategy that coordinates multiple AI models to reason collaboratively and check one another’s work. That collaborative architecture is itself a new frontier in AI design.
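Zoom has not published the full implementation, but the name suggests a three-stage pipeline: several models explore candidate solutions, a verifier audits them, and the survivors are federated into a single final answer. The sketch below is speculative; every interface in it is a hypothetical stand-in, not Zoom’s actual system.

```python
from collections import Counter
from typing import Callable, List

Model = Callable[[str], str]  # hypothetical stand-in: prompt in, answer out

def explore(question: str, solvers: List[Model]) -> List[str]:
    """Stage 1: each model independently proposes a candidate answer."""
    return [solve(question) for solve in solvers]

def verify(question: str, candidates: List[str], checker: Model) -> List[str]:
    """Stage 2: a checker model audits each candidate; keep the survivors."""
    template = ("Question: {q}\nProposed answer: {a}\n"
                "Reply YES if the answer is correct, otherwise NO.")
    return [a for a in candidates
            if checker(template.format(q=question, a=a))
               .strip().upper().startswith("YES")]

def federate(verified: List[str], proposals: List[str]) -> str:
    """Stage 3: majority vote over verified answers (or all, if none survive)."""
    pool = verified or proposals
    return Counter(pool).most_common(1)[0][0]

def answer(question: str, solvers: List[Model], checker: Model) -> str:
    proposals = explore(question, solvers)
    return federate(verify(question, proposals, checker), proposals)
```

Whatever the real details look like, the design principle is the same: spend extra inference-time compute cross-checking answers rather than trusting any single model’s first attempt.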
Progress on HLE is being described as breathtaking even by the standards of a field known for rapid advancement. From single-digit scores in early 2025 to nearly 50% within roughly 12 to 18 months, the trajectory of improvement has shocked even seasoned AI researchers. The benchmark’s own creators acknowledged that models could plausibly exceed 50% accuracy by the end of 2025, and based on the leaderboard data, that prediction appears close to coming true.
Not Without Controversy: Errors in the Exam Itself
No benchmark as ambitious as HLE comes without its share of controversy. An independent investigation by FutureHouse, published in July 2025, suggested that roughly 30% of the reference answers to HLE’s text-only chemistry and biology questions could actually be incorrect. That is a significant finding: if the official answers are wrong, AI models that get those questions “right” might in fact be getting them wrong, and vice versa. The HLE team has been working to address these issues through a bug bounty program and ongoing question refinement.
The team acknowledges the challenge openly: creating 2,500 expert-level questions at scale, across hundreds of disciplines, is simply hard. Even with nearly 1,000 PhD-level contributors and multiple review rounds, errors slip through. A dynamic version called HLE-Rolling was released in 2025 to allow ongoing updates. The benchmark is a living document — and that adaptability is part of what makes it a credible long-term measurement tool.
The Emotional Side of the AI Surge Nobody Is Talking About
While the research community debates benchmarks and scores, there is a parallel story unfolding in living rooms and on smartphones around the world. As AI systems grow more capable, more people are turning to them for emotional support, companionship, and advice — often without understanding the risks involved. The phenomenon of millions falling emotionally dependent on AI chatbots is not unconnected to the conversation about AI surpassing human expertise. When people believe — correctly or not — that an AI knows more than a human doctor or therapist, the temptation to lean on it becomes immense. And as researchers have documented in alarming detail, that emotional over-reliance can carry serious psychological consequences, particularly for vulnerable users.
What Comes After Humanity’s Last Exam?
One of the most thought-provoking questions raised by HLE’s rapid saturation is: what do we test AI on next? The benchmark was called “humanity’s last exam” precisely because its creators believed it would be the final closed-ended academic benchmark needed. Once AI blows past it, the whole concept of using structured question-and-answer tests to measure machine intelligence may become obsolete. We may need entirely new frameworks — ones that test creativity, original research capability, ethical reasoning, and real-world problem solving in ways that are far harder to score automatically.
Some researchers are already working on these next-generation evaluations. Open-ended research tasks, long-horizon problem solving, and multi-agent collaboration benchmarks are beginning to emerge. The AI community is in a race not just to build smarter systems — but to build better mirrors to see clearly just how smart those systems truly are.
Should We Be Worried — Or Excited?
The honest answer is: both. AI surpassing human expert-level knowledge on structured academic tests is a double-edged development. On the upside, it opens the door to AI systems that could accelerate scientific discovery, compress decades of medical research into years, and democratize access to world-class expertise in every corner of the globe. A brilliant AI “tutor,” “doctor,” or “engineer” available to anyone, anywhere, at any time, is a genuinely transformative possibility.
On the downside, AI systems that surpass human expertise in narrow but critical domains — like virology, cybersecurity, or weapons design — could pose risks that are very difficult to manage. The Center for AI Safety has already flagged concerns about AI models that can answer lab debugging questions better than human virology experts, raising dual-use concerns in biological research. The same intelligence that could cure diseases could, in the wrong hands, be used to create them. Getting the governance right — and getting it right fast — has never been more urgent.
A Historic Moment Unfolding in Real Time
We are living through a genuinely historic moment. Humanity’s Last Exam was designed to mark the point where AI transitions from impressive tool to something altogether more profound — a system capable of matching or exceeding the combined knowledge of our greatest experts. Based on the trajectory of scores, that moment may be months away, not years. The scientists who built this benchmark created it as a measuring stick for progress. What they are finding is that the stick is being broken faster than they anticipated. Whether that is cause for celebration, concern, or simply awe, one thing is clear: the world will never look quite the same once this threshold is crossed.
Source & AI Information: External links in this article are provided for informational reference to authoritative sources. This content was drafted with the assistance of Artificial Intelligence tools to ensure comprehensive coverage, and subsequently reviewed by a human editor prior to publication.
