For years, AI benchmarks have been the scoreboard of progress.

Researchers give models tests — math problems, coding challenges, trivia questions — and measure how well they perform.

But there’s a problem.

AI is starting to ace the exams.

Modern language models now score over 90% on popular benchmarks like Massive Multitask Language Understanding (MMLU). Tests that once measured the frontier of AI capability are becoming too easy.

So researchers decided to create something much harder.

They call it Humanity’s Last Exam.

A Test Designed to Break AI

The benchmark, created by the Center for AI Safety and Scale AI with contributions from nearly 1,000 experts worldwide, contains 2,500 extremely difficult academic questions.

These aren’t simple trivia questions.

They cover over 100 subjects, including:

• advanced mathematics
• physics and chemistry
• history and humanities
• engineering and computer science

Some questions even include diagrams or images that models must interpret.

The goal is simple:

Create a test that measures how close AI is to expert-level human knowledge.

Questions the Internet Can’t Answer

One rule defines the benchmark.

The answers cannot be easily found online.

Questions must require real reasoning or deep expertise.

Every submission goes through multiple steps:

  1. The question is tested against top AI models.

  2. If AI can answer it, the question is rejected.

  3. If AI fails, human experts review it.

  4. Only the best questions make the final exam.

After filtering more than 13,000 submissions, the researchers selected 2,500 questions that consistently stumped AI systems.
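
To make that gating concrete, here is a minimal sketch of the pipeline in Python. The `Submission` type, the exact-match answer check, and the unanimous-review rule are simplifications of my own, not the researchers’ actual code; the real review process was more involved, but the logic has the same shape:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Submission:
    text: str
    reference_answer: str

def filter_submission(
    sub: Submission,
    frontier_models: list[Callable[[str], str]],
    reviewers: list[Callable[[Submission], bool]],
) -> bool:
    """Return True if a submitted question survives every gate."""
    # Step 1: test the question against top AI models.
    answers = [model(sub.text) for model in frontier_models]

    # Step 2: if any model answers correctly, reject the question.
    if any(a.strip().lower() == sub.reference_answer.strip().lower()
           for a in answers):
        return False

    # Steps 3 and 4: the models failed, so human experts review the
    # question, and only unanimously approved questions survive.
    return all(review(sub) for review in reviewers)

# Toy usage: a "model" that always answers "42" and a lenient reviewer.
if __name__ == "__main__":
    q = Submission(text="What is the rank of the Lie group E8?",
                   reference_answer="8")
    print(filter_submission(q, [lambda _: "42"], [lambda _: True]))  # True
```

The key design choice is the order of the gates: the cheap automated check runs first, so scarce expert time is spent only on questions that already stump the models.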

The Result: AI Still Struggles

When the best AI models were tested on the benchmark, their performance was strikingly low.

In early evaluations, even the strongest systems answered fewer than one question in ten correctly, far below expert human levels.

Even more interesting:

The models often gave confident answers that were completely wrong.

Researchers found large calibration errors, meaning AI systems frequently believed they were correct even when they weren’t.

In other words, the models didn’t just struggle with the questions.

They also struggled to recognize when they were out of their depth.
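
One standard way to quantify that mismatch is a calibration error: ask the model to report a confidence alongside each answer, then compare stated confidence against actual accuracy. The sketch below is a generic binned RMS version with made-up numbers, an illustration rather than the paper’s exact method:

```python
import math

def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error: within each confidence bin,
    compare average stated confidence to actual accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total, sq_err = 0, 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        sq_err += len(bucket) * (avg_conf - accuracy) ** 2
        total += len(bucket)
    return math.sqrt(sq_err / total)

# A model that is confident but mostly wrong shows a large error:
confs   = [0.95, 0.90, 0.92, 0.88, 0.97]
correct = [0,    0,    1,    0,    0]     # 1 = answered correctly
print(round(rms_calibration_error(confs, correct), 2))  # 0.73
```

A well-calibrated model that is usually wrong should say so by reporting low confidence; the large error here comes from confidences near 95% paired with an accuracy of only 20%.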

Why This Benchmark Matters

Benchmarks shape how the world understands AI progress.

If every test becomes too easy, there is no way to measure real improvement.

Humanity’s Last Exam creates a new reference point.

If AI models eventually score highly on this benchmark, it would suggest something remarkable:

They are approaching expert-level knowledge across dozens of academic fields.

But for now, there is still a significant gap between current AI systems and human specialists.

The Race Between AI and Its Tests

History shows that AI benchmarks rarely stay difficult for long.

Tests that once seemed impossible are eventually solved.

The real question is not whether AI will improve.

It’s how quickly it will catch up to the hardest questions humans can design.

And when it does…

Researchers may need to create an even harder exam.
