Press Releases · March 26, 2026

Is AGI Here? New Benchmark Says Not Even Close

ARC-AGI-3 launched this week — every frontier AI model scored below 1% while untrained humans hit 100%. Jensen Huang said AGI is here. The data disagrees.

What to Know

  • ARC-AGI-3 dropped this week — every frontier model tested scored below 1%, while untrained humans solved 100% of environments
  • Gemini 3.1 Pro led all AI models at just 0.37%; GPT-5.4 scored 0.26%, Claude Opus 4.6 scored 0.25%, and Grok-4.20 scored exactly 0%
  • The benchmark uses RHAE scoring — penalizing inefficient AI behavior — and keeps 110 of 135 environments private to prevent training-based cheating
  • ARC Prize 2026 is offering $2 million in prize money across three Kaggle competition tracks, with all winning solutions required to be open-sourced

The new ARC-AGI-3 artificial general intelligence benchmark arrived this week and delivered one of the more humbling data points in recent AI history. Every major frontier model tested — from Google's Gemini to OpenAI's latest — scored below 1%. Ordinary humans with zero training, zero instructions, and zero context scored 100%. That gap is the story, and it lands at a particularly awkward moment for an industry that has been loudly declaring victory.

Jensen Huang Said AGI Is Here. Then the Scores Dropped.

Two days before the benchmark results were published, Nvidia CEO Jensen Huang sat down with Lex Fridman and said, plainly, "I think we've achieved AGI." It's a line that would have been remarkable even a year ago. Now it barely makes headlines; that's how normalized the AGI victory lap has become.

The Jensen Huang AGI moment aged badly in real time. Google's Gemini 3.1 Pro led all tested frontier models with a score of 0.37%. OpenAI's GPT-5.4 came in at 0.26%. Anthropic's Claude Opus 4.6 managed 0.25%. xAI's Grok-4.20 scored exactly zero. These aren't cherry-picked bad results — they're the frontier, the absolute cutting edge of what billions of dollars in compute and research can produce right now.

Sam Altman has said OpenAI has "basically built AGI." Microsoft is already marketing a lab focused on ASI — the thing that supposedly comes after AGI. Arm named its new data center chip the "AGI CPU." The term is being stretched until it means whatever is commercially convenient for whoever is using it that quarter. Chollet's foundation is not interested in playing that game.

I think we've achieved AGI.

— Jensen Huang, CEO of Nvidia, on the Lex Fridman Podcast

What Is ARC-AGI-3 and Why Does It Hit Different?

The ARC-AGI-3 benchmark — built by François Chollet and Mike Knoop's ARC Prize Foundation — is not a trivia test, not a coding exam, not another PhD-level science gauntlet. The Foundation stood up an in-house game studio and built 135 original interactive environments from scratch. An AI agent gets dropped into one of these environments with no instructions, no stated objectives, and no description of the rules. It has to explore, infer the goal, form a plan, and execute — all without a hint of what "success" even looks like.
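
To make that setup concrete, here is a toy sketch of the interaction pattern in Python. None of the names below come from the actual ARC-AGI-3 API; the environment, its methods, and the agent loop are hypothetical stand-ins for "observe raw state, act, and only discover the goal by stumbling into it."

```python
# Illustrative only: a toy stand-in for the kind of no-instructions environment
# described above. The class, methods, and loop are hypothetical, not ARC-AGI-3's API.
import random

class ToyEnvironment:
    """An interactive environment with a hidden goal and no stated rules."""

    def __init__(self, goal_cell=(2, 3), grid_size=5):
        self.goal_cell = goal_cell      # never revealed to the agent
        self.grid_size = grid_size
        self.agent_cell = (0, 0)

    def observe(self):
        """Return raw state only: no objective, no rules, no hint of what success means."""
        return {"agent": self.agent_cell, "grid_size": self.grid_size}

    def step(self, action):
        """Apply a move and report whether the hidden goal cell was reached."""
        dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[action]
        x, y = self.agent_cell
        self.agent_cell = (
            min(max(x + dx, 0), self.grid_size - 1),
            min(max(y + dy, 0), self.grid_size - 1),
        )
        return self.agent_cell == self.goal_cell


def explore(env, max_actions=200):
    """Blind exploration: the agent must work out what counts as success by acting."""
    print("first observation:", env.observe())  # raw state, no objective attached
    for actions_taken in range(1, max_actions + 1):
        action = random.choice(["up", "down", "left", "right"])  # no prior knowledge
        if env.step(action):
            return actions_taken        # success is only recognized by reaching it
    return None                         # never solved within the action budget


print("actions to solve:", explore(ToyEnvironment()))
```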

If that sounds trivially easy, you're making the foundation's point. Any five-year-old navigates novel situations like this constantly. Current frontier models cannot. The foundation is offering a public-facing version of the test so anyone can play the same environments the AI agents faced. Try one. Within a few seconds, you understand what's being tested — and why the "G" in AGI matters so much more than the hype suggests.

Previous ARC versions started hard and got solved. ARC-AGI-1, introduced in 2019, eventually fell to test-time training and reasoning models. ARC-AGI-2 lasted about a year before Gemini 3.1 Pro hit 77.1%. Each time, the labs threw compute and targeted training at the benchmark until it was dead. Version 3 was designed specifically to prevent that cycle: 110 of the 135 environments are kept private — 55 semi-private for API testing, 55 fully locked for competition. You cannot memorize your way through novel game logic you've never seen.

How RHAE Scoring Punishes AI Inefficiency

The scoring methodology is worth understanding. ARC-AGI-3 uses what the Foundation calls RHAE — Relative Human Action Efficiency — which sets the baseline at the second-best, first-run human performance. An AI that takes ten times more actions than a human to complete a level scores 1% for that environment, not 10%. The formula squares the penalty for inefficiency. Wandering, backtracking, guessing your way to a solution — all of that gets punished hard.
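
Read as a formula, that description implies a squared efficiency ratio per environment. The sketch below follows the article's own worked example rather than the Foundation's published formula, so the function names and the aggregation rule are assumptions.

```python
# A minimal sketch of the RHAE idea as described above, not the Foundation's exact
# formula: per-environment score = (human baseline actions / agent actions)^2,
# capped at 100%. Ten times the human's actions lands at (1/10)^2 = 1%.

def rhae_environment_score(human_baseline_actions: int, agent_actions: int | None) -> float:
    """Relative Human Action Efficiency for one environment, as a percentage."""
    if agent_actions is None:
        return 0.0  # environment never completed
    ratio = min(human_baseline_actions / agent_actions, 1.0)
    return ratio ** 2 * 100  # squaring amplifies the inefficiency penalty


def rhae_overall(per_environment_scores: list[float]) -> float:
    """Overall score as a plain mean over environments (aggregation rule assumed)."""
    return sum(per_environment_scores) / len(per_environment_scores)


# The article's worked example: ten times more actions than the human baseline.
print(rhae_environment_score(human_baseline_actions=20, agent_actions=200))  # 1.0 (%)
```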

Under this framework, the best AI agent tested during the month-long developer preview scored 12.58%. That came from a custom-built harness, not a raw API call. Frontier models tested directly through the ARC-AGI-3 API, with no custom tooling, couldn't break 1%.

There is one legitimate methodological debate worth flagging. ARC's official API feeds agents JSON-encoded frames rather than visual input. A Duke-built custom harness pushed Claude Opus 4.6 from its official 0.25% to 97.1% on a single environment variant called TR87 — a result that clearly doesn't represent its overall score but does raise questions about format sensitivity. The Foundation's published paper addresses this directly.

Frame content perception and API format are not limiting factors for frontier model performance on ARC-AGI-3.

— ARC Prize Foundation, ARC-AGI-3 Technical Report
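
For readers unfamiliar with what "format" means in that debate, here is a purely hypothetical illustration. It does not reproduce the real ARC-AGI-3 payload schema; it only shows the general difference between a JSON-encoded frame and the grid a human sees rendered on screen.

```python
# Purely hypothetical frame payload; the real ARC-AGI-3 schema is not shown here.
import json

frame = {"grid": [[0, 0, 1],
                  [0, 2, 1],
                  [0, 0, 1]]}

print(json.dumps(frame))                 # the textual form a JSON-fed agent parses

glyphs = {0: ".", 1: "#", 2: "@"}        # roughly what a human sees rendered instead
for row in frame["grid"]:
    print("".join(glyphs[cell] for cell in row))
```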

What Does This Mean for AGI Timelines?

The short answer: nobody actually knows, and the people who sound most confident are usually selling something. "There's a bunch of different definitions," Malo Bourgon, CEO of the Machine Intelligence Research Institute, said in a statement — which is the polite way of saying the industry has never agreed on what AGI even means, let alone when it arrives.

Chollet's framing is blunter. If an untrained human can walk in cold and solve every environment, and your multi-billion-dollar model scores less than a third of one percent, then you don't have general intelligence. You have a very expensive autocomplete that requires extensive scaffolding to function. The gap between 0.37% and 100% isn't a gap you close with one more training run.

ARC Prize 2026 is offering $2 million spread across three competition tracks on Kaggle, with one strict rule: every winning solution must be open-sourced. The competition clock is running. The machines are not close. Whether the executives announcing AGI have actually looked at the scores — well, that's a different question entirely.

Frequently Asked Questions

What is ARC-AGI-3?

ARC-AGI-3 is an artificial general intelligence benchmark released by the ARC Prize Foundation in March 2026. It consists of 135 interactive game environments where AI agents must explore, infer goals, and execute plans with zero instructions. Scoring uses RHAE — Relative Human Action Efficiency — which penalizes inefficiency compared to human performance.

How did AI models score on ARC-AGI-3?

Every frontier model scored below 1%. Google's Gemini 3.1 Pro led at 0.37%, OpenAI's GPT-5.4 scored 0.26%, Anthropic's Claude Opus 4.6 scored 0.25%, and xAI's Grok-4.20 scored zero. Untrained humans solved all 135 environments, scoring 100% by comparison.

Why can't AI models be trained to beat ARC-AGI-3?

110 of the 135 environments are kept private — 55 semi-private and 55 fully locked for competition. Without access to the environments, labs cannot specifically train models to solve them. The benchmark was designed to prevent the training-and-saturation pattern that killed ARC-AGI-1 and ARC-AGI-2.

What is the ARC Prize 2026 competition?

ARC Prize 2026 offers $2 million across three competition tracks hosted on Kaggle. All winning solutions must be open-sourced. The competition tests whether AI systems can approach human-level performance on the ARC-AGI-3 benchmark environments, with the clock already running as of March 2026.