Press Releases · March 31, 2026

Microsoft Made GPT and Claude Work Together

Microsoft Copilot Researcher Critique pairs GPT and Claude in sequence, scoring 57.4 on the DRACO benchmark, the top result among AI research tools as of March 2026.

What to Know

  • Microsoft Copilot Researcher Critique pairs GPT and Claude in sequence — one drafts, the other reviews — before the report reaches you
  • On the DRACO benchmark, Critique scored 57.4, beating Claude Opus 4.6 alone (42.7) by 14.7 points
  • A second feature called Council runs both models simultaneously and uses a third AI judge to compare their outputs
  • Both features require a $30/user/month Microsoft 365 Copilot license plus enrollment in the Frontier early-access program

Microsoft Copilot Researcher Critique is the company's answer to the AI research arms race — and it doesn't pick a side. Announced on Monday, March 30, the feature pairs OpenAI's GPT and Anthropic's Claude on the same task in sequence, with one model drafting and the other playing skeptic. The pitch is simple: stop trusting a single AI to grade its own homework.

Why Single-Model Research Has a Problem

Every major AI lab has spent the past year and a half convincing you that its model is the best researcher in the room. Google launched its Gemini research agent in December 2024. OpenAI followed with its own in February 2025. Perplexity doubled down. Anthropic built a devoted professional following and rolled out its own research agent in April 2025. The pitch was always the same: one model, better than the rest.

Microsoft's argument on March 30 was that the whole framing is wrong. The problem isn't which model you pick — it's that any single model is doing everything alone. It plans the research, pulls the sources, writes the draft, and hands you the result with no second opinion. That's how hallucinations sneak through. That's how citations go unchecked. That's how confident-sounding nonsense ends up in your deliverables.

The Microsoft Copilot Researcher Critique feature breaks that single-model loop by introducing a strict division of labor between GPT and Claude — two systems that don't share the same training, architecture, or blind spots.

How Critique Actually Works

The workflow is cleaner than you'd expect. GPT handles phase one: planning the research query, scouring sources, and producing an initial draft. Then Claude comes in as editor, not co-author but dedicated critic, tasked with checking factual accuracy, citation quality, and whether the report actually answered the question asked. Only after that review does anything reach the user.

Microsoft says the roles can eventually run in reverse, with Claude drafting and GPT critiquing, though for now GPT leads the generation phase.

The company's own description puts it plainly:

Critique is a new multi-model deep research system designed for complex research tasks. It separates generation from evaluation and utilizes a combination of models from Frontier labs, including Anthropic and OpenAI. One model leads the generation phase, planning the task, iterating through retrieval, and producing an initial draft, while a second model focuses on review and refinement, acting as an expert reviewer before the final report is produced.

— Microsoft, official announcement
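
Microsoft hasn't published implementation details beyond that description, but the pattern it describes is a familiar one in model orchestration. Below is a minimal Python sketch of the same draft-then-review loop. The `drafter` and `reviewer` callables are hypothetical stand-ins for whatever SDK calls reach GPT and Claude, and having the drafter apply the revisions is one plausible reading of "review and refinement"; none of this is Microsoft's actual code.

```python
from typing import Callable

# Placeholder type: any function mapping a prompt to model output text.
# In practice these would wrap the OpenAI and Anthropic SDKs; nothing
# here is a real Copilot API.
ModelFn = Callable[[str], str]

def research_with_critique(question: str,
                           drafter: ModelFn,
                           reviewer: ModelFn) -> str:
    """Sequential draft-then-review pipeline, per Microsoft's description."""
    # Phase 1: the lead model plans the task and produces an initial draft.
    draft = drafter(
        "Plan and research the following question, then write a report "
        f"with citations:\n{question}"
    )
    # Phase 2: a *different* model acts as dedicated critic, checking
    # factual accuracy, citation quality, and whether the question was
    # actually answered.
    critique = reviewer(
        "You are an expert reviewer. List factual errors, weak citations, "
        "and any way this report fails to answer the question.\n\n"
        f"Question: {question}\n\nReport:\n{draft}"
    )
    # Phase 3: revise against the critique before anything reaches the user.
    return drafter(
        "Revise the report to address every point in the review.\n\n"
        f"Report:\n{draft}\n\nReview:\n{critique}"
    )
```

The point of the structure is that the critic is a genuinely different model, so the draft is never graded by the system that wrote it.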

What Does the DRACO Benchmark Actually Prove?

Benchmark scores in AI are easy to dismiss — companies routinely cherry-pick the tests that flatter them. But the DRACO benchmark is at least a standardized one: 100 complex research tasks across 10 domains including medicine, law, and technology, designed to stress-test deep research capabilities rather than raw language fluency.

Copilot with Critique scored 57.4. Claude Opus 4.6 running solo hit 42.7. That's a gap of 14.7 percentage points, and this isn't Microsoft comparing itself to some aging model: that's the current best-in-class Anthropic offering on the same task set. The biggest gains showed up in breadth of analysis and presentation quality, with factual accuracy also posting a meaningful improvement.

The cynical read is actually the interesting one here. Microsoft doesn't need to have the best model. It just needs to orchestrate the best models better than anyone else. A 14.7-point lead on a task-specific benchmark that directly tests what Critique was built to do is harder to wave away than most marketing claims in AI. That said, Microsoft wrote the press release, not an independent lab.

Council, Pricing, and What Microsoft Is Really Betting On

Critique is the collaborative version — GPT and Claude working in sequence. Council is the adversarial one. Instead of sequential review, Council runs both models on the same task simultaneously and generates two complete, independent reports. A third judge model then reads both outputs and writes a summary of where they agreed, where they diverged, and what unique angles each one caught that the other missed.

That kind of manual comparison has always been possible if you wanted to paste the same prompt into ChatGPT and Claude.ai and spend twenty minutes doing the diff yourself. Council just automates the grunt work.
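
The automated version is easy to picture. Here's a hedged sketch in the same spirit as the one above, again with placeholder model functions rather than any real Copilot API: both reports are generated in parallel, and only then does the judge read them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

ModelFn = Callable[[str], str]  # same placeholder type as the earlier sketch

def research_with_council(question: str,
                          model_a: ModelFn,
                          model_b: ModelFn,
                          judge: ModelFn) -> dict:
    """Run two models on the same task at once, then have a third model
    diff the finished reports: the pattern Council describes."""
    prompt = f"Research the following question and write a cited report:\n{question}"
    # Both models work simultaneously and independently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(model_a, prompt)
        future_b = pool.submit(model_b, prompt)
        report_a, report_b = future_a.result(), future_b.result()
    # The judge writes no report of its own; it only compares the two.
    comparison = judge(
        "Compare these two research reports. Summarize where they agree, "
        "where they diverge, and what unique angles each catches that the "
        f"other misses.\n\nReport A:\n{report_a}\n\nReport B:\n{report_b}"
    )
    return {"report_a": report_a, "report_b": report_b, "comparison": comparison}
```

Running the generators concurrently matters because Council's cost is already two full reports; there's no reason to also pay for them in wall-clock sequence.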

Critique is the default mode inside Copilot Researcher. Council requires manually selecting 'Model Council' from the picker — it's opt-in, which makes sense given it generates two full reports. Both features currently require a Microsoft 365 Copilot license at $30 per user per month, plus enrollment in Microsoft's Frontier early-access program. No general availability date has been announced.

OpenAI and Microsoft have a multibillion-dollar partnership. Microsoft is also one of Anthropic's largest enterprise customers. By building a layer that uses both and treats neither as permanent, Microsoft is sending a message to every AI lab: the orchestration platform matters more than any single model. No one stays on top for long, and Microsoft is betting the house on that remaining true.

Frequently Asked Questions

What is Microsoft Copilot Researcher Critique?

Critique is a multi-model research feature inside Microsoft 365 Copilot that uses GPT and Claude in sequence. GPT handles drafting and source retrieval; Claude reviews the draft for accuracy and citation quality before the report is delivered to the user. It was announced on March 30, 2026.

How did Critique score on the DRACO benchmark?

Copilot with Critique scored 57.4 on the DRACO benchmark, which tests 100 complex research tasks across 10 domains. Claude Opus 4.6 alone scored 42.7 on the same test, a gap of 14.7 percentage points. The biggest gains came in breadth of analysis and presentation quality.

What is the difference between Critique and Council?

Critique runs GPT and Claude in sequence — one drafts, the other reviews. Council runs both models simultaneously on the same task and uses a third judge model to compare outputs side by side. Critique is the default; Council must be manually selected from the Copilot Researcher model picker.

How much does Copilot Researcher Critique cost?

Access requires a Microsoft 365 Copilot license at $30 per user per month, plus enrollment in Microsoft's Frontier early-access program. Both Critique and Council are currently limited to Frontier users, and no general availability date has been announced as of March 2026.