Claude Opus 4.7 Review: Benchmarks, Hidden Costs & The Controversy

By AI Workflows Team · April 17, 2026 · 13 min read

Claude Opus 4.7 dominates SWE-bench Pro at 64.3%, but the new tokenizer hides a cost increase of up to 35%, and breaking API changes will require updates to any Opus 4.6 integration.

TL;DR: Claude Opus 4.7 is a genuine upgrade over 4.6 — it dominates SWE-bench Pro (64.3% vs. GPT-5.4's 57.7%), adds high-resolution vision up to 3.75 megapixels, and introduces better instruction-following for agentic workflows. But the story isn't all clean: the new tokenizer quietly raises real-world costs by up to 35%, several core API parameters have been removed in breaking changes, and a vocal subset of power users is reporting quality regressions on non-coding tasks. And then there's the elephant in the room — Anthropic keeps showing off Mythos Preview, a model that makes Opus 4.7 look like second place, while refusing to release it. This review gives you the full picture.


What Is Claude Opus 4.7? {#what-is-claude-opus-4-7}

Claude Opus 4.7 was officially released on April 16, 2026, and is now generally available across Claude.ai, the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Anthropic describes it as "the most capable generally available model to date" — a claim that holds up on software engineering benchmarks, but deserves scrutiny everywhere else.

Opus 4.7 sits at the top of the Claude 4.x family, directly succeeding Opus 4.6. The headline improvements are focused on three areas: software engineering performance, high-resolution vision, and more precise instruction-following. It also introduces new developer controls while simultaneously removing previously supported API parameters — a combination that will require code changes for any team running Opus 4.6 in production.

This isn't a minor increment. The coding benchmark jump is real, the breaking API changes are real, and the controversy around regression reports is also real. Here's what matters.


What's New in Claude Opus 4.7 {#whats-new}

1. High-Resolution Vision: 3.75 Megapixels {#vision}

The single most concrete improvement in Opus 4.7 is vision resolution. Previous Claude models capped image processing at 1,568 pixels on the long edge (~1.15 megapixels). Opus 4.7 now accepts images up to 2,576 pixels on the long edge (~3.75 megapixels) — more than 3× the previous ceiling.

What changes in practice:

  • Reading fine print and small UI text in screenshots (a frequent failure mode before)
  • Parsing dense technical diagrams, circuit schematics, chemical structures
  • Reliable analysis of high-resolution photographs and satellite imagery
  • Improved bounding-box detection and image localization for computer-use agents

Perhaps more importantly, coordinates in Opus 4.7 are now 1:1 with actual pixels — no scale-factor math required. This removes a common class of bugs in vision-based agentic pipelines.
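A minimal sketch of why 1:1 coordinates matter, assuming the previous behavior of downscaling any image whose long edge exceeded the cap (the cap values come from this review; the helper names are hypothetical):

```python
def long_edge_scale(width: int, height: int, cap: int) -> float:
    """Scale factor applied when an image exceeds the long-edge cap."""
    long_edge = max(width, height)
    return min(1.0, cap / long_edge)

def to_model_coords(x: int, y: int, width: int, height: int, cap: int) -> tuple:
    """Map original-pixel coordinates into the model's (possibly downscaled) space."""
    s = long_edge_scale(width, height, cap)
    return (round(x * s), round(y * s))

# A 2560x1440 screenshot exceeded the old ~1,568-px cap, so every click
# target had to be remapped through the scale factor:
print(to_model_coords(1000, 500, 2560, 1440, 1568))
# Under the new 2,576-px cap the same image fits, so coordinates are 1:1:
print(to_model_coords(1000, 500, 2560, 1440, 2576))  # (1000, 500)
```

The second call is the whole point: when no downscaling occurs, the remapping step (and the bugs it attracts) disappears.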

2. xhigh Effort Level {#xhigh}

The Claude 4.x effort controls let developers tune the intelligence-vs-latency tradeoff. Opus 4.7 adds a new xhigh level, slotting between the existing high and max. This gives more granular control for workflows that need more reasoning than high allows but don't want the full cost and latency of max.

Anthropic explicitly recommends xhigh for coding and agentic tasks — and given the benchmark numbers, this is well-calibrated advice.

3. Task Budgets (Public Beta) {#task-budgets}

Task budgets are a new mechanism for guiding how many tokens Claude spends in a full agentic loop. You pass a rough estimate; Claude sees a running countdown and uses it to prioritize work — analogous to telling a contractor how many hours they have rather than giving them an open-ended mandate.

  • Minimum budget: 20,000 tokens
  • Requires beta header: task-budgets-2026-03-13
  • Critical distinction: Task budgets are a suggestion, not a hard cap. They do not replace max_tokens.

For production agentic systems where token runaway is a real operational concern, this is a meaningful addition. It's worth distinguishing: you're giving Claude a planning signal, not a ceiling.
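A hedged sketch of what a budgeted request might look like, using the task_budget field and beta header named above. The anthropic-beta header key, the helper, and the exact body shape are assumptions for illustration, not confirmed API details:

```python
MIN_TASK_BUDGET = 20_000  # minimum budget noted above

def build_budgeted_request(prompt: str, task_budget: int, max_tokens: int):
    """Assemble a request body and headers for the task-budget beta (sketch)."""
    if task_budget < MIN_TASK_BUDGET:
        raise ValueError(f"task budget must be at least {MIN_TASK_BUDGET} tokens")
    headers = {"anthropic-beta": "task-budgets-2026-03-13"}  # beta header from this review
    body = {
        "model": "claude-opus-4-7-20260416",
        "max_tokens": max_tokens,        # still the hard cap
        "task_budget": task_budget,      # planning signal, not a ceiling
        "messages": [{"role": "user", "content": prompt}],
    }
    return body, headers
```

Note that max_tokens stays in the request: the budget guides prioritization, while max_tokens remains the enforced limit.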

4. /ultrareview in Claude Code {#ultrareview}

Claude Code users get a new /ultrareview command for dedicated high-effort review sessions, plus Auto mode for Max plan subscribers. For teams relying on Claude Code as their primary AI coding assistant, /ultrareview adds a distinct high-investment review mode separate from regular code review — useful for final passes before production deploys or major refactors.

5. Better File-System Memory {#memory}

Opus 4.7 shows measurable improvements in writing to and reading from file-system-based memory stores — scratchpad files, structured note files, persistent memory repositories. For multi-session agentic workflows, this means more reliable context retention across conversations, a practical improvement for anyone running long-running coding agents or research pipelines.


Benchmark Performance: Where It Leads, Where It Doesn't {#benchmarks}

Software Engineering — A Real Leap {#swe-benchmarks}

The SWE-bench improvement is the headline, and it's legitimate:

| Benchmark | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Pro | 53.4% | 64.3% | 57.7% | 54.2% |
| SWE-bench Verified | 80.8% | 87.6% | 80.6% | — |
| CursorBench | 58% | 70% | — | — |
| SWE-bench Multilingual | 77.8% | 80.5% | — | — |
| Terminal-Bench | — | 69.4% | — | — |

The SWE-bench Pro jump from 53.4% to 64.3% represents nearly 11 percentage points on a benchmark that measures real-world software engineering tasks — not toy problems. Anthropic's internal 93-task coding benchmark shows a 13% improvement, resolving tasks that were previously impossible for both Opus 4.6 and Sonnet 4.6.

For teams building software engineering agents, this gap over GPT-5.4 (57.7%) is meaningful and likely translates to real productivity differences on complex codebases.

Vision and Computer Use {#vision-benchmarks}

| Benchmark | Opus 4.6 | Opus 4.7 | GPT-5.4 |
|---|---|---|---|
| OSWorld-Verified | 72.7% | 78.0% | 75.0% |
| Visual Acuity | 54.5% | 98.5% | — |
| MCP-Atlas | 75.8% | 77.3% | 68.1% |

The visual acuity jump from 54.5% to 98.5% is the most dramatic single-benchmark improvement. This directly corresponds to the resolution upgrade — tasks involving fine-grained visual parsing that Opus 4.6 consistently failed on are now within reach.

Reasoning — Basically Tied at the Frontier {#reasoning-benchmarks}

| Benchmark | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond | 94.2% | 94.4% | 94.3% |

Frontier models have essentially saturated academic reasoning benchmarks. A 0.2-percentage-point difference on GPQA Diamond is noise, not signal.

Where Opus 4.7 Trails {#weak-benchmarks}

On web research tasks, Opus 4.7 falls behind its main competitors:

| Benchmark | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| BrowseComp | 79.3% | 89.3% | 85.9% |

A ~10-point gap on BrowseComp is meaningful if your workflows depend on deep web research, competitive monitoring, or information retrieval from live sources. For those use cases, GPT-5.4 currently holds a real advantage.

The Unreported Regression {#regression-benchmark}

Here's the number that doesn't appear in Anthropic's launch announcement: the Thematic Generalization Benchmark dropped to 72.8% in Opus 4.7, down from 80.6% in Opus 4.6. This measures general language understanding and knowledge transfer — the "general intelligence" dimension that correlates with quality on creative writing, analytical work, and open-ended reasoning.

This isn't just a fringe metric. It gives quantitative backing to what many users are reporting anecdotally. The pattern suggests Opus 4.7 may have been optimized specifically for software engineering at some cost to general-purpose language capabilities — a trade-off worth understanding before committing to a migration.


The Hidden Cost Problem: "Unchanged Pricing" Is Misleading {#pricing}

Anthropic's pricing page shows Opus 4.7 at $5 per million input tokens / $25 per million output tokens — identical to Opus 4.6. This is technically accurate. It is not the full story.

The New Tokenizer Problem {#tokenizer}

Opus 4.7 ships with a new tokenizer that produces 1.0–1.35× as many tokens for the same input text. A prompt that consumed 10,000 tokens on Opus 4.6 may consume up to 13,500 tokens on Opus 4.7 — at the same headline price.

Real-world impact:

  • Input-heavy workloads: 0–35% cost increase from tokenizer inflation alone
  • Output-heavy workloads: Output tokens are priced at 5× input tokens, so any increase compounds
  • Higher effort levels: xhigh and max generate more output tokens, multiplying the effect

A production workflow spending $1,000/month on Opus 4.6 could spend up to $1,350/month on Opus 4.7 without any changes to task complexity or volume. This is a material budget impact for teams running at scale.
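To make the arithmetic concrete, here is a small estimator. The prices are the per-million-token figures from the pricing section; the 100M/20M token split is an illustrative workload, not data from any real deployment:

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_price: float = 5.0, output_price: float = 25.0,
                 inflation: float = 1.0) -> float:
    """Estimated monthly spend in dollars; prices are per million tokens."""
    tokens_in = input_tokens * inflation
    tokens_out = output_tokens * inflation
    return (tokens_in / 1e6) * input_price + (tokens_out / 1e6) * output_price

baseline = monthly_cost(100_000_000, 20_000_000)                   # Opus 4.6 tokenizer
worst_case = monthly_cost(100_000_000, 20_000_000, inflation=1.35)  # Opus 4.7, worst case
print(baseline, worst_case)  # roughly $1,000 vs $1,350
```

Identical workload, identical headline prices — the entire difference is the inflation factor.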

What Can Offset the Increase {#cost-offsets}

Anthropic offers two mechanisms that can meaningfully offset the tokenizer inflation:

  • Prompt caching: Up to 90% savings on repeated context — high-value if your system prompts are long and stable
  • Batch processing: Up to 50% savings on non-latency-sensitive workloads

If you're already aggressively using both, the net impact may be minimal. If you're not — building caching into your architecture is now more urgent, not just an optimization.
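A rough sketch of how caching changes the math, assuming the up-to-90% discount applies to the cached share of input tokens. This is a simplification of real cache pricing (which distinguishes cache writes from cache reads), so treat it as directional only:

```python
def effective_input_cost(input_tokens: int, cached_fraction: float,
                         price: float = 5.0, cache_discount: float = 0.90) -> float:
    """Input cost in dollars with a fraction of tokens served from the prompt cache."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    # Cached tokens are billed at (1 - discount) of the normal per-million price
    return (fresh * price + cached * price * (1 - cache_discount)) / 1e6

# 100M input tokens inflated 35% by the new tokenizer, but 80% served from cache:
inflated = int(100_000_000 * 1.35)
print(effective_input_cost(inflated, 0.80))
```

Under these assumptions the cached workload comes in well below the uncached Opus 4.6 baseline ($500 of input spend in the earlier example), which is why caching strategy now matters more than the tokenizer change itself.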

Bottom line on costs: "Same pricing" is a press-release statement. Budget for 15–25% higher actual costs when migrating from Opus 4.6, and audit your caching strategy before deployment.


Breaking Changes: What Will Break Your Integration {#breaking-changes}

Opus 4.7 introduces breaking API changes that are unusual in scope for an incremental release. Any production code targeting Opus 4.6 will require explicit updates before pointing at Opus 4.7.

Extended Thinking Budgets — Removed {#thinking-removed}

Previously supported syntax:

"thinking": {"type": "enabled", "budget_tokens": 4000}

In Opus 4.7, this returns a 400 error. Adaptive thinking is the only available thinking mode, and it is off by default.

Migration path: Replace it with "thinking": {"type": "adaptive"} wherever you want reasoning enabled.

Sampling Parameters — Removed {#sampling-removed}

Setting temperature, top_p, or top_k to non-default values now returns a 400 error. These parameters are no longer supported.

Migration path: Remove these parameters from all API calls entirely. Any code that relied on temperature for determinism or creative variance will need rearchitecting.
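The migration rules above can be sketched as a payload rewrite. This helper is illustrative, not an official migration tool — adapt it to however your application builds request bodies:

```python
REMOVED_PARAMS = {"temperature", "top_p", "top_k"}  # now return 400 on Opus 4.7

def migrate_request(payload: dict) -> dict:
    """Rewrite an Opus 4.6 request body for Opus 4.7 per the rules above (sketch)."""
    out = {k: v for k, v in payload.items() if k not in REMOVED_PARAMS}
    thinking = out.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") == "enabled":
        # budget_tokens is gone; adaptive is the only remaining thinking mode
        out["thinking"] = {"type": "adaptive"}
    return out

old = {
    "model": "claude-opus-4-6",
    "temperature": 0.7,
    "top_p": 0.9,
    "thinking": {"type": "enabled", "budget_tokens": 4000},
    "messages": [{"role": "user", "content": "Refactor this module."}],
}
print(migrate_request(old))
```

Stripping parameters mechanically is the easy half; the hard half is re-testing any behavior that depended on temperature for variance or determinism.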

Thinking Content Omitted by Default {#thinking-display}

Thinking blocks appear in the response stream but the content field is empty unless explicitly requested. To receive reasoning:

"thinking": {"type": "adaptive", "display": "summarized"}

Token Count Estimates Are Now Underestimates {#token-count}

Any token budget logic, rate limiting, or cost projection code calibrated to Opus 4.6's tokenizer will systematically underestimate usage on Opus 4.7 by up to 35%.
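A defensive fix is to scale every 4.6-calibrated estimate by the worst-case inflation factor before comparing it against budgets or rate limits. A sketch:

```python
import math

def adjust_token_estimate(opus46_estimate: int, inflation: float = 1.35) -> int:
    """Scale an Opus 4.6-calibrated token count up to a conservative 4.7 figure."""
    return math.ceil(opus46_estimate * inflation)

print(adjust_token_estimate(10_000))  # 13500
```

Over-provisioning by the full 35% wastes a little headroom on low-inflation content, but it is cheaper than rate-limit surprises in production.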

These are not configuration tweaks — they require code changes and testing. The scope of migration effort scales with how deeply your application relied on thinking budgets and sampling parameters.


Claude Opus 4.7 vs. GPT-5.4 vs. Gemini 3.1 Pro {#model-comparison}

| Category | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Coding (SWE-bench Pro) | 64.3% | 57.7% | 54.2% |
| Web Research (BrowseComp) | 79.3% | 89.3% | 85.9% |
| Reasoning (GPQA Diamond) | 94.2% | 94.4% | 94.3% |
| Computer Use (OSWorld) | 78.0% | 75.0% | — |
| Tool Use (MCP-Atlas) | 77.3% | 68.1% | — |
| Max Image Resolution | 3.75MP | — | — |
| Context Window | 1M tokens | 1M tokens | — |
| Input Price | $5/M | — | — |
| Output Price | $25/M | — | — |

When Claude Opus 4.7 wins:

  • Software engineering agents and coding automation — by a meaningful margin
  • Computer use and GUI automation tasks
  • Tool use and MCP-based agentic workflows
  • High-resolution image processing and vision tasks

When competitors win:

  • Web research and live information retrieval → GPT-5.4's BrowseComp lead is real and consistent
  • General-purpose language tasks → Thematic Generalization regression is a concern
  • General reasoning → Essentially a three-way tie; pick based on other criteria

If your core workflow is code, Opus 4.7 is the current best option. If it's research, GPT-5.4 deserves serious consideration. For more comparison context, see our full GPT-5.4 vs. Claude Opus 4.6 vs. Gemini 3.1 Pro breakdown.


The Elephant in the Room: Claude Mythos Preview {#mythos}

Anthropic's launch materials prominently benchmark Opus 4.7 against Claude Mythos Preview — a model that Opus 4.7 consistently loses to:

| Benchmark | Mythos Preview | Opus 4.7 |
|---|---|---|
| SWE-bench Pro | 77.8% | 64.3% |
| SWE-bench Verified | 93.9% | 87.6% |
| Terminal-Bench | 82.0% | 69.4% |
| GPQA Diamond | 94.6% | 94.2% |
| OSWorld-Verified | 79.6% | 78.0% |

Mythos Preview outperforms Opus 4.7 by roughly 6 to 14 percentage points across the software engineering benchmarks. It is not publicly available.

This creates an unusual dynamic: Anthropic is releasing what it calls "the most capable generally available model" while framing it against a model that makes it look like a step-down option. The implication is that Anthropic is building toward Mythos-class general availability at scale — but no timeline has been communicated.

For detailed context on what Mythos Preview represents, see our article Claude Mythos Preview: Anthropic's Most Powerful AI Can Find Zero-Days for $50.

There's also a safety dimension worth noting: Opus 4.7's cybersecurity capabilities were "differentially reduced" relative to Mythos Preview. Anthropic describes this as an intentional choice — Opus 4.7 has real-time cybersecurity safeguards with refusals on high-risk topics. For legitimate security research, red teams, and penetration testers, this capability reduction is a practical concern that may push those workflows toward alternative models.


Real User Concerns: Is This Actually a Regression? {#user-concerns}

The community criticism is real and predates Opus 4.7. Stella Laurenzo, senior director at AMD, published an analysis of 6,852 Claude Code sessions claiming "Claude has regressed to the point it cannot be trusted to perform complex engineering." GitHub issue trackers show reports like "Critical: Opus 4.6 Configuration Regression — 92/100 → 38/100 Performance Drop."

With Opus 4.7, the core complaints center on:

  1. Confident assertion without tool verification — the model makes assertive factual claims rather than using available tools to check
  2. Instruction drift — explicit verification rules in system prompts are inconsistently followed
  3. Creative and analytical flatness — tasks outside coding feel lower quality, less nuanced

The Thematic Generalization Benchmark drop (80.6% → 72.8%) gives these complaints quantitative footing. This isn't anecdote.

Anthropic's Explanation

Anthropic's position distinguishes between model changes and product/interface changes, arguing that some degradation users experienced reflected changed defaults — specifically, how much effort Claude expends at default settings — rather than architectural regressions.

This is a meaningful and credible distinction. If the default effort level dropped between releases and you're comparing at defaults, you may be comparing different effort expenditure levels, not different models.

If you're experiencing quality degradation with Opus 4.7:

  1. Explicitly set "thinking": {"type": "adaptive"} for complex reasoning tasks
  2. Set effort to xhigh for coding-intensive workflows
  3. Test with and without adaptive thinking on your specific task type
  4. Reinforce critical rules in your system prompt — don't rely on implicit compliance

The model performs materially differently at different effort levels. Many regression reports likely reflect default-setting changes rather than fundamental capability regressions. Test with explicit configuration before drawing conclusions.


Who Should Upgrade to Claude Opus 4.7? {#should-you-upgrade}

Upgrade immediately:

  • Teams building coding agents or automated software engineering pipelines
  • Vision-based agents processing dense technical diagrams or high-resolution images
  • Computer-use and GUI automation workflows
  • Anyone using Claude Code as a primary development tool — the SWE-bench improvement and /ultrareview command are compelling

Upgrade with caution:

  • Cost-sensitive production workloads — budget for tokenizer inflation
  • Any integration using budget_tokens, temperature, top_p, or top_k — these are breaking changes requiring code updates
  • Workloads primarily outside coding — test your specific tasks before committing

Consider deferring:

  • Web research-heavy workflows (GPT-5.4's BrowseComp lead is significant)
  • Tight-budget teams already optimized for Opus 4.6 who aren't doing heavy coding
  • Anyone waiting for Mythos-class capabilities (no timeline available)

For the best AI coding tools ecosystem context, our Best AI Coding Tools 2026 comparison covers how Claude Code with Opus 4.7 sits alongside Cursor, Copilot, and Windsurf.


How to Get Started {#getting-started}

Access Claude Opus 4.7 through:

  • Claude.ai — available in Pro and Max plans
  • Anthropic API — model ID claude-opus-4-7-20260416
  • Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry

Key API configuration changes from Opus 4.6:

# Enable adaptive thinking (required — no longer auto-enabled)
"thinking": {"type": "adaptive"}

# Use xhigh effort for coding and agentic tasks
"effort": "xhigh"

# Task budgets (beta) — requires header: task-budgets-2026-03-13
"task_budget": 50000  # minimum 20,000 tokens

# REMOVE these — will 400 error on Opus 4.7:
# "temperature": 0.7
# "top_p": 0.9
# "thinking": {"type": "enabled", "budget_tokens": 4000}

Non-API users can access Claude directly on Claude.ai; Claude Code provides the IDE-integrated coding experience with /ultrareview and task budget support.
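Pulling the configuration fragments above into a single request body, as a sketch — the field names follow this article's examples, and none of this is a confirmed SDK signature:

```python
def build_opus_47_request(prompt: str) -> dict:
    """Combine the Opus 4.7 configuration changes above into one request body."""
    return {
        "model": "claude-opus-4-7-20260416",
        "max_tokens": 8_192,
        "effort": "xhigh",  # recommended for coding and agentic tasks
        "thinking": {"type": "adaptive", "display": "summarized"},
        "messages": [{"role": "user", "content": prompt}],
        # Deliberately absent: temperature, top_p, top_k, budget_tokens --
        # all of these now return a 400 error on Opus 4.7.
    }

print(build_opus_47_request("Refactor the auth module and add tests."))
```

The useful habit here is constructing bodies through one function: when the next release changes parameters again, there is exactly one place to update.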


FAQ {#faq}

Is Claude Opus 4.7 better than GPT-5.4? {#faq-vs-gpt}

For software engineering tasks: yes, by a meaningful margin (SWE-bench Pro 64.3% vs. 57.7%). For web research and live information retrieval: no, GPT-5.4 leads significantly (BrowseComp 89.3% vs. 79.3%). For general reasoning: essentially tied. The better model depends entirely on your use case — pick by workload, not by headline.

Did Anthropic raise prices for Claude Opus 4.7? {#faq-pricing}

Officially no — headline pricing is $5/M input and $25/M output, unchanged from Opus 4.6. However, the new tokenizer produces 1.0–1.35× as many tokens for the same input text, resulting in a real-world cost increase of 0–35% depending on your content type. "Same price per token" with more tokens consumed per request is not the same as "same cost."

What happened to temperature and top_p support? {#faq-temperature}

Removed. Passing non-default values for temperature, top_p, or top_k now returns a 400 API error in Opus 4.7. Anthropic's position is that these parameters don't improve output quality for the new model architecture. Any code relying on these parameters for deterministic or creative outputs requires migration.

Is Claude Opus 4.7 worth it for creative writing? {#faq-creative}

The Thematic Generalization Benchmark regression (72.8% vs. 80.6% in Opus 4.6) provides quantitative evidence for quality degradation on general language tasks. If creative writing, general analysis, or open-ended reasoning are your primary use cases, test Opus 4.7 against your specific prompts before migrating — don't assume "new version = better everywhere."

What is the xhigh effort level? {#faq-xhigh}

xhigh is a new effort setting between high and max in Opus 4.7. It directs Claude to apply more reasoning than high allows but less than max requires, offering a cost-efficiency middle ground for tasks where you need elevated performance without paying for maximum inference. Anthropic recommends it for coding and agentic tasks.

When will Claude Mythos Preview be publicly available? {#faq-mythos}

Anthropic has not provided a timeline. Mythos Preview is not currently available for general release. Anthropic has stated the goal of eventually deploying Mythos-class models at scale, but no dates or conditions have been communicated.


Final Verdict

Claude Opus 4.7 is the right choice if your work centers on software engineering. The SWE-bench Pro improvement from 53.4% to 64.3% is not benchmark theater — it represents real-world gains on complex coding tasks that Opus 4.6 couldn't solve. The vision resolution upgrade and 1:1 pixel coordinates quietly fix a class of computer-use bugs that has frustrated developers for months.

But the honest assessment requires acknowledging the rest: the tokenizer inflation that hides behind "unchanged pricing," the breaking API changes that will require non-trivial migration work, the Thematic Generalization regression that validates user complaints about quality decline outside coding, and the persistent shadow cast by Mythos Preview — a model Anthropic keeps demonstrating but won't ship.

Upgrade for coding agents. Test carefully for everything else. And update your cost projections before you flip the switch.
