Claude Opus 4.7 Review: Benchmarks, Hidden Costs & The Controversy

By AI Workflows Team · April 17, 2026 · 13 min read

Claude Opus 4.7 dominates SWE-bench Pro at 64.3%, but the new tokenizer hides a cost increase of up to 35%, and breaking API changes will require updates to any Opus 4.6 integration.

TL;DR: Claude Opus 4.7 is a genuine upgrade over 4.6 — it dominates SWE-bench Pro (64.3% vs. GPT-5.4's 57.7%), adds high-resolution vision up to 3.75 megapixels, and introduces better instruction-following for agentic workflows. But the story isn't all clean: the new tokenizer quietly raises real-world costs by up to 35%, several core API parameters have been removed in breaking changes, and a vocal subset of power users is reporting quality regressions on non-coding tasks. And then there's the elephant in the room — Anthropic keeps showing off Mythos Preview, a model that makes Opus 4.7 look like second place, while refusing to release it. This review gives you the full picture.


What Is Claude Opus 4.7? {#what-is-claude-opus-4-7}

Claude Opus 4.7 was officially released on April 16, 2026, and is now generally available across Claude.ai, the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Anthropic describes it as "the most capable generally available model to date" — a claim that holds up on software engineering benchmarks, but deserves scrutiny everywhere else.

Opus 4.7 sits at the top of the Claude 4.x family, directly succeeding Opus 4.6. The headline improvements are focused on three areas: software engineering performance, high-resolution vision, and more precise instruction-following. It also introduces new developer controls while simultaneously removing previously supported API parameters — a combination that will require code changes for any team running Opus 4.6 in production.

This isn't a minor increment. The coding benchmark jump is real, the breaking API changes are real, and the controversy around regression reports is also real. Here's what matters.


What's New in Claude Opus 4.7 {#whats-new}

1. High-Resolution Vision: 3.75 Megapixels {#vision}

The single most concrete improvement in Opus 4.7 is vision resolution. Previous Claude models capped image processing at 1,568 pixels on the long edge (~1.15 megapixels). Opus 4.7 now accepts images up to 2,576 pixels on the long edge (~3.75 megapixels) — more than 3× the previous ceiling.

What changes in practice:

  • Reading fine print and small UI text in screenshots (a frequent failure mode before)
  • Parsing dense technical diagrams, circuit schematics, chemical structures
  • Reliable analysis of high-resolution photographs and satellite imagery
  • Improved bounding-box detection and image localization for computer-use agents

Perhaps more importantly, coordinates in Opus 4.7 are now 1:1 with actual pixels — no scale-factor math required. This removes a common class of bugs in vision-based agentic pipelines.
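A minimal sketch of why 1:1 coordinates matter, assuming the previous behavior of downscaling any image whose long edge exceeded the cap (the cap values come from this review; the helper names are hypothetical):

```python
def long_edge_scale(width: int, height: int, cap: int) -> float:
    """Scale factor applied when an image exceeds the long-edge cap."""
    long_edge = max(width, height)
    return min(1.0, cap / long_edge)

def to_model_coords(x: int, y: int, width: int, height: int, cap: int) -> tuple:
    """Map original-pixel coordinates into the model's (possibly downscaled) space."""
    s = long_edge_scale(width, height, cap)
    return (round(x * s), round(y * s))

# A 2560x1440 screenshot exceeded the old ~1,568-px cap, so every click
# target had to be remapped through the scale factor:
print(to_model_coords(1000, 500, 2560, 1440, 1568))
# Under the new 2,576-px cap the same image fits, so coordinates are 1:1:
print(to_model_coords(1000, 500, 2560, 1440, 2576))  # (1000, 500)
```

The second call is the whole point: when no downscaling occurs, the remapping step (and the bugs it attracts) disappears.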

2. xhigh Effort Level {#xhigh}

The Claude 4.x effort controls let developers tune the intelligence-vs-latency tradeoff. Opus 4.7 adds a new xhigh level, slotting between the existing high and max. This gives more granular control for workflows that need more reasoning than high allows but don't want the full cost and latency of max.

Anthropic explicitly recommends xhigh for coding and agentic tasks — and given the benchmark numbers, this is well-calibrated advice.

3. Task Budgets (Public Beta) {#task-budgets}

Task budgets are a new mechanism for guiding how many tokens Claude spends in a full agentic loop. You pass a rough estimate; Claude sees a running countdown and uses it to prioritize work — analogous to telling a contractor how many hours they have rather than giving them an open-ended mandate.

  • Minimum budget: 20,000 tokens
  • Requires beta header: task-budgets-2026-03-13
  • Critical distinction: Task budgets are a suggestion, not a hard cap. They do not replace max_tokens.

For production agentic systems where token runaway is a real operational concern, this is a meaningful addition. It's worth distinguishing: you're giving Claude a planning signal, not a ceiling.
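A hedged sketch of what a budgeted request might look like, using the task_budget field and beta header named above. The anthropic-beta header key, the helper, and the exact body shape are assumptions for illustration, not confirmed API details:

```python
MIN_TASK_BUDGET = 20_000  # minimum budget noted above

def build_budgeted_request(prompt: str, task_budget: int, max_tokens: int):
    """Assemble a request body and headers for the task-budget beta (sketch)."""
    if task_budget < MIN_TASK_BUDGET:
        raise ValueError(f"task budget must be at least {MIN_TASK_BUDGET} tokens")
    headers = {"anthropic-beta": "task-budgets-2026-03-13"}  # beta header from this review
    body = {
        "model": "claude-opus-4-7-20260416",
        "max_tokens": max_tokens,        # still the hard cap
        "task_budget": task_budget,      # planning signal, not a ceiling
        "messages": [{"role": "user", "content": prompt}],
    }
    return body, headers
```

Note that max_tokens stays in the request: the budget guides prioritization, while max_tokens remains the enforced limit.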

4. /ultrareview in Claude Code {#ultrareview}

Claude Code users get a new /ultrareview command for dedicated high-effort review sessions, plus Auto mode for Max plan subscribers. For teams relying on Claude Code as their primary AI coding assistant, /ultrareview adds a distinct high-investment review mode separate from regular code review — useful for final passes before production deploys or major refactors.

5. Better File-System Memory {#memory}

Opus 4.7 shows measurable improvements in writing to and reading from file-system-based memory stores — scratchpad files, structured note files, persistent memory repositories. For multi-session agentic workflows, this means more reliable context retention across conversations, a practical improvement for anyone running long-running coding agents or research pipelines.


Benchmark Performance: Where It Leads, Where It Doesn't {#benchmarks}

Software Engineering — A Real Leap {#swe-benchmarks}

The SWE-bench improvement is the headline, and it's legitimate:

| Benchmark | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Pro | 53.4% | 64.3% | 57.7% | 54.2% |
| SWE-bench Verified | 80.8% | 87.6% | 80.6% | — |
| CursorBench | 58% | 70% | — | — |
| SWE-bench Multilingual | 77.8% | 80.5% | — | — |
| Terminal-Bench | — | 69.4% | — | — |

The SWE-bench Pro jump from 53.4% to 64.3% represents nearly 11 percentage points on a benchmark that measures real-world software engineering tasks — not toy problems. Anthropic's internal 93-task coding benchmark shows a 13% improvement, resolving tasks that were previously impossible for both Opus 4.6 and Sonnet 4.6.

For teams building software engineering agents, this gap over GPT-5.4 (57.7%) is meaningful and likely translates to real productivity differences on complex codebases.

Vision and Computer Use {#vision-benchmarks}

| Benchmark | Opus 4.6 | Opus 4.7 | GPT-5.4 |
|---|---|---|---|
| OSWorld-Verified | 72.7% | 78.0% | 75.0% |
| Visual Acuity | 54.5% | 98.5% | — |
| MCP-Atlas | 75.8% | 77.3% | 68.1% |

The visual acuity jump from 54.5% to 98.5% is the most dramatic single-benchmark improvement. This directly corresponds to the resolution upgrade — tasks involving fine-grained visual parsing that Opus 4.6 consistently failed on are now within reach.

Reasoning — Basically Tied at the Frontier {#reasoning-benchmarks}

| Benchmark | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond | 94.2% | 94.4% | 94.3% |

Frontier models have essentially saturated academic reasoning benchmarks. A 0.2-percentage-point difference on GPQA Diamond is noise, not signal.

Where Opus 4.7 Trails {#weak-benchmarks}

On web research tasks, Opus 4.7 falls behind its main competitors:

| Benchmark | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| BrowseComp | 79.3% | 89.3% | 85.9% |

A ~10-point gap on BrowseComp is meaningful if your workflows depend on deep web research, competitive monitoring, or information retrieval from live sources. For those use cases, GPT-5.4 currently holds a real advantage.

The Unreported Regression {#regression-benchmark}

Here's the number that doesn't appear in Anthropic's launch announcement: the Thematic Generalization Benchmark dropped to 72.8% in Opus 4.7, down from 80.6% in Opus 4.6. This measures general language understanding and knowledge transfer — the "general intelligence" dimension that correlates with quality on creative writing, analytical work, and open-ended reasoning.

This isn't just a fringe metric. It gives quantitative backing to what many users are reporting anecdotally. The pattern suggests Opus 4.7 may have been optimized specifically for software engineering at some cost to general-purpose language capabilities — a trade-off worth understanding before committing to a migration.


The Hidden Cost Problem: "Unchanged Pricing" Is Misleading {#pricing}

Anthropic's pricing page shows Opus 4.7 at $5 per million input tokens / $25 per million output tokens — identical to Opus 4.6. This is technically accurate. It is not the full story.

The New Tokenizer Problem {#tokenizer}

Opus 4.7 ships with a new tokenizer that produces 1.0–1.35× as many tokens for the same input text. A prompt that consumed 10,000 tokens on Opus 4.6 may consume up to 13,500 tokens on Opus 4.7 — at the same headline price.

Real-world impact:

  • Input-heavy workloads: 0–35% cost increase from tokenizer inflation alone
  • Output-heavy workloads: Output tokens are priced at 5× input tokens, so any increase compounds
  • Higher effort levels: xhigh and max generate more output tokens, multiplying the effect

A production workflow spending $1,000/month on Opus 4.6 could spend up to $1,350/month on Opus 4.7 without any changes to task complexity or volume. This is a material budget impact for teams running at scale.
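To make the arithmetic concrete, here is a small estimator. The prices are the per-million-token figures from the pricing section; the 100M/20M token split is an illustrative workload, not data from any real deployment:

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_price: float = 5.0, output_price: float = 25.0,
                 inflation: float = 1.0) -> float:
    """Estimated monthly spend in dollars; prices are per million tokens."""
    tokens_in = input_tokens * inflation
    tokens_out = output_tokens * inflation
    return (tokens_in / 1e6) * input_price + (tokens_out / 1e6) * output_price

baseline = monthly_cost(100_000_000, 20_000_000)                   # Opus 4.6 tokenizer
worst_case = monthly_cost(100_000_000, 20_000_000, inflation=1.35)  # Opus 4.7, worst case
print(baseline, worst_case)  # roughly $1,000 vs $1,350
```

Identical workload, identical headline prices — the entire difference is the inflation factor.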

What Can Offset the Increase {#cost-offsets}

Anthropic offers two mechanisms that can meaningfully offset the tokenizer inflation:

  • Prompt caching: Up to 90% savings on repeated context — high-value if your system prompts are long and stable
  • Batch processing: Up to 50% savings on non-latency-sensitive workloads

If you're already aggressively using both, the net impact may be minimal. If you're not — building caching into your architecture is now more urgent, not just an optimization.
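A rough sketch of how caching changes the math, assuming the up-to-90% discount applies to the cached share of input tokens. This is a simplification of real cache pricing (which distinguishes cache writes from cache reads), so treat it as directional only:

```python
def effective_input_cost(input_tokens: int, cached_fraction: float,
                         price: float = 5.0, cache_discount: float = 0.90) -> float:
    """Input cost in dollars with a fraction of tokens served from the prompt cache."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    # Cached tokens are billed at (1 - discount) of the normal per-million price
    return (fresh * price + cached * price * (1 - cache_discount)) / 1e6

# 100M input tokens inflated 35% by the new tokenizer, but 80% served from cache:
inflated = int(100_000_000 * 1.35)
print(effective_input_cost(inflated, 0.80))
```

Under these assumptions the cached workload comes in well below the uncached Opus 4.6 baseline ($500 of input spend in the earlier example), which is why caching strategy now matters more than the tokenizer change itself.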

Bottom line on costs: "Same pricing" is a press-release statement. Budget for 15–25% higher actual costs when migrating from Opus 4.6, and audit your caching strategy before deployment.


Breaking Changes: What Will Break Your Integration {#breaking-changes}

Opus 4.7 introduces breaking API changes that are unusual in scope for an incremental release. Any production code targeting Opus 4.6 will require explicit updates before pointing at Opus 4.7.

Extended Thinking Budgets — Removed {#thinking-removed}

Previously supported syntax:

"thinking": {"type": "enabled", "budget_tokens": 4000}

In Opus 4.7, this returns a 400 error. Adaptive thinking is the only available thinking mode, and it is off by default.

Migration path: Replace it with "thinking": {"type": "adaptive"} wherever you want reasoning enabled.

Sampling Parameters — Removed {#sampling-removed}

Setting temperature, top_p, or top_k to non-default values now returns a 400 error. These parameters are no longer supported.

Migration path: Remove these parameters from all API calls entirely. Any code that relied on temperature for determinism or creative variance will need rearchitecting.
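The migration rules above can be sketched as a payload rewrite. This helper is illustrative, not an official migration tool — adapt it to however your application builds request bodies:

```python
REMOVED_PARAMS = {"temperature", "top_p", "top_k"}  # now return 400 on Opus 4.7

def migrate_request(payload: dict) -> dict:
    """Rewrite an Opus 4.6 request body for Opus 4.7 per the rules above (sketch)."""
    out = {k: v for k, v in payload.items() if k not in REMOVED_PARAMS}
    thinking = out.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") == "enabled":
        # budget_tokens is gone; adaptive is the only remaining thinking mode
        out["thinking"] = {"type": "adaptive"}
    return out

old = {
    "model": "claude-opus-4-6",
    "temperature": 0.7,
    "top_p": 0.9,
    "thinking": {"type": "enabled", "budget_tokens": 4000},
    "messages": [{"role": "user", "content": "Refactor this module."}],
}
print(migrate_request(old))
```

Stripping parameters mechanically is the easy half; the hard half is re-testing any behavior that depended on temperature for variance or determinism.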

Thinking Content Omitted by Default {#thinking-display}

Thinking blocks appear in the response stream but the content field is empty unless explicitly requested. To receive reasoning:

"thinking": {"type": "adaptive", "display": "summarized"}

Token Count Estimates Are Now Underestimates {#token-count}

Any token budget logic, rate limiting, or cost projection code calibrated to Opus 4.6's tokenizer will systematically underestimate usage on Opus 4.7 by up to 35%.
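A defensive fix is to scale every 4.6-calibrated estimate by the worst-case inflation factor before comparing it against budgets or rate limits. A sketch:

```python
import math

def adjust_token_estimate(opus46_estimate: int, inflation: float = 1.35) -> int:
    """Scale an Opus 4.6-calibrated token count up to a conservative 4.7 figure."""
    return math.ceil(opus46_estimate * inflation)

print(adjust_token_estimate(10_000))  # 13500
```

Over-provisioning by the full 35% wastes a little headroom on low-inflation content, but it is cheaper than rate-limit surprises in production.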

These are not configuration tweaks — they require code changes and testing. The scope of migration effort scales with how deeply your application relied on thinking budgets and sampling parameters.


Claude Opus 4.7 vs. GPT-5.4 vs. Gemini 3.1 Pro {#model-comparison}

| Category | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Coding (SWE-bench Pro) | 64.3% | 57.7% | 54.2% |
| Web Research (BrowseComp) | 79.3% | 89.3% | 85.9% |
| Reasoning (GPQA Diamond) | 94.2% | 94.4% | 94.3% |
| Computer Use (OSWorld) | 78.0% | 75.0% | — |
| Tool Use (MCP-Atlas) | 77.3% | 68.1% | — |
| Max Image Resolution | 3.75MP | — | — |
| Context Window | 1M tokens | 1M tokens | — |
| Input Price | $5/M | — | — |
| Output Price | $25/M | — | — |

When Claude Opus 4.7 wins:

  • Software engineering agents and coding automation — by a meaningful margin
  • Computer use and GUI automation tasks
  • Tool use and MCP-based agentic workflows
  • High-resolution image processing and vision tasks

When competitors win:

  • Web research and live information retrieval → GPT-5.4's BrowseComp lead is real and consistent
  • General-purpose language tasks → Thematic Generalization regression is a concern
  • General reasoning → Essentially a three-way tie; pick based on other criteria

If your core workflow is code, Opus 4.7 is the current best option. If it's research, GPT-5.4 deserves serious consideration. For more comparison context, see our full GPT-5.4 vs. Claude Opus 4.6 vs. Gemini 3.1 Pro breakdown.


The Elephant in the Room: Claude Mythos Preview {#mythos}

Anthropic's launch materials prominently benchmark Opus 4.7 against Claude Mythos Preview — a model that Opus 4.7 consistently loses to:

| Benchmark | Mythos Preview | Opus 4.7 |
|---|---|---|
| SWE-bench Pro | 77.8% | 64.3% |
| SWE-bench Verified | 93.9% | 87.6% |
| Terminal-Bench | 82.0% | 69.4% |
| GPQA Diamond | 94.6% | 94.2% |
| OSWorld-Verified | 79.6% | 78.0% |

Mythos Preview outperforms Opus 4.7 by roughly 6 to 14 percentage points across the software engineering benchmarks. It is not publicly available.

This creates an unusual dynamic: Anthropic is releasing what it calls "the most capable generally available model" while framing it against a model that makes it look like a step-down option. The implication is that Anthropic is building toward Mythos-class general availability at scale — but no timeline has been communicated.

For detailed context on what Mythos Preview represents, see our article Claude Mythos Preview: Anthropic's Most Powerful AI Can Find Zero-Days for $50.

There's also a safety dimension worth noting: Opus 4.7's cybersecurity capabilities were "differentially reduced" relative to Mythos Preview. Anthropic describes this as an intentional choice — Opus 4.7 has real-time cybersecurity safeguards with refusals on high-risk topics. For legitimate security research, red teams, and penetration testers, this capability reduction is a practical concern that may push those workflows toward alternative models.


Real User Concerns: Is This Actually a Regression? {#user-concerns}

The community criticism is real and predates Opus 4.7. Stella Laurenzo, senior director at AMD, published an analysis of 6,852 Claude Code sessions claiming "Claude has regressed to the point it cannot be trusted to perform complex engineering." GitHub issue trackers show reports like "Critical: Opus 4.6 Configuration Regression — 92/100 → 38/100 Performance Drop."

With Opus 4.7, the core complaints center on:

  1. Confident assertion without tool verification — the model makes assertive factual claims rather than using available tools to check
  2. Instruction drift — explicit verification rules in system prompts are inconsistently followed
  3. Creative and analytical flatness — tasks outside coding feel lower quality, less nuanced

The Thematic Generalization Benchmark drop (80.6% → 72.8%) gives these complaints quantitative footing. This isn't anecdote.

Anthropic's Explanation

Anthropic's position distinguishes between model changes and product/interface changes, arguing that some degradation users experienced reflected changed defaults — specifically, how much effort Claude expends at default settings — rather than architectural regressions.

This is a meaningful and credible distinction. If the default effort level dropped between releases and you're comparing at defaults, you may be comparing different effort expenditure levels, not different models.

If you're experiencing quality degradation with Opus 4.7:

  1. Explicitly set "thinking": {"type": "adaptive"} for complex reasoning tasks
  2. Set effort to xhigh for coding-intensive workflows
  3. Test with and without adaptive thinking on your specific task type
  4. Reinforce critical rules in your system prompt — don't rely on implicit compliance

The model performs materially differently at different effort levels. Many regression reports likely reflect default-setting changes rather than fundamental capability regressions. Test with explicit configuration before drawing conclusions.


Who Should Upgrade to Claude Opus 4.7? {#should-you-upgrade}

Upgrade immediately:

  • Teams building coding agents or automated software engineering pipelines
  • Vision-based agents processing dense technical diagrams or high-resolution images
  • Computer-use and GUI automation workflows
  • Anyone using Claude Code as a primary development tool — the SWE-bench improvement and /ultrareview command are compelling

Upgrade with caution:

  • Cost-sensitive production workloads — budget for tokenizer inflation
  • Any integration using budget_tokens, temperature, top_p, or top_k — these are breaking changes requiring code updates
  • Workloads primarily outside coding — test your specific tasks before committing

Consider deferring:

  • Web research-heavy workflows (GPT-5.4's BrowseComp lead is significant)
  • Tight-budget teams already optimized for Opus 4.6 who aren't doing heavy coding
  • Anyone waiting for Mythos-class capabilities (no timeline available)

For the best AI coding tools ecosystem context, our Best AI Coding Tools 2026 comparison covers how Claude Code with Opus 4.7 sits alongside Cursor, Copilot, and Windsurf.


How to Get Started {#getting-started}

Access Claude Opus 4.7 through:

  • Claude.ai — available in Pro and Max plans
  • Anthropic API — model ID claude-opus-4-7-20260416
  • Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry

Key API configuration changes from Opus 4.6:

# Enable adaptive thinking (required — no longer auto-enabled)
"thinking": {"type": "adaptive"}

# Use xhigh effort for coding and agentic tasks
"effort": "xhigh"

# Task budgets (beta) — requires header: task-budgets-2026-03-13
"task_budget": 50000  # minimum 20,000 tokens

# REMOVE these — will 400 error on Opus 4.7:
# "temperature": 0.7
# "top_p": 0.9
# "thinking": {"type": "enabled", "budget_tokens": 4000}

Non-API users can access Claude directly on Claude.ai; Claude Code provides the IDE-integrated coding experience with /ultrareview and task budget support.
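Pulling the configuration fragments above into a single request body, as a sketch — the field names follow this article's examples, and none of this is a confirmed SDK signature:

```python
def build_opus_47_request(prompt: str) -> dict:
    """Combine the Opus 4.7 configuration changes above into one request body."""
    return {
        "model": "claude-opus-4-7-20260416",
        "max_tokens": 8_192,
        "effort": "xhigh",  # recommended for coding and agentic tasks
        "thinking": {"type": "adaptive", "display": "summarized"},
        "messages": [{"role": "user", "content": prompt}],
        # Deliberately absent: temperature, top_p, top_k, budget_tokens --
        # all of these now return a 400 error on Opus 4.7.
    }

print(build_opus_47_request("Refactor the auth module and add tests."))
```

The useful habit here is constructing bodies through one function: when the next release changes parameters again, there is exactly one place to update.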


FAQ {#faq}

Is Claude Opus 4.7 better than GPT-5.4? {#faq-vs-gpt}

For software engineering tasks: yes, by a meaningful margin (SWE-bench Pro 64.3% vs. 57.7%). For web research and live information retrieval: no, GPT-5.4 leads significantly (BrowseComp 89.3% vs. 79.3%). For general reasoning: essentially tied. The better model depends entirely on your use case — pick by workload, not by headline.

Did Anthropic raise prices for Claude Opus 4.7? {#faq-pricing}

Officially no — headline pricing is $5/M input and $25/M output, unchanged from Opus 4.6. However, the new tokenizer produces 1.0–1.35× as many tokens for the same input text, resulting in a real-world cost increase of 0–35% depending on your content type. "Same price per token" with more tokens consumed per request is not the same as "same cost."

What happened to temperature and top_p support? {#faq-temperature}

Removed. Passing non-default values for temperature, top_p, or top_k now returns a 400 API error in Opus 4.7. Anthropic's position is that these parameters don't improve output quality for the new model architecture. Any code relying on these parameters for deterministic or creative outputs requires migration.

Is Claude Opus 4.7 worth it for creative writing? {#faq-creative}

The Thematic Generalization Benchmark regression (72.8% vs. 80.6% in Opus 4.6) provides quantitative evidence for quality degradation on general language tasks. If creative writing, general analysis, or open-ended reasoning are your primary use cases, test Opus 4.7 against your specific prompts before migrating — don't assume "new version = better everywhere."

What is the xhigh effort level? {#faq-xhigh}

xhigh is a new effort setting between high and max in Opus 4.7. It directs Claude to apply more reasoning than high allows but less than max requires, offering a cost-efficiency middle ground for tasks where you need elevated performance without paying for maximum inference. Anthropic recommends it for coding and agentic tasks.

When will Claude Mythos Preview be publicly available? {#faq-mythos}

Anthropic has not provided a timeline. Mythos Preview is not currently available for general release. Anthropic has stated the goal of eventually deploying Mythos-class models at scale, but no dates or conditions have been communicated.


Final Verdict

Claude Opus 4.7 is the right choice if your work centers on software engineering. The SWE-bench Pro improvement from 53.4% to 64.3% is not benchmark theater — it represents real-world gains on complex coding tasks that Opus 4.6 couldn't solve. The vision resolution upgrade and 1:1 pixel coordinates quietly fix a class of computer-use bugs that has frustrated developers for months.

But the honest assessment requires acknowledging the rest: the tokenizer inflation that hides behind "unchanged pricing," the breaking API changes that will require non-trivial migration work, the Thematic Generalization regression that validates user complaints about quality decline outside coding, and the persistent shadow cast by Mythos Preview — a model Anthropic keeps demonstrating but won't ship.

Upgrade for coding agents. Test carefully for everything else. And update your cost projections before you flip the switch.
