Claude Opus 4.8 Review: Benchmarks & Dynamic Workflows (2026)
By AI Workflows Team · June 1, 2026 · 12 min read
Claude Opus 4.8 lands with a 69.2% SWE-bench Pro score, a 3x-cheaper Fast mode, and dynamic workflows that run hundreds of parallel subagents in Claude Code. Our review covers the benchmarks it wins, the ones it loses, pricing, and a use-case upgrade guide.
Claude Opus 4.8 Review: Benchmarks, Dynamic Workflows & Should You Upgrade (2026)
TL;DR — Quick Verdict
Claude Opus 4.8 shipped on May 28, 2026, and it's the most consequential Claude release for engineering teams since the 4.x line began. The headline number is SWE-bench Pro at 69.2% (up from 64.3% on Opus 4.7), but the real story is two quieter shifts: a new dynamic workflows mode that lets Claude Code orchestrate hundreds of parallel subagents on its own, and a Fast mode that's now 3x cheaper than before. If you already run agentic pipelines or build them with a setup like the Autonomous AI Agent workflow, 4.8 is a clear upgrade. If you use Claude for chat and light coding, the gains are smaller, though the honesty improvements alone may justify the switch.
This review covers what changed, the benchmarks where 4.8 wins (and where it still loses), pricing, how parallel subagents actually work, and a use-case-by-use-case upgrade guide.
What's New in Claude Opus 4.8
Claude Opus 4.8 is Anthropic's flagship model released on May 28, 2026, focused on agentic coding, parallel task orchestration, and code honesty rather than raw reasoning gains. It's available immediately on claude.ai, the Claude API (claude-opus-4-8), Amazon Bedrock, Google Vertex AI, and Microsoft Foundry.
Four changes stand out from the previous generation:
- Dynamic workflows in Claude Code: the model now plans large tasks, spins up hundreds of parallel subagents, verifies their output, and reports back without manual orchestration.
- A 3x-cheaper Fast mode: the optional 2.5x-speed tier dropped from $30/$150 to $10/$50 per million tokens.
- Mid-task system messages: the Messages API now accepts system entries partway through a conversation without breaking the prompt cache.
- Measurable honesty gains: Opus 4.8 is roughly 4x less likely than 4.7 to let flaws in its own code pass unremarked.
What did not change is just as telling. Standard pricing held at $5 input / $25 output per million tokens, and several reasoning benchmarks moved only fractionally. This release is tuned for people who ship code with agents, not for people chasing leaderboard reasoning scores.
The bottleneck in agentic coding was never raw intelligence. It was orchestration and trust. You can't hand work to an agent you have to double-check line by line, and you can't scale past one agent if wiring them together eats your day. That's why the honesty and subagent features in 4.8 matter more than the benchmark bump for anyone actually shipping with agents.
Benchmarks: Where 4.8 Wins and Loses
Opus 4.8 posts its biggest gains on hard, unsaturated coding and agentic benchmarks, while reasoning scores stay roughly flat and two benchmarks actually favor competitors. Honest reporting matters here, because most launch coverage only shows the wins.
Here's the full picture, drawn from Anthropic's release data and Vellum's independent benchmark breakdown:
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 88.6% | — | — | — |
| SWE-bench Pro | 69.2% | 64.3% | 58.6% | 54.2% |
| Humanity's Last Exam (w/ tools) | 57.9% | 54.7% | 52.2% | 51.4% |
| GDPval-AA (Elo) | 1,890 | 1,753 | 1,769 | 1,314 |
| OSWorld-Verified | 83.4% | 82.8% | 78.7% | 76.2% |
| Terminal-Bench 2.1 | strong | — | 78.2% | — |
| Finance Agent v2 | 53.9% | 51.5% | 51.8% | — |
Three takeaways the leaderboard tables hide:
SWE-bench Pro is the headline that matters. SWE-bench Verified is approaching saturation near 88-90%, so a 4.9-point jump on the harder, less-saturated Pro set (69.2%) is where the real headroom shows. This is the benchmark closest to messy, real-world repository work.
GPT-5.5 still owns the terminal. On Terminal-Bench 2.1, GPT-5.5 paired with the Codex CLI hits 78.2% (83.4% in some configs) and beats Opus 4.8. If your work is heavily shell-driven, ChatGPT's coding stack is still competitive.
A small model beats the flagship on finance. Gemini 3.5 Flash scores 57.9% on Finance Agent v2, higher than Opus 4.8's 53.9%. The lesson: flagship models are not automatically best at every narrow task, and a cheaper specialized model sometimes wins.
GPQA Diamond, a pure-reasoning science benchmark, actually slipped about 0.6 points versus 4.7. Anthropic clearly traded a sliver of abstract reasoning for agentic reliability. That's a reasonable bet for a coding-focused release, but worth knowing if you use Claude for hard scientific reasoning.
Dynamic Workflows & Parallel Subagents Explained
Dynamic workflows is a research-preview feature in Claude Code that lets Opus 4.8 break a large task into a plan, launch up to hundreds of parallel subagents to execute the pieces, verify their outputs, and report results, all without the user wiring the orchestration by hand.
This is the single biggest practical change in 4.8. Previously, running multiple Claude agents in parallel meant you wrote the coordination logic yourself: spawning the agents, passing context between them, collecting results, resolving conflicts. With dynamic workflows, Claude itself plans the decomposition and manages the fan-out.
How it works in practice:
- Claude plans the work: it reads the task (say, a codebase-scale migration across hundreds of thousands of lines) and decomposes it into independent units.
- It distributes across subagents: each subagent gets a scoped slice of the job and runs concurrently.
- It verifies before reporting: outputs are checked against the plan before they reach you, rather than dumped raw.
- The plan lives outside the context window: workflow state is held in JavaScript variables, not in Claude's token context, which is how it scales past what a single context window could track.
The hard ceilings keep it controllable:
| Limit | Value |
|---|---|
| Total subagents per workflow | 1,000 |
| Concurrent subagents | 16 |
| Availability | Claude Code only (research preview) |
| Plan tiers | Enterprise, Team, Max |
| Subagent token cost | Standard Opus 4.8 rate — no premium |
That last row is important: subagents are billed at the normal $5/$25 rate, with no orchestration surcharge. The cost scales with total tokens consumed, not with how many agents you spin up. A 200-subagent migration costs the same per token as a single long session. You just get there faster and in parallel.
For anyone already designing multi-agent systems, this collapses a lot of glue code. If you're building one from scratch, our Autonomous AI Agent Setup workflow pairs naturally with this feature: Claude handles the in-session orchestration while you design the higher-level agent roles and handoffs.
Pricing and the New 3x-Cheaper Fast Mode
Claude Opus 4.8 keeps standard pricing identical to 4.7, at $5 per million input tokens and $25 per million output tokens, while cutting the optional Fast mode from $30/$150 to $10/$50, a 3x reduction.
| Tier | Input (per 1M) | Output (per 1M) | Speed | Notes |
|---|---|---|---|---|
| Standard | $5 | $25 | 1x | Same as Opus 4.7 |
| Fast mode | $10 | $50 | ~2.5x | Was $30/$150 on 4.7 |
The Fast mode change is the quiet win for production teams. At 2.5x the speed of standard inference, it was previously a luxury tier most teams couldn't justify at $30/$150. At $10/$50, latency-sensitive agentic loops like code review bots, CI assistants, and interactive pair-programming become affordable to run on Fast mode by default.
For context on how this fits the broader pricing landscape, the flagship rate of $5/$25 sits well above budget models like Gemini 3.5 Flash. But the gap closes fast once you factor in fewer retries from 4.8's higher first-pass accuracy. A model that gets it right once is cheaper than a cheap model you have to re-prompt three times.
Opus 4.8 vs 4.7 vs GPT-5.5 vs Gemini 3.5
Choosing between the 2026 flagships comes down to your dominant workload: Opus 4.8 leads on repository-scale coding and professional tasks, GPT-5.5 wins terminal work, and Gemini 3.5 Flash wins on price and some narrow agent tasks.
| Factor | Opus 4.8 | GPT-5.5 | Gemini 3.5 Flash |
|---|---|---|---|
| Best at | Repo-scale coding, agentic orchestration | Terminal/CLI work | Cost, narrow agents |
| SWE-bench Pro | 69.2% | 58.6% | — |
| Terminal-Bench | strong | 78.2% (w/ Codex) | — |
| Parallel subagents | Native (1,000) | Manual | Manual |
| Standard price (in/out) | $5 / $25 | varies | ~$1.50 / $9 |
| Code honesty | 4x better than 4.7 | baseline | baseline |
The practical read for a mid-level team:
- Pick Opus 4.8 if you do large refactors, multi-file features, or run agent swarms. The SWE-bench Pro lead and dynamic workflows are decisive here.
- Pick GPT-5.5 if your workflow is terminal-heavy and you live in the Codex CLI. It still wins Terminal-Bench. Compare the two coding stacks in our AI coding tools comparison.
- Pick Gemini 3.5 Flash if cost dominates and your tasks are narrow. At roughly $1.50/$9 it's a fraction of the price, and it even beats Opus 4.8 on Finance Agent v2.
There's a winner per workload, not a winner overall. Most serious teams now route different task types to different models, and Opus 4.8 has become the default for the hardest coding jobs.
The Honesty Upgrade and Why It Matters for Code
Opus 4.8 is approximately 4x less likely than Opus 4.7 to let flaws in its own generated code pass without flagging them. Anthropic's alignment team describes the change as reaching new highs on prosocial traits.
This sounds abstract until you've shipped agent-written code. The failure mode of earlier models wasn't that they wrote bad code. It's that they wrote bad code and presented it confidently. You'd get a clean-looking diff with a subtle off-by-one or an unhandled edge case, no warning attached. Catching it was on you.
A 4x reduction in unflagged flaws changes the trust equation for delegation. When Claude says "this passes, but I'm uncertain about the error handling on line 40," you can triage. When it stays silent on a flaw, you can't. For agentic workflows where a human reviews dozens of subagent outputs, a model that flags its own doubts is the difference between a usable pipeline and a liability.
Anthropic positioned the alignment quality as approaching the level of its experimental Claude Mythos preview: the same honest-uncertainty behavior, now in a production model. For teams that abandoned full delegation because they couldn't trust silent agents, this is the feature that brings it back on the table.
Should You Upgrade? A Decision Guide
Whether to move from Opus 4.7 to 4.8 depends on how you use Claude. Here's the honest breakdown by use case rather than a blanket "yes."
| Your use case | Upgrade? | Why |
|---|---|---|
| Large refactors / repo-scale migrations | Yes, immediately | Dynamic workflows + SWE-bench Pro lead are built for this |
| Running multi-agent pipelines | Yes | Native parallel subagents replace your orchestration glue |
| Latency-sensitive production loops | Yes | Fast mode at $10/$50 is now affordable |
| Daily chat + light coding | Optional | Honesty gains help; raw capability bump is modest |
| Hard scientific reasoning | Test first | GPQA dipped slightly; benchmark your own prompts |
| Pure budget-driven, narrow tasks | Maybe not | Gemini 3.5 Flash may serve you cheaper |
The strongest case for upgrading isn't a single benchmark. It's the combination of dynamic workflows, the cheaper Fast mode, and the honesty gains landing together. Each is incremental alone; together they reshape what's practical to delegate to an agent.
If you're on 4.7 and your work is agentic, there's little reason to wait. The standard price is identical, so the only "cost" of upgrading is changing a model string.
How to Use Opus 4.8 in an Agentic Workflow
The fastest way to get value from Opus 4.8 is to point an existing agentic setup at it and let dynamic workflows handle the parallelization you used to wire by hand.
A practical starting pattern:
- Switch your model string to
claude-opus-4-8in your Claude Code config or API calls. - Identify a fan-out task that is naturally parallel, like "add type hints across every module" or "migrate all API routes to the new schema."
- Let Claude plan it rather than scripting the decomposition yourself. Describe the goal and the constraints; the dynamic workflow handles the split.
- Review the verified output. Claude checks subagent results before reporting, but you still own the final merge.
- Watch the token budget. Subagents bill at standard rate, so a 1,000-subagent job is cheap per token but adds up in volume. Use Fast mode for the iterative loops.
For teams formalizing this into a repeatable system, the Autonomous AI Agent Setup workflow gives you the scaffolding (agent roles, handoffs, verification gates) that complements Claude's in-session orchestration. Opus 4.8 manages the micro-parallelism; your workflow design manages the macro-structure.
FAQ
Is Claude Opus 4.8 worth upgrading from 4.7?
For agentic coding, large refactors, or multi-agent pipelines, yes. Dynamic workflows and the SWE-bench Pro lead (69.2% vs 64.3%) are decisive, and standard pricing is unchanged. For daily chat and light coding, the upgrade is optional, though the 4x code-honesty improvement still helps.
How much does Claude Opus 4.8 cost?
Standard pricing is $5 per million input tokens and $25 per million output tokens, identical to Opus 4.7. The optional Fast mode runs ~2.5x faster at $10/$50 per million tokens, which is 3x cheaper than the Fast tier on 4.7.
What are dynamic workflows in Claude Opus 4.8?
Dynamic workflows is a Claude Code feature where Opus 4.8 plans a large task, launches up to 1,000 parallel subagents (16 concurrent) to execute it, verifies their outputs, and reports results with no manual orchestration. It's in research preview for Enterprise, Team, and Max plans.
Does Claude Opus 4.8 beat GPT-5.5?
It depends on the task. Opus 4.8 wins SWE-bench Pro (69.2% vs 58.6%) and professional work benchmarks, but GPT-5.5 still wins Terminal-Bench 2.1 at 78.2% with the Codex CLI. There's no universal winner, so route by workload.
Is Claude Opus 4.8 available on the API?
Yes. It's available immediately via the Claude API as claude-opus-4-8, plus claude.ai, Amazon Bedrock, Google Vertex AI, and Microsoft Foundry.
Sources & References
- Anthropic, Introducing Claude Opus 4.8 (official announcement, May 28, 2026)
- Vellum, Claude Opus 4.8 Benchmarks Explained (independent benchmark breakdown)
- VentureBeat, Anthropic's Claude Opus 4.8 is here with 3x cheaper fast mode
Cover image: Anthropic official Claude Opus 4.8 announcement art.
Use this in a workflow
Opus 4.8 is most powerful as the engine inside a repeatable agent system, not as a one-off chat. Plug it into one of these:
- Autonomous AI Agent Setup: build a team of agents that handle complex operations, now with native parallel subagents.
- App Development Workflow: ship features end to end with Opus 4.8 doing the heavy refactors.
- Research Assistant: point 4.8's higher reasoning accuracy at literature review and synthesis.