Claude Opus 4.8: What Changed, and How I'll Test It

Opus 4.8 is not the main event. It is the model that teaches you how to work before Mythos arrives.

Karo (Product with Attitude)

May 28, 2026

TL;DR

Claude Opus 4.8 is Anthropic’s most capable widely available model, released May 28, 2026, at the same price as Opus 4.7: $5 per million input tokens and $25 per million output tokens. The API model string is `claude-opus-4-8`. The headline upgrade is judgment, not raw intelligence: Anthropic says it is around four times less likely than Opus 4.7 to let flaws in its own code pass unremarked. Three things ship alongside it: effort control on claude.ai, Dynamic Workflows in Claude Code, and a Fast mode that runs at 2.5x speed for three times less than it cost on previous Opus models. Opus 4.8 is a bridge to Claude Mythos, Anthropic’s more powerful model still gated for cybersecurity reasons under Project Glasswing.

Just when I thought I’d get a calm evening with For All Mankind, Opus 4.8 happened.

I went from dynamic space missions to dynamic workflows. I adapted immediately, because my plans are dynamic too.

Here’s what changed in forty-one days.

That’s how long Opus 4.7 lasted before Anthropic moved on to Opus 4.8.

That alone tells me something. This release matters, because it looks like the bridge to what Anthropic is building next.

Hey, I’m Karo Zieminski 🤗

AI Product Manager and builder.

I write Product with Attitude, an AI newsletter for thousands of subscribers developing critical AI literacy the only way it sticks: through practice.

We don’t just use AI. We build workflows, automations, and products with it, while studying how AI itself is built, positioned, and woven into our work.

If you’re new here, welcome! Here’s what you might have missed:

What’s Inside

What Claude Opus 4.8 Is

Claude Opus 4.8 is Anthropic’s most capable widely available model for complex reasoning, long-horizon agentic coding, and high-autonomy work, released on May 28, 2026, at the same regular API price as Opus 4.7.

It builds on Opus 4.7 with improvements across benchmarks, but the real story is simpler: it’s a more effective collaborator, and Anthropic is positioning that as the headline.

For context: Opus 4.7 was already a careful but meaningful upgrade from Opus 4.6: better at long-running agentic work, more rigorous about verifying its own outputs, and dramatically stronger at high-resolution image understanding.

But Opus 4.7 got a “chilly reception” from some users. Opus 4.8 feels like Anthropic answering the 4.7 feedback in public: move fast, keep the price, fix the weak spots.

Claude Opus 4.8 fact sheet

Product with Attitude model card infographic summarizing Claude Opus 4.8, released May 28 2026: it builds on Claude Opus 4.7, costs $5 per million input and $25 per million output tokens in regular mode and $1.25 / $6.25 in Fast Mode, introduces dynamic workflows in Claude Code with planning, parallel subagents and verification, improves complex reasoning, agentic coding and honesty signaling, but shows mixed agentic robustness versus Opus 4.7 and remains weaker than Claude Mythos Preview.

The Workflow Shift

Anthropic calls it “a modest but tangible improvement,” which for an AI launch is refreshingly un-hyped. Someone in marketing was briefly sedated.

Three things ship alongside it.

Effort control on claude.ai.
Dynamic workflows in Claude Code.
Fast mode that now runs at 2.5× speed for 3× less than it cost on previous Opus models.

The price holding flat is a nice surprise. A better model for the same money is a rare thing in this market.

The Honesty Upgrade: Better Uncertainty Calibration

One of the most prominent improvements in Opus 4.8 is its honesty. Opus 4.8 is supposed to answer better and, more importantly, doubt itself better. This addresses a general problem with AI models jumping to conclusions and confidently claiming progress on thin evidence.

The concrete number: Anthropic's evaluations show Opus 4.8 is around four times less likely than Opus 4.7 to let flaws in code it wrote pass unremarked.

A Bridgewater tester told TechCrunch the biggest difference was its tendency to proactively flag issues with the inputs and outputs of an analysis, the kind other models leave for you to catch.

For developers, PMs, and builders, this takes the model one step closer to a review partner. Not a perfect one. Not a replacement for tests. But better.

Dynamic Workflows

Dynamic Workflows is a research-preview feature where Claude plans a large task, spawns tens to hundreds of parallel subagents, and verifies the results before returning a single coordinated answer. Built for migrations, codebase-wide bug hunts, and security audits. It consumes meaningfully more tokens than a normal session.

Dynamic workflows are the product launch that moves Claude Code into new territory. It changes Claude Code from a coding agent you steer into something closer to a temporary engineering team it orchestrates.

Claude plans a large task, breaks it into subtasks, spawns tens to hundreds of parallel subagents, and verifies the results before anything reaches you.

It’s the same pattern I’ve been tracking in my previous analyses of Perplexity Computer.

If you’re on a Max or Team plan, or running Claude Code through the API, dynamic workflows are already on by default.

To get started, ask Claude to create a workflow, or turn on the Claude Code-specific ultracode setting.

If you’re on an Enterprise plan, dynamic workflows launch switched off, but your admin can change this in the Claude Code settings.

The Demo Everyone Will Cite Is Bun

Anthropic needed one example that made dynamic workflows feel real. Bun is that example.

A Bit of Context if You’re Not Deep in Developer Land

Bun is a developer tool used to run and build JavaScript apps. Part engine, part toolbox for web developers.
Bun was originally written in Zig, a fast programming language that gives developers more direct control over the machine. That control is powerful, but it also leaves more room for painful mistakes.
Rust is another programming language that is also fast, but it is especially known for safety. It helps prevent bugs that can crash systems or create security problems.
Porting means rewriting software from one programming language into another while preserving what it already does.

So when Anthropic says Claude helped port Bun from Zig to Rust, it does not mean Claude changed a button color.

It means Claude helped move a serious developer tool from one systems language to another while preserving what it already does.

Roughly 750,000 lines of Rust. 750,000 lines of code means a large-scale software rewrite, not a feature tweak.
Bun already had automated checks that test whether the software still works. After the rewrite, 99.8% of those checks still passed.
Eleven days from start to finish.
Hundreds of agents working in parallel.

Important caveat, and I appreciate Anthropic making this clear: this was not yet in production at the time of its post.

The second caveat matters to our wallets.
Dynamic workflows use more tokens than a normal Claude Code session.

My cost-saving recommendations:

Don’t confuse can do a huge task with should be given a huge task immediately. Start small before pointing them at your whole codebase.
Use the expensive model where the cost of being wrong is higher than the cost of tokens. Everywhere else, drop down a tier.

The Benchmarks, With Caveats

Opus 4.8 improves across benchmarks, especially the agentic-work stack. Here are the numbers Anthropic and the trackers report, atomic enough to lift one at a time.

SWE-Bench Verified: 88.6%. Software engineering on real GitHub issues.
SWE-Bench Pro: 69.2%, up from 64.3% on Opus 4.7. The harder coding set.
Online-Mind2Web: 84%. Computer use and browser-agent tasks. The standout result of the release.
OSWorld-Verified: 83.4%. Anthropic recalculated the Opus 4.7 score under the same new methodology, which makes this comparison cleaner.
GPQA Diamond: 93.6%. Graduate-level science reasoning.
USAMO 2026: 96.7%. Math olympiad.
GDPval-AA: 1890 Elo, up from 1753. Knowledge work.
Terminal-Bench 2.1: 74.6%. Here GPT-5.5 reports higher.

Benchmark comparison table for Claude Opus 4.8 vs Opus 4.7: SWE-bench Pro 64.3 to 69.2 percent, OSWorld-Verified 82.8 to 83.4 percent, GDPval-AA Elo 1753 to 1890, Online-Mind2Web 84 percent, Terminal-Bench 2.1 66.1 to 74.6 percent where Opus 4.8 still loses to GPT-5.5

Coding goes from 64.3% to 69.2%, computer use from 82.8% to 83.4%, knowledge work from 1753 to 1890, and financial analysis from 51.5% to 53.9%.

On Terminal-Bench 2.1, Anthropic footnotes that GPT-5.5's reported 83.4% used the Codex CLI harness, while all models in their table used the Terminus-2 public harness. The number depends on how you run the test.

Now the caveats:

Product with Attitude infographic table titled The Benchmarks, With Human Translation — a plain-English guide to Anthropic's Opus 4.8 benchmark claims across six benchmarks: Super-Agent, Online-Mind2Web, Legal Agent Benchmark, CursorBench, Terminal-Bench 2.1, and OSWorld-Verified, showing what each tests, the Opus 4.8 result, and the caveat.

Remember:
Benchmarks are not just numbers. They are numbers produced by a testing method. When the method changes, the comparison changes too.
I mention this because I noticed that in OSWorld-Verified, Anthropic tested Opus 4.8 using a new methodology and also recalculated Opus 4.7 under that same method.
That’s useful, because it makes the comparison cleaner. But it also means the story changes before independent testers have had time to weigh in.
None of this means the numbers are inaccurate. It means they come with scaffolding: testing setups, scoring rules, tools, assumptions, and recalculations.
And critical AI literacy means understanding this before trusting the leaderboard. I learned this the hard way testing GPT-5.5’s citation reliability myself.

Effort Control: High, Extra, xhigh, Max

Effort control is the product shift in this release. It sits next to the model selector on claude.ai and Cowork, and it lets you choose how hard Claude works on a response. It is available on all plans, including Free.

Opus 4.8 defaults to high effort. On coding tasks, high spends roughly the same tokens as Opus 4.7’s default but performs better.

For harder problems you can choose extra (xhigh in Claude Code) or max, where the model spends more tokens to get better results. Anthropic recommends extra for difficult tasks and long-running asynchronous workflows.

One thing that matters for builders: don’t blindly swap model IDs and assume cost stays flat.

Anthropic recalibrated the effort levels versus Opus 4.7. Medium allows somewhat more thinking, high somewhat less, xhigh substantially more.

The 2026 control syntax also dropped temperature, top_p, and top_k for Opus 4.8, which now return a 400 error.

Fast Mode

Claude Code users can now turn on Fast mode for Opus 4.8 with /fast. Same model, roughly 2.5x faster, and three times cheaper.

On the API, you can contact your account manager to request access or join the waitlist: http://claude.com/fast-mode.

Fast mode is the iteration mode, not the final-judgment mode.

Use it when speed matters: exploring, drafting, trying alternatives, moving through low-risk loops.

For final reviews, risky migrations, security-sensitive work, or anything where the blast radius is bigger than your coffee mug, slow down and inspect.

When to Use What

The cited coverage tells you what Opus 4.8 is. It does not tell you when to reach for which lever. Here is the read I'll be working from.

Claude Opus 4.8 decision table matching six work scenarios to the right setting: lower effort for casual chat, high or extra effort for debugging, Dynamic Workflows for codebase migrations, and max effort when mistakes are costly.

The Mythos Bridge: The Real Story

Opus 4.8 is good. But it’s probably not the main event.

Anthropic is already pointing toward Claude Mythos, a more powerful model currently available only to a small number of organizations for cybersecurity work. The reason it is gated is capability: Mythos is strong enough at offensive cybersecurity tasks that Anthropic is being careful about access.

That makes Opus 4.8 interesting in a different way: it seems to be a bridge model. A way for us to practice the workflows before Mythos arrives.

The tell is in two places: the alignment data and this short paragraph from the announcement:

Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.

That paragraph changes how I read this release.

It’s a preparation layer. Better effort control, better dynamic workflows, better self-verification, better uncertainty signaling.

All the habits we need before the next class of models lands.

Pricing

Regular pricing is unchanged from Opus 4.7.

$5 per million input tokens and $25 per million output tokens for regular usage, unchanged from Opus 4.7. Fast mode is $10 input and $50 output at 2.5× speed, now 3× cheaper than fast mode on previous Opus models. Prompt caching saves up to 90%, batch processing up to 50%.

A stronger model at the same regular API price is the part worth underlining.

But dynamic workflows are the exception. They can burn more tokens because they are not one conversation. They are orchestration. More agents. More verification. More parallel work. More ways to accidentally turn a coding session into a big invoice.

Confirmed, Unconfirmed, and Still Untested

Product with Attitude fact-check infographic sorting twelve claims about Claude Opus 4.8 by how trustworthy they are. Confirmed by Anthropic: launched May 28 2026, regular pricing $5/$25 per million tokens, Fast Mode $1.25/$6.25, Claude Code dynamic workflows that plan, split work across hundreds of subagents and verify outputs, improved honesty and uncertainty signaling, that it remains weaker than Claude Mythos Preview, and that it is less robust than Opus 4.7 in several agentic contexts before safeguards. Confirmed with caveats: stronger than Opus 4.7 across nearly all capability evaluations. Provider or platform-specific: OpenRouter lists a 1M-token context window and 128k max output, and GitHub Copilot applies a 15x premium request multiplier until usage-based billing launches. Unconfirmed: the rumored +30% tokenizer improvement. Untested or too broad: that Opus 4.8 beats every other coding model in all workflows.

How I Plan to Test Opus 4.8 Next Week

I’ll start with three practical tests.

1. Honesty under weak evidence.

I want to give Claude a task where “fast answer” and “correct answer” are not the same thing.

Something messy. Something with trade-offs. Something where the model has to compare options, explain uncertainty, and resist the urge to sound confident just because confidence looks pretty in a demo.

For example: a messy article and a vague benchmark claim, then ask what it can verify versus what stays uncertain. The honesty upgrade only counts if it shows up when the evidence is thin.

2. Code review self-doubt.

I'll give it flawed code, ask for a fix, then ask it to review its own fix. The four-times claim is about catching its own mistakes. So I'll make it look.

3. Dynamic workflows on bounded codebase problems.

I’m not going to point it at an entire app and ask it to “make everything better.” That’s not a test.

A real test needs a boundary.

I want to give Claude one contained problem where parallel work actually makes sense, then judge the output against something concrete.

Better tests:

migrate one module from one pattern to another
audit one folder for a specific class of bugs
refactor one subsystem without changing behavior
find duplicated logic across a defined part of the codebase
compare two implementation patterns and recommend one with trade-offs
map risky dependencies before touching the code
generate a test plan before making changes

The key is not just scope, it’s reviewability.

If Claude returns a thousand-line “improvement,” I haven’t tested dynamic workflows. We already know models can produce a lot of code.

The real test is:

Did it surface uncertainty?
Did it split the work sensibly?
Did it verify the result?
Did it preserve existing behavior?
Did it explain what changed and why?

3. Use Fast mode for iteration, not final judgment

Fast mode is useful when, obviously, speed matters. But speed is not the same thing as trust.

I’ll use it for the parts of the workflow where momentum matters more than precision: exploring options, drafting first passes, comparing approaches, rewriting small pieces, and moving through low-risk loops faster.

But I would not use it as the final judge for risky migrations, security-sensitive work, or production changes.

For those tasks, the question I want to answer is: can I trust this enough to ship it?

And that requires inspection.

Availability and Platform Notes

Table comparing where Claude Opus 4.8 appears across six surfaces, with what is known and a builder note for each. Anthropic API: regular pricing is $5 input / $25 output per million tokens and Fast Mode is $1.25 / $6.25, used as the source of truth for Anthropic pricing. Claude Code: dynamic workflows are the main launch feature, best fit for large codebase tasks with verification. GitHub Copilot: Opus 4.8 is generally available with a 15x premium request multiplier until usage-based billing launches, important for teams evaluating real cost. OpenRouter: lists Opus 4.8 with provider metadata including context and output fields, useful for API-intent search but provider specs need caveats. AWS: surfaced as an availability source, so treat platform availability as surface-specific and verify before building launch docs.

Final Thoughts

The next frontier is systems that plan, split work, delegate, verify, and hand results back to humans who still need to know what good looks like.

Which means the job is changing. But the responsibility is not.

FAQ

What changed in Claude Opus 4.8 versus Opus 4.7?
Opus 4.8 builds on Opus 4.7 with better judgment, stronger tool use, longer autonomous runs, and a roughly four-times reduction in letting flaws in its own code pass unremarked. It ships at the same price and adds effort control, Dynamic Workflows, and a cheaper Fast mode.

How much does Claude Opus 4.8 cost?
Regular usage is $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.7. Fast mode is $10 input and $50 output at 2.5x speed, three times cheaper than Fast mode on previous Opus models.

What are Dynamic Workflows in Claude Code?
Dynamic Workflows is a research-preview feature that lets Claude plan a large task, run hundreds of parallel subagents in one session, verify the results, and return a coordinated answer. It targets codebase-scale migrations, bug hunts, and security audits, and it uses meaningfully more tokens than a normal session.

What do the effort levels (high, extra, xhigh, max) actually do?
Effort levels control how much the model thinks before answering. Opus 4.8 defaults to high. Extra (xhigh in Claude Code) and max spend more tokens for better results on harder tasks. Anthropic recommends extra for difficult and long-running async work.

Is Claude Opus 4.8 better than GPT-5.5?
On agentic benchmarks like Online-Mind2Web (84%) and Anthropic’s Super-Agent eval, Opus 4.8 leads. On Terminal-Bench 2.1, GPT-5.5 reports higher, though the two used different harnesses, so the comparison depends on test setup. For agentic coding and computer use, Opus 4.8 is the stronger model at launch.

Claude Design Review: 48-Hour Builder's Test + Hero Prompts

Discussion about this post

Ready for more?