Claude Opus 4.8: What Changed, and How I'll Test It
Opus 4.8 is not the main event. It is the model that teaches you how to work before Mythos arrives.
TL;DR
Claude Opus 4.8 is Anthropic’s most capable widely available model, released May 28, 2026, at the same price as Opus 4.7: $5 per million input tokens and $25 per million output tokens. The API model string is claude-opus-4-8. The headline upgrade is judgment, not raw intelligence: Anthropic says it is around four times less likely than Opus 4.7 to let flaws in its own code pass unremarked. Three things ship alongside it: effort control on claude.ai, Dynamic Workflows in Claude Code, and a Fast mode that runs at 2.5x speed for three times less than it cost on previous Opus models. Opus 4.8 is a bridge to Claude Mythos, Anthropic’s more powerful model still gated for cybersecurity reasons under Project Glasswing.
Just when I thought I’d get a calm evening with For All Mankind, Opus 4.8 happened.
I went from dynamic space missions to dynamic workflows. I adapted immediately, because my plans are dynamic too.
Here’s what changed in forty-one days.
That’s how long Opus 4.7 lasted before Anthropic moved on to Opus 4.8.
Hey, I’m Karo Zieminski 🤗
AI Product Manager and builder.
I write Product with Attitude, an AI newsletter for thousands of subscribers developing critical AI literacy the only way it sticks: through practice.
Why trust this analysis: I’m looking at Claude Opus 4.8 from the builder’s seat: what changed, what breaks, what costs more, what needs testing, and what should not be trusted.
If you’re new here, welcome! Here’s what you might have missed:
What’s Inside
What Claude Opus 4.8 Is and What Builders Need to Know
Claude Opus 4.8 is Anthropic’s most capable widely available model for complex reasoning, long-horizon agentic coding, and high-autonomy work, released on May 28, 2026, at the same regular API price as Opus 4.7.
Opus 4.8 is a more effective collaborator, and Anthropic is positioning that as the headline.
It’s worth paying attention to if your work depends on reasoning, code review, browser-agent tasks, long-running Claude Code sessions, or model self-checking.
It builds on Opus 4.7 with improvements across benchmarks.
But Opus 4.7 got a “chilly reception” from some users. Opus 4.8 feels like Anthropic answering the 4.7 feedback in public: move fast, keep the price, fix the weak spots.
Claude Opus 4.8 fact sheet

The Workflow Shift
Anthropic calls it “a modest but tangible improvement,” which for an AI launch is refreshingly un-hyped. Someone in marketing was briefly sedated.
Three things ship alongside it.
Effort control on claude.ai.
Dynamic workflows in Claude Code.
Fast mode that now runs at 2.5× speed for 3× less than it cost on previous Opus models.
The price holding flat is a nice surprise. A better model for the same money is a rare thing in this market.
The product story is this: Opus 4.8 is less about one clever answer and more about controllable work. We get knobs for effort, speed, and orchestration. That’s where the real builder value is hiding. Annoyingly, also where the real token burn is hiding.
The Honesty Upgrade: Better Uncertainty Calibration
One of the most prominent improvements in Opus 4.8 is its honesty.
Opus 4.8 is supposed to answer better and, more importantly, doubt itself better.
This addresses a general problem with AI models jumping to conclusions and confidently claiming progress on thin evidence.
The concrete number: Anthropic's evaluations show Opus 4.8 is around four times less likely than Opus 4.7 to let flaws in code it wrote pass unremarked.
A Bridgewater tester told TechCrunch the biggest difference was its tendency to proactively flag issues with the inputs and outputs of an analysis, the kind other models leave for you to catch.
For developers, PMs, and builders, this takes the model one step closer to a review partner. Not a perfect one. Not a replacement for tests. But better.
Dynamic Workflows in Claude Code
Dynamic Workflows is a research-preview feature where Claude plans a large task, spawns tens to hundreds of parallel subagents, and verifies the results before returning a single coordinated answer. Built for migrations, codebase-wide bug hunts, and security audits. It consumes meaningfully more tokens than a normal session.
Dynamic workflows are the product launch that moves Claude Code into new territory. From a coding agent you steer into a temporary engineering team it orchestrates.
Claude plans a large task, breaks it into subtasks, spawns tens to hundreds of parallel subagents, and verifies the results before anything reaches you.
It’s the same pattern I’ve been tracking in my previous analyses of Perplexity Computer.
If you’re on a Max or Team plan, or running Claude Code through the API, dynamic workflows are already on by default.
To get started, ask Claude to create a workflow, or turn on the Claude Code-specific ultracode setting.
If you’re on an Enterprise plan, dynamic workflows launch switched off, but your admin can change this in the Claude Code settings.
The Demo Everyone Will Cite Is Bun
Anthropic needed one example that made dynamic workflows feel real.
Bun is that example.
A Bit of Context if You’re Not Deep in Developer Land
Bun is a developer tool used to run and build JavaScript apps. Part engine, part toolbox for web developers.
Bun was originally written in Zig, a fast programming language that gives developers more direct control over the machine. That control is powerful, but it also leaves more room for painful mistakes.
Rust is another programming language that is also fast, but it is especially known for safety. It helps prevent bugs that can crash systems or create security problems.
Porting means rewriting software from one programming language into another while preserving what it already does.
So when Anthropic says Claude helped port Bun from Zig to Rust, it does not mean Claude changed a button color.
It means Claude helped move a serious tool from one systems language to another while preserving what it already does.
Roughly 750,000 lines of Rust. 750,000 lines of code means a large-scale software rewrite, not a feature tweak.
Bun already had automated checks that test whether the software still works. After the rewrite, 99.8% of those checks still passed.
Eleven days from start to finish.
Hundreds of agents working in parallel.
Important caveat, and I appreciate Anthropic making this clear: this was not yet in production at the time of its post.
The second caveat matters to our wallets. Dynamic workflows use more tokens than a normal Claude Code session.
My cost-saving recommendations:
Don’t confuse can do a huge task with should be given a huge task immediately. Start small before pointing them at your whole codebase.
Use the expensive model where the cost of being wrong is higher than the cost of tokens.
Everywhere else, drop down a tier.
The Benchmarks, With Caveats for Claude Opus 4.8: SWE-Bench, OSWorld, GDPval, and Terminal-Bench
Source note: I’m separating three things here: confirmed Anthropic release details, launch-day benchmark claims, and what I still need to test myself.
Opus 4.8 improves across benchmarks, especially the agentic-work stack.
Here are the numbers Anthropic and the trackers report, atomic enough to lift one at a time.
SWE-Bench Verified: 88.6%. Software engineering on real GitHub issues.
SWE-Bench Pro: 69.2%, up from 64.3% on Opus 4.7. The harder coding set.
Online-Mind2Web: 84%. Computer use and browser-agent tasks. The standout result of the release.
OSWorld-Verified: 83.4%. Anthropic recalculated the Opus 4.7 score under the same new methodology, which makes this comparison cleaner.
GPQA Diamond: 93.6%. Graduate-level science reasoning.
USAMO 2026: 96.7%. Math olympiad.
GDPval-AA: 1890 Elo, up from 1753. Knowledge work.
Terminal-Bench 2.1: 74.6%. Here GPT-5.5 reports higher.
Coding goes from 64.3% to 69.2%, computer use from 82.8% to 83.4%, knowledge work from 1753 to 1890, and financial analysis from 51.5% to 53.9%.
On Terminal-Bench 2.1, Anthropic footnotes that GPT-5.5's reported 83.4% used the Codex CLI harness, while all models in their table used the Terminus-2 public harness. The number depends on how you run the test.
Now the caveats:
Remember:
Benchmarks are not just numbers. They are numbers produced by a testing method. When the method changes, the comparison changes too.
I mention this because I noticed that in OSWorld-Verified, Anthropic tested Opus 4.8 using a new methodology and also recalculated Opus 4.7 under that same method.
That’s useful, because it makes the comparison cleaner. But it also means the story changes before independent testers have had time to weigh in.
None of this means the numbers are inaccurate. It means they come with scaffolding: testing setups, scoring rules, tools, assumptions, and recalculations.
And critical AI literacy means understanding this before trusting the leaderboard. I learned this the hard way testing GPT-5.5’s citation reliability myself.
Effort Control: High, Extra, xhigh, Max in Claude Opus 4.8
Effort control is the product shift in this release. It sits next to the model selector on claude.ai and Cowork, and it lets you choose how hard Claude works on a response. It is available on all plans, including Free.
Opus 4.8 defaults to high effort. On coding tasks, high spends roughly the same tokens as Opus 4.7’s default but performs better.
For harder problems you can choose extra (xhigh in Claude Code) or max, where the model spends more tokens to get better results. Anthropic recommends extra for difficult tasks and long-running asynchronous workflows. For builders: don’t blindly swap model IDs and assume cost stays flat.
Anthropic recalibrated the effort levels versus Opus 4.7.
Medium allows somewhat more thinking, high somewhat less, xhigh substantially more.
The 2026 control syntax also dropped temperature, top_p, and top_k for Opus 4.8, which now return a 400 error.
Fast Mode in Claude Opus 4.8: When Speed Helps and When It Hurts
Claude Code users can now turn on Fast mode for Opus 4.8 with /fast. Same model, roughly 2.5x faster, and three times cheaper.
On the API, you can contact your account manager to request access or join the waitlist: http://claude.com/fast-mode.
Fast mode is the iteration mode, not the final-judgment mode.
Use it when speed matters: exploring, drafting, trying alternatives, moving through low-risk loops.
For final reviews, risky migrations, security-sensitive work, or anything where the blast radius is bigger than your coffee mug, slow down and inspect.
When to Use What
The cited coverage tells you what Opus 4.8 is. It does not tell you when to reach for which lever. Here is the read I'll be working from.
Simple rule: pay for Opus when the cost of being wrong is higher than the cost of tokens. Use cheaper modes when momentum matters more than precision. And when the model starts acting like an engineering team, manage it like one.
The Mythos Bridge: The Real Story Behind Claude Opus 4.8
Opus 4.8 is good.
But it’s probably not the main event.
Anthropic is already pointing toward Claude Mythos, a more powerful model currently available only to a small number of organizations for cybersecurity work. The reason it is gated is capability: Mythos is strong enough at offensive cybersecurity tasks that Anthropic is being careful about access.
The tell is in two places: the alignment data and this short paragraph from the announcement:
Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.
That paragraph changes how I read this release.
Opus 4.8 seems to be a bridge model and the rehearsal space for higher-autonomy work: effort control, parallel agents, self-verification, uncertainty signaling, and human review loops.
We get to practice the workflows before Mythos arrives and build the habits we’ll need before the next class of models lands.
Pricing for Claude Opus 4.8: Regular API, Fast Mode, Prompt Caching, and Batch Costs
Regular pricing is unchanged from Opus 4.7.
$5 per million input tokens and $25 per million output tokens for regular usage, unchanged from Opus 4.7. Fast mode is $10 input and $50 output at 2.5× speed, now 3× cheaper than fast mode on previous Opus models. Prompt caching saves up to 90%, batch processing up to 50%.
A stronger model at the same regular API price is the part worth underlining.
But dynamic workflows are the exception. They can burn more tokens because they are not one conversation. They are orchestration. More agents. More verification. More parallel work. More ways to accidentally turn a coding session into a big invoice.
Confirmed, Unconfirmed, and Still Untested
How I Plan to Test Opus 4.8 Next Week as a Claude Opus 4.8 Review Plan
This is the testing method I want more AI users to bring into their work: define the task, define the failure mode, define what evidence would change your mind.
1. Honesty under weak evidence.
I want to give Claude a task where “fast answer” and “correct answer” are not the same thing.
Something messy. Something with trade-offs. Something where the model has to compare options, explain uncertainty, and resist the urge to sound confident.
For example: feed it a messy article and a vague benchmark claim, then ask what it can verify versus what stays uncertain.
The honesty upgrade only counts if it shows up when the evidence is thin.
2. Code review self-doubt.
I'll give it flawed code, ask for a fix, then ask it to review its own fix. The four-times claim is about catching its own mistakes. So I'll make it look.
3. Dynamic workflows on bounded codebase problems.
I’m not going to point it at an entire app and ask it to “make everything better.” That’s not a test.
A real test needs a boundary.
I want to give Claude one contained problem where parallel work actually makes sense, then judge the output against something concrete.
Better tests:
migrate one module from one pattern to another
audit one folder for a specific class of bugs
refactor one subsystem without changing behavior
find duplicated logic across a defined part of the codebase
compare two implementation patterns and recommend one with trade-offs
map risky dependencies before touching the code
generate a test plan before making changes
The key is not just scope, it’s reviewability.
If Claude returns a thousand-line “improvement,” I haven’t tested dynamic workflows. We already know models can produce a lot of code.
The real test is:
Did it surface uncertainty?
Did it split the work sensibly?
Did it verify the result?
Did it preserve existing behavior?
Did it explain what changed and why?
3. Use Fast mode for iteration, not final judgment
Fast mode is useful when, obviously, speed matters. But speed is not the same thing as trust.
I’ll use it for the parts of the workflow where momentum matters more than precision: exploring options, drafting first passes, comparing approaches, rewriting small pieces, and moving through low-risk loops faster.
But I would not use it as the final judge for risky migrations, security-sensitive work, or production changes.
For those tasks, the question I want to answer is: can I trust this enough to ship it?
And that requires inspection.
Availability and Platform Notes
Opus 4.8 is available on claude.ai, Claude Code, and through the Claude API. Platform details matter because not every surface exposes the same controls in the same way. Claude Code gets Dynamic Workflows and /fast; claude.ai and Cowork get effort control; API users need to watch model IDs, Fast mode access, cache behavior, and unsupported sampling parameters. Read the docs before you swap IDs and blame the model for a broken integration.
Final Thoughts
The next frontier is systems that plan, split work, delegate, verify, and hand results back to humans who still need to know what good looks like.
Which means the job is changing. But the responsibility is not.
The next phase will reward people who stop asking bigger models vague questions and start designing workflows that make model work inspectable. Fewer black boxes. More systems. Good.
FAQ About Claude Opus 4.8
What changed in Claude Opus 4.8 versus Opus 4.7?
Claude Opus 4.8 adds better judgment, stronger tool use, longer autonomous runs, effort control, Dynamic Workflows, and cheaper Fast mode. Anthropic also says it is roughly four times less likely than Opus 4.7 to let flaws in its own code pass unremarked.
How much does Claude Opus 4.8 cost?
Claude Opus 4.8 keeps the regular Opus price: $5 per million input tokens and $25 per million output tokens. Fast mode costs $10 input and $50 output, runs about 2.5x faster, and is three times cheaper than previous Opus Fast mode pricing.
What are Dynamic Workflows in Claude Code?
Dynamic Workflows let Claude plan a large task, run many parallel subagents, verify results, and return one coordinated answer. They are built for codebase migrations, bug hunts, security audits, dependency mapping, and other large Claude Code workflows.
What do Claude Opus 4.8 effort levels do?
Effort levels control how much Claude thinks before answering. Opus 4.8 defaults to high. Extra, xhigh in Claude Code, and max use more tokens for harder tasks, long-running work, and higher-quality reasoning.
Is Claude Opus 4.8 better than GPT-5.5?
For agentic coding and computer-use benchmarks, Claude Opus 4.8 looks stronger at launch. It leads on Online-Mind2Web and Anthropic’s Super-Agent eval. GPT-5.5 reports higher on Terminal-Bench 2.1, but the test harnesses differ, so direct comparison needs caution.
Is Claude Opus 4.8 worth upgrading to from Opus 4.7?
Yes, if you use Claude for coding, code review, browser-agent tasks, long-running Claude Code work, or analysis where uncertainty matters. For simple drafting and summarization, cheaper models may still be the better default.
What is the Claude Opus 4.8 API model ID?
The Claude Opus 4.8 API model ID is claude-opus-4-8. API users should check model IDs, Fast mode access, cache behavior, and unsupported sampling parameters before swapping IDs.
Does Claude Opus 4.8 support a 1M token context window?
Yes. Claude Opus 4.8 supports a 1M token context window on Claude API, Amazon Bedrock, and Vertex AI, with 200k on Microsoft Foundry. Max output is 128k tokens. Long context helps, but structured inputs still matter.
When should I use Dynamic Workflows instead of a normal Claude Code session?
Use Dynamic Workflows for tasks that split cleanly into parallel subtasks and can be verified: migrations, audits, bug hunts, duplicated logic searches, dependency mapping, and bounded refactors. Use a normal Claude Code session for small, linear, exploratory, or low-token tasks.
Should I use Fast mode for production code review?
No. Use Fast mode for iteration, exploration, drafts, and low-risk loops. For production code review, security-sensitive work, risky migrations, or anything with real blast radius, use slower modes, inspect the output, run tests, and make Claude show its work.
Is Claude Opus 4.8 the same as Claude Mythos?
No. Opus 4.8 is widely available. Claude Mythos Preview is a more powerful model class currently limited to selected cybersecurity organizations under Project Glasswing. Opus 4.8 is the rehearsal model. Mythos is the next act.
What is the best way to test Claude Opus 4.8?
Test Claude Opus 4.8 on bounded AI workflows where success and failure are visible. Define the task, failure mode, and evidence that would change your mind. Then compare normal mode, Fast mode, effort levels, and Dynamic Workflows against real builder work.
You Might Also Enjoy
WHY SUBSCRIBE ・YOUR BENEFITS・ TOOLS I BUILT・CLAUDE HUB・PERPLEXITY HUB ・VIBE CODING HUB
















Thanks, Karo. I have to say, your launch deep dives are some of the most interesting ones I read. You always help me understand what to really pay attention to.
Framing 4.8 as the model that teaches you the workflow before Mythos is a sharp read on a point release that looks modest on benchmarks.
The 4x drop in letting its own code flaws slide matters more to me than raw capability, since self-review is where my agents bleed tokens and trust.
What's your threshold for deciding Dynamic Workflows earned their token cost rather than burned it? I log the same question against my own builds at theaifounder.substack.com.