Anthropic's Claude Opus 4.8 beats GPT-5.5 on benchmarks: Handles agents and can admit mistakes

June 2, 2026, by Daniel Cesak

Anthropic has just released Claude Opus 4.8, the next iteration of its most powerful model, and for the first time positions it as the leader across benchmarks, systematically beating OpenAI's GPT-5.5. But the new model isn't just about raw performance. Opus 4.8 brings a change that developers and companies may appreciate even more: it can admit when it is unsure about something. In the era of agentic AI, where models make decisions independently, that trait matters more than another percentage point on a benchmark.

What Claude Opus 4.8 can do — and where it beats GPT-5.5

Anthropic introduced the model on May 28, 2026 as a direct response to OpenAI's May release of GPT-5.5. According to the official announcement on Anthropic's website, it is a hybrid reasoning model with a million-token context window. It builds on the Opus 4.7 architecture and delivers noticeably better performance in programming, agentic tasks, and professional knowledge work.

The key benchmark numbers are clear. On Terminal-Bench 2.1, Opus 4.8 scores 80.6% using the standard Terminus-2 test environment. GPT-5.5 scores lower under the same setup; OpenAI reports 83.4%, but only with its own Codex CLI environment. On OSWorld-Verified, a test that simulates interaction with an operating system, Opus 4.8 scores 84.3%, up from 82.3% for Opus 4.7.

On Online-Mind2Web, which measures a model's ability to control a browser and web applications, Opus 4.8 scored 84%. According to Miguel Gonzalez, tech lead at Browserbase, this is the "biggest leap" compared to both Opus 4.7 and GPT-5.5. On Finance Agent v2, Claude achieved 83.6% (for comparison, Gemini 3.5 Flash scores 57.9%).

Comparison with the previous generation

Compared to Opus 4.7, which arrived in April 2026, the improvement is most noticeable in consistency. Opus 4.7 occasionally produced verbose code comments and imprecise tool calls; version 4.8 eliminates these issues, according to Scott Wu of the startup Cognition (creator of Devin). "Opus 4.8 uses tools cleanly and follows instructions with the consistency our autonomous engineering workflows require," Wu stated.

Agentic AI with better judgment: Why it matters

The biggest qualitative leap in Opus 4.8 is not raw performance but how the model communicates its own uncertainty. Anthropic states that Opus 4.8 is roughly four times less likely to let a code error slip through without comment. The company's alignment team noted that the model "reaches new highs in prosocial traits, such as supporting user autonomy and acting in the user's best interest."

In practice, this means that when Claude is unsure about a solution, it says so instead of "hallucinating" a confident but incorrect answer. For companies deploying AI agents in production, this is a critical property. "The biggest difference was Opus 4.8's tendency to proactively flag issues with analysis inputs and outputs — something other models routinely overlooked," said Michael Ran from an investment firm that tested the model on long analytical tasks.

Tom Pritchard, staff engineer at Anthropic, summed it up: "In Claude Code, it asks the right questions, catches its own mistakes, and can say no when a plan doesn't make sense."

New features: Dynamic Workflows and effort control

Along with the model, Anthropic is launching several new features. The most interesting is Dynamic Workflows in Claude Code, a feature that lets the model plan out a large task and then launch hundreds of parallel sub-agents in a single session. The sub-agents work independently; Claude then verifies their outputs and reports the result. In practice, this means Claude Code with Opus 4.8 can handle, for example, a codebase migration across hundreds of thousands of lines from start to merge.
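
The plan, fan-out, verify, report pattern described above can be sketched in plain Python. This is only an illustration of the general orchestration idea using the standard library's thread pool; it is not how Claude Code implements Dynamic Workflows, and `plan`, `run_subtask`, `verify`, and `orchestrate` are hypothetical placeholders standing in for model calls.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(task: str, n_parts: int) -> list[str]:
    """Split a large task into independent subtasks (placeholder planner)."""
    return [f"{task} [part {i + 1}/{n_parts}]" for i in range(n_parts)]


def run_subtask(subtask: str) -> str:
    """Stand-in for a sub-agent; a real one would call a model or a tool."""
    return f"done: {subtask}"


def verify(subtask: str, output: str) -> bool:
    """Stand-in check that the output matches the subtask it was produced for."""
    return output == f"done: {subtask}"


def orchestrate(task: str, n_parts: int, max_workers: int = 8) -> dict:
    subtasks = plan(task, n_parts)
    # Fan out: run the subtasks in parallel; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(run_subtask, subtasks))
    # Verify every output before reporting a summary.
    failed = [s for s, o in zip(subtasks, outputs) if not verify(s, o)]
    return {
        "total": len(subtasks),
        "verified": len(subtasks) - len(failed),
        "failed": failed,
    }


if __name__ == "__main__":
    print(orchestrate("migrate module", n_parts=5))
```

The point of the sketch is the separation of roles: the workers never report directly, and only verified outputs count toward the final summary.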

The second new feature is effort control, available directly in the claude.ai and Claude Cowork interfaces. Users can choose how much compute time and how many tokens the model spends on a request. On a higher setting, Claude thinks longer and returns more thorough answers; on a lower one, it responds faster and uses less of the usage limit. For everyday use this is practical: a basic level suffices for a simple query, while for complex analysis you switch to "extra" or "max."

Developers can now insert system instructions directly into the message field of the API, which lets them change the model's instructions on the fly without disrupting the prompt cache.

Pricing and availability: What it costs and the situation in Czechia

Anthropic keeps pricing unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. Fast mode, in which the model runs 2.5× faster, now costs $10 per million input tokens and $50 per million output tokens, which is three times cheaper than fast mode was on previous models.
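
To make these rates concrete, here is a minimal Python sketch that estimates the cost of a single API request from the per-million-token prices quoted in this article. The helper name `estimate_cost` and the example token counts are illustrative assumptions, not part of any Anthropic SDK.

```python
from decimal import Decimal

# Per-million-token prices in USD, as quoted in the article.
PRICES = {
    "standard": {"input": Decimal("5"), "output": Decimal("25")},
    "fast": {"input": Decimal("10"), "output": Decimal("50")},
}

MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> Decimal:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    if mode not in PRICES:
        raise ValueError(f"unknown mode: {mode!r}")
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    rates = PRICES[mode]
    total = rates["input"] * input_tokens + rates["output"] * output_tokens
    return total / MILLION


if __name__ == "__main__":
    # Made-up example: 200,000 input tokens and 20,000 output tokens.
    print(estimate_cost(200_000, 20_000))          # standard mode
    print(estimate_cost(200_000, 20_000, "fast"))  # fast mode
```

At the quoted rates, fast mode costs exactly twice as much as standard mode for the same token counts, so the choice comes down to whether the 2.5× speedup is worth doubling the bill.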

For end users, Opus 4.8 is available in the Pro subscription for $20 per month ($17 with annual billing), in the Max plan starting at $100, and in Team and Enterprise business plans. The Czech Republic is on the official list of supported countries — Claude is available here both via API and through the web interface at claude.ai.

Anthropic also confirmed it is working on cheaper models with capabilities close to Opus — and also on the release of Mythos-class models, which represent another performance tier. These models are already being tested by several organizations as part of Project Glasswing, and according to the company, should reach the wider public "in the coming weeks."

What it means: The Anthropic vs. OpenAI battle intensifies

The release of Opus 4.8 comes at a time when Anthropic has surpassed OpenAI in market valuation — at $965 billion after its latest Series H funding round, it is currently the world's most valuable AI startup. At the same time, the company filed a confidential IPO application (draft S-1), suggesting an imminent stock market entry.

For Czech developers and businesses, this rivalry means one thing: pressure on both quality and price. A year ago, the GPT vs. Claude battle was more of an academic debate; today it is a practical decision about which model to use for company agents, which for code development, and which for customer communication.

Opus 4.8 shows that Anthropic is betting on decision quality, not just raw intelligence. A model that catches its own mistakes is, for agentic workflows — where AI works for hours without human oversight — far more valuable than a model that is a percentage point better on a test but silently fails.

And that's exactly where the main story of Opus 4.8 lies. It's not just another iteration. It's a model that for the first time elevates honesty to the level of performance — and in doing so, changes the rules of the game.

Do I have to pay more for Opus 4.8? Have subscription prices increased?

No, prices remain the same as for Opus 4.7. The API costs $5 per million input tokens and $25 per million output tokens. The Pro subscription remains at $20 per month, and Max starts at $100. Fast mode is actually three times cheaper than before.

Does Claude Opus 4.8 support Czech and how well does it handle it?

Yes, Claude models have long supported Czech at a very good level. Anthropic does not officially list Czech among its primary languages, but in practice the model communicates fluently in Czech, understands Czech cultural and local context, and handles technical terminology well. The Czech Republic is also on the official list of supported countries.

When will Claude Mythos be available for regular users?

Anthropic says that Mythos-class models should reach the wider public "in the coming weeks." They are currently being tested by several dozen organizations as part of Project Glasswing, primarily on cybersecurity tasks. An exact public release date has not yet been announced.