Ten-round test revealed the Achilles' heel
Independent testing published on June 3, 2026, put the Claude Opus 4.8 and 4.7 models through identical scenarios across four professional domains: programming, medicine, finance, and law. The goal was to determine how often the models fabricate information (hallucinate) and whether they acknowledge their limits when they don't know the answer.
The result? In coding and medical queries, Opus 4.8 performed as well as or better than version 4.7. Legal questions were the breaking point: the newer model made more errors and signaled uncertainty less often. In other words, where it should have said "I'm not sure about that," it preferred to answer, and often answered incorrectly.
The testing methodology cross-checked results against outputs from other AI systems to distinguish Claude-specific errors from general language model limitations. The failures therefore appear to be specific to Opus 4.8 rather than a weakness of the entire field.
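To make the two measured quantities concrete, here is a minimal sketch of how graded answers could be tallied into an error rate and an uncertainty (abstention) rate per model and domain. The test's actual scoring scheme was not published in this article, so the `Graded` record, the three outcome labels, and the toy records are all assumptions for illustration, not the testers' data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Graded:
    """One graded answer. The outcome labels are an assumed scheme."""
    model: str
    domain: str
    outcome: str  # "correct", "wrong", or "abstained"


def rates(records):
    """Return error and abstention rates keyed by (model, domain)."""
    counts = defaultdict(lambda: defaultdict(int))
    for r in records:
        counts[(r.model, r.domain)][r.outcome] += 1

    result = {}
    for key, c in counts.items():
        total = sum(c.values())
        result[key] = {
            "error_rate": c["wrong"] / total,
            "abstention_rate": c["abstained"] / total,
        }
    return result


# Invented toy records, NOT results from the published test.
toy_records = [
    Graded("old", "law", "correct"),
    Graded("old", "law", "abstained"),
    Graded("old", "law", "wrong"),
    Graded("old", "law", "correct"),
    Graded("new", "law", "wrong"),
    Graded("new", "law", "wrong"),
    Graded("new", "law", "correct"),
    Graded("new", "law", "correct"),
]

summary = rates(toy_records)
for key in sorted(summary):
    print(key, summary[key])
```

A pattern like the one described for law would show up here as a higher `error_rate` combined with a lower `abstention_rate` for the newer version.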
The honesty paradox: Anthropic promised the opposite
When Anthropic launched Claude Opus 4.8 on May 28, 2026, "honesty" was the main marketing hook. The company claimed the model is "approximately 4× less likely to let code errors pass unnoticed" and that it "more often points out uncertainties in its work." CEO Dario Amodei repeatedly emphasized that Claude is a safer alternative to competing models from OpenAI and Google.
But the legal test results paint a different picture. A model that is supposed to be more honest fails more in law than its predecessor. This is not just a technical problem, but a reputational one — Anthropic built its brand on the principle of "constitutional AI" and safety. If independent tests show regression in an area where accuracy is critical, the credibility of the entire mission takes a hit.
Why law specifically? The specifics of legal reasoning
The legal domain is exceptionally challenging for language models. In programming, an error is often close to binary (code either works or it doesn't), and medicine has relatively fixed diagnostic trees; legal analysis, by contrast, requires weighing precedents, jurisdiction-specific rules, and contextual interpretation. The task is not just knowing the statute, but understanding which statute applies in a specific situation and how it interacts with other norms.
Experts speculate that Claude Opus 4.8 may have been optimized for certain types of tasks (such as agentic coding, where it achieved a SWE-Bench Pro score of 69.2%) at the cost of degraded capabilities in other areas. This trade-off, in which gains on one set of benchmarks come with losses on others, is a known problem in LLM development. Meta, for example, encountered it while tuning Llama models.
What this means for companies and legal departments
For enterprises considering deploying AI for due diligence, contract review, preparation of regulatory filings, or clause analysis, these findings represent a fundamental complication. Regression between versions — meaning the newer model fails at what the older one could do — undermines trust in consistent performance across updates.
If version 4.8 fails on legal queries that 4.7 handled, what guarantee does a company have that version 4.9 won't bring further regressions? For legal departments of Fortune 500 companies that are gradually testing AI, this is a critical question for risk assessment.
In the European context, the situation is even more sensitive. The EU AI Act treats certain legal uses of AI as high-risk, notably systems that assist judicial authorities in researching and interpreting facts and the law. Where a deployment falls into a high-risk category, relying on a model with documented failures in the legal domain could complicate compliance and expose the company to regulatory penalties. For Czech law firms and corporate legal teams considering AI deployment, the practical advice is to test on your own data rather than rely on general benchmarks.
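Testing on your own data can be as simple as running the same in-house cases against both versions and listing what the older one got right and the newer one got wrong. The sketch below assumes you already have a pass/fail verdict per case from expert review; the function names, the case IDs, and the zero-regression threshold are hypothetical choices, not part of any vendor tooling.

```python
def find_regressions(old_pass, new_pass):
    """Return IDs of cases the older version passed and the newer one failed.

    Both arguments map a case ID to True (passed review) or False (failed).
    Only cases present in both runs are compared.
    """
    shared = old_pass.keys() & new_pass.keys()
    return sorted(cid for cid in shared if old_pass[cid] and not new_pass[cid])


def upgrade_gate(old_pass, new_pass, max_regressions=0):
    """Return (ok, regressions); ok is False when there are too many regressions."""
    regressions = find_regressions(old_pass, new_pass)
    return len(regressions) <= max_regressions, regressions


# Hypothetical in-house verdicts; the case IDs are invented for illustration.
old_version = {"nda-01": True, "lease-02": True, "gdpr-03": False}
new_version = {"nda-01": True, "lease-02": False, "gdpr-03": True}

ok, regressions = upgrade_gate(old_version, new_version)
print(ok, regressions)  # False ['lease-02']
```

Note that the newer version fixing `gdpr-03` does not cancel out the regression on `lease-02`: for risk assessment, a case that used to work and no longer does is the signal that matters.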
Anthropic remains silent — and that's a problem
A key unknown remains how Anthropic will respond to the independent test results. The company has yet to issue any official statement on the legal failure of Opus 4.8. A transparent explanation of what changed between versions 4.7 and 4.8 would strengthen trust. Silence, on the other hand, fuels speculation that the safety rhetoric did not prevent the same "corner-cutting" that plagues the competition.
For comparison: OpenAI publishes detailed system cards with independent audit results for its models. Google similarly publishes detailed benchmarks across domains. If Anthropic wants to remain a leader in safe AI, such transparency should be a given.
Broader lessons: benchmarks versus reality
The Claude Opus 4.8 case illustrates a fundamental problem across the AI industry: academic benchmarks don't reflect real-world deployment. A model can excel in abstract reasoning tests, but if it fails in a practical legal scenario, benchmark scores are worthless. For enterprise buyers, domain-specific testing is critical — and it was precisely this that exposed a problem in Opus 4.8 that would otherwise have remained hidden.
It's also unclear whether the failures apply to all types of legal reasoning or only specific subdomains. Contract interpretation differs from tort law analysis, which in turn differs from regulatory compliance. A detailed breakdown of failures would help companies map safe use cases. Without it, we're left with general warnings that may be overblown — or conversely, insufficient.
Is Claude Opus 4.8 safe for everyday use even though it fails in legal tests?
For most everyday tasks — writing texts, data analysis, programming — Opus 4.8 is fully usable and in many respects better than previous versions. The problem specifically concerns the legal domain. If you're not a lawyer and don't use the model for legal analysis, the failure likely doesn't affect you.
How does Claude Opus 4.8 compare on price versus the competition?
API prices remained the same as Opus 4.7: $5 per million input tokens and $25 per million output tokens. For comparison: GPT-5.5 from OpenAI costs $15 per million input and $75 per million output tokens — Claude is therefore roughly a third of the price. According to Artificial Analysis, Opus 4.8 also needs 35% fewer output tokens than version 4.7, so real costs may be even lower.
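The arithmetic behind that comparison can be checked in a few lines. The prices and the 35% figure are the ones quoted above; the workload of one million input and one million output tokens is a hypothetical example, and the sketch assumes that GPT-5.5 would use the same token counts as Opus 4.7 and that the 35% saving applies uniformly to your workload, neither of which is guaranteed.

```python
# Prices in USD per million tokens, as quoted in the answer above.
PRICES = {
    "opus": {"input": 5.0, "output": 25.0},      # same for Opus 4.7 and 4.8
    "gpt-5.5": {"input": 15.0, "output": 75.0},
}


def cost_usd(model, input_tokens, output_tokens):
    """Cost of a workload in USD at the per-million-token prices above."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Hypothetical workload (assumption, not a measured figure).
INPUT_TOKENS = 1_000_000
OUTPUT_TOKENS = 1_000_000
OUTPUT_SAVING = 0.35  # "35% fewer output tokens", per Artificial Analysis

opus_47 = cost_usd("opus", INPUT_TOKENS, OUTPUT_TOKENS)
opus_48 = cost_usd("opus", INPUT_TOKENS, OUTPUT_TOKENS * (1 - OUTPUT_SAVING))
gpt_55 = cost_usd("gpt-5.5", INPUT_TOKENS, OUTPUT_TOKENS)

print(f"Opus 4.7: ${opus_47:.2f}")
print(f"Opus 4.8: ${opus_48:.2f}")
print(f"GPT-5.5:  ${gpt_55:.2f}")
```

Under these assumptions the workload costs $30 on Opus 4.7, about $21.25 on Opus 4.8, and $90 on GPT-5.5, so the list-price ratio is one third and the effective ratio for Opus 4.8 is closer to a quarter.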
Can a Czech company test Claude Opus 4.8 before deploying it into production?
Yes. Claude is available through the web interface claude.ai with a free tier (limited daily queries) and via API. For enterprise testing, we recommend the Team plan ($30/month/user) or Enterprise. Claude communicates reasonably well in Czech; for legal texts, we recommend working in the language of the source documents, typically Czech or English. For any legal deployment, however, outputs must always be reviewed by a qualified expert.