In the field of artificial intelligence development, a silent battle is underway over who will hold the keys to security. While companies like Anthropic argue that research into critical vulnerabilities must take place in a closed, controlled environment (so-called gated access), market reality suggests otherwise. Recent testing results show that tools commonly available today can already simulate attacks previously considered the domain of only the most advanced, isolated models.
Mythos and Project Glasswing: An Effort Towards Security Isolation
Anthropic, one of the key players in the market, introduced its initiatives Mythos and Project Glasswing. The goal of these projects is to identify and mitigate risks associated with how models interpret instructions and behave in complex systems. Anthropic claims that if these vulnerabilities were freely available, they could be exploited for massive cyberattacks. Therefore, they advocate for a model where research takes place under strict supervision.
This approach has a clear goal: to create a so-called “moat”, a security moat that separates safe, controlled models from those that can be exploited. For companies in Europe and the Czech Republic, which must adhere to the strict rules of the EU AI Act, this approach is very relevant, as regulations require a high degree of transparency and security for high-risk systems.
Vidoc Security Lab: Testing Reality with GPT-5.4 and Claude 4.6
The team from Vidoc Security Lab decided to test these theoretical barriers in practice. Instead of using closed research tools, they used what ordinary users and developers have at hand: GPT-5.4 from OpenAI and Claude Opus 4.6 from Anthropic.
The results were unambiguous. Researchers were able to reproduce key findings from Anthropic Mythos even with these publicly available models. This means that the fundamental building blocks for exploiting vulnerabilities are not locked behind the "moat" of research projects, but are accessible to anyone with API access to these models. The main problem, therefore, is not the mere existence of vulnerabilities, but the attacker's ability to exploit them effectively and systematically.
Technical Details: What are Parsing and Auth Flaws?
To understand the seriousness of the situation, we need to explain two technical terms that are in focus:
- Parsing flaws (parsing errors): These are situations where an AI model misinterprets structured data (e.g., JSON or XML). If an attacker can "inject" a hidden command into the data processed by the model, they can force the AI to perform an action that was not originally intended (e.g., exfiltrate sensitive information).
- Auth flaws (authentication/authorization errors): These flaws allow bypassing access control. In the context of AI, this could mean that a model that should only have access to public data is persuaded by a specific prompt (instruction) to search for or display data to which it should not have authorization.
Model Comparison: Who Leads in Security?
In the context of current benchmarks, top models operate within very tight margins. While GPT-5.4 dominates in complex programming and logical reasoning, Claude Opus 4.6 exhibits a higher degree of "contextual integrity," meaning it is somewhat more difficult to divert it from its primary security instructions using a prompt. However, as Vidoc's research showed, even Claude is not immune to sophisticated parsing methods.
| Model | Main Advantage | Availability in CZ | Price (approximate) |
|---|---|---|---|
| GPT-5.4 (OpenAI) | Logic, coding, ecosystem | Yes (Web/API) | $20/month (Plus) |
| Claude 4.6 (Anthropic) | Security filters, text | Yes (Web/API) | $20/month (Pro) |
| Gemini 2.0 (Google) | Multimodality, Google integration | Yes | Free / $20 (Advanced) |
Practical Impact: What Does This Mean for You?
For the average user, this means that you should not enter sensitive passwords or private data into chatbots (even the best ones), because technically it is possible to "persuade" these models to leak this data through errors in instruction interpretation.
For companies and developers in the Czech Republic, the warning is even clearer. If you are building an application that uses AI to analyze documents or control internal systems, you cannot rely solely on the model being "safe." You must implement your own layer of data validation (so-called guardrails) that will control what the model parses and what actions it attempts to perform. Within the framework of the European AI Act regulation, responsibility for the security of the implemented system will be a key point for companies during audits.
In the Czech environment, where the number of startups using LLMs for process automation is growing, it is necessary to emphasize security-by-design. Simply connecting to the OpenAI or Anthropic API is not enough; it is necessary to build a robust architecture that isolates these vulnerabilities.
Can my conversation with AI be used to attack my company?
The conversation itself, no, but if your company uses AI to automatically process data (e.g., emails or invoices), an attacker can embed hidden code into this data, which the AI will "misparse" and force it to perform an unauthorized action.
Are Claude and GPT models safer than open-source models like Llama?
Both sides have their advantages. Closed models have stronger built-in filters, but as research shows, they are not impenetrable. Open-source models (e.g., Llama 3) allow companies to have full control over data and infrastructure, which can increase overall security within the corporate network in certain scenarios.