Security Holes in Image Generation: How Researchers Bypass ChatGPT Filters

[Illustrative image: Security risk analysis in generative AI]
Researchers at British startup Mindgard have uncovered a critical vulnerability in OpenAI's latest models. Despite advanced filters, harmless-sounding instructions (prompts) can make ChatGPT generate graphic violence and explicit content. The finding puts the reliability of AI safety layers in the spotlight, especially given strict European regulation.

Artificial intelligence keeps advancing, but each step forward also brings new risks. While OpenAI works to integrate advanced image generation directly into its chatbot, security experts warn that its protection against inappropriate content remains far from impenetrable. According to BBC reports, subtle prompt modifications can coax even the most modern models, such as the current GPT-5.4, into creating content that should be strictly prohibited.

Mindgard and "red-teaming": How are AI barriers broken?

Mindgard specializes in red-teaming. In the context of artificial intelligence, this means experts deliberately attack a model, search for its weaknesses, and try to "persuade" the AI to violate its own safety protocols. The process is crucial for developing safe systems, but it also shows how easily even robust filters can be bypassed.

Jim Nightingale, an AI safety researcher at Mindgard, said the test results were shocking. The researchers found that a slight modification to an instruction originally intended for creating humorous images was enough to make the system generate graphic violence or sexually explicit scenes. Most concerning is that the prompt itself can look completely innocent. The AI then creates these images "on its own": not in response to a direct command to produce violence, but as a result of poor contextual understanding or a failure of its safety layers.

Mindgard founder Peter Garraghan emphasized that the resulting scenes can be very frightening. Although OpenAI says it implements multiple layers of protection, the researchers showed that small changes to the text are still enough to bypass these barriers.

Safety comparison: OpenAI vs. the competition

In the battle for dominance in the generative AI space, safety is becoming a key benchmark. If we compare OpenAI's approach with its main competitors, we see different strategies:

  • OpenAI (ChatGPT/DALL-E): Relies on a combination of text and image filters. As the GPT-5.4 case shows, its system is context-sensitive but can be "broken" through creatively worded prompts.
  • Anthropic (Claude): Its approach, based on the concept of Constitutional AI, tries to build a set of ethical principles directly into the model's core. Claude is generally considered a more conservative and "safer" model, although this does not mean it is immune to all attacks.
  • Google (Gemini): Google uses massive datasets to train safety filters and integrates them across its entire ecosystem. Gemini has very strict restrictions on generating images of real people, which leads to frequent refusals that some users find overly restrictive.
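
To make the idea of combined text and image filters more concrete, here is a minimal Python sketch of a two-stage check. It is an illustration under stated assumptions, not a description of how OpenAI or any other vendor actually implements moderation: the keyword blocklist, the `labels` field on the generated image, and the names `text_filter`, `moderate`, and `ModerationResult` are all hypothetical stand-ins for what would be learned classifiers in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical blocklist. Real text filters are trained classifiers,
# not keyword lists; this only stands in for "the prompt-level check".
BLOCKED_TERMS = {"gore", "nude"}


@dataclass
class ModerationResult:
    allowed: bool
    blocked_by: Optional[str] = None  # which layer rejected the request


def text_filter(prompt: str) -> bool:
    """Return True if the prompt passes the toy prompt-level check."""
    words = set(prompt.lower().split())
    return not (words & BLOCKED_TERMS)


def moderate(
    prompt: str,
    generate: Callable[[str], dict],
    image_filter: Callable[[dict], bool],
) -> ModerationResult:
    """Run the prompt check first, then check the generated output."""
    if not text_filter(prompt):
        return ModerationResult(False, "text_filter")
    image = generate(prompt)  # assumed to return a dict describing the image
    if not image_filter(image):
        return ModerationResult(False, "image_filter")
    return ModerationResult(True)
```

In this sketch, a prompt that contains no blocked words passes the first layer even if the generator goes on to produce something violent, so the output-level check is the only thing standing between that image and the user. That mirrors the situation described above, where the prompt looks innocent and the problem only appears in the result.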

Practical impact for users and businesses in the Czech Republic

What does this mean for the average Czech user or a business that uses AI? Above all, it is important to realize that ChatGPT Plus, which enables image generation via DALL-E 3, costs approximately $20 per month (roughly 470–490 CZK). When using this tool for creating marketing materials or visual concepts, businesses must account for the risk that AI may generate content that would be perceived as inappropriate or problematic.

For Czech businesses, the legislative framework is also crucial. As an EU member state, the Czech Republic is subject to the EU AI Act. This regulation places great emphasis on the safety and transparency of high-risk systems. A Czech company that uses tools generating harmful or misleading content (for example, synthetic photographs of political figures or violent scenes) could find itself in legal trouble over compliance with safety and ethics standards.

The threat of disinformation and digital manipulation

The problem is not just the "bad" images themselves, but also their ability to spread disinformation. As the case of the fake Pentagon explosion image showed, synthetic images can cause real chaos in markets and public opinion. In the context of elections and political crises, AI's ability to create convincing but false visual evidence is one of the greatest challenges for digital security today.

OpenAI states that it is implementing additional safeguards against problematic topics, but as Mindgard's research shows, it is a constant race between developers and those who seek ways to abuse the systems.

Is generating inappropriate content in ChatGPT prohibited?

Yes, OpenAI's terms of service strictly prohibit the creation of sexual or violent content. However, research shows that there are ways (so-called jailbreaking) to bypass these prohibitions using specific prompts.

How can I minimize the risk of inappropriate generated content within my company?

It is recommended to use models with a clearly defined "constitution" (such as Claude from Anthropic) and to always perform a human review (human-in-the-loop) of every visual output before publishing or using it in commercial communication.
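
A human-in-the-loop step can be enforced in tooling rather than left to habit. The following Python sketch shows one possible shape for such a gate: every generated image starts as pending, and only items that a named reviewer has approved are ever returned for publishing. The class and field names (`ReviewQueue`, `Asset`, `Status`) and the example identifiers are hypothetical, and a real workflow would add storage, authentication, and an audit trail.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Asset:
    asset_id: str
    prompt: str
    status: Status = Status.PENDING  # nothing is publishable by default
    reviewer: str = ""
    note: str = ""


class ReviewQueue:
    """Holds generated images until a human reviewer decides on them."""

    def __init__(self) -> None:
        self._assets: Dict[str, Asset] = {}

    def submit(self, asset_id: str, prompt: str) -> Asset:
        asset = Asset(asset_id, prompt)
        self._assets[asset_id] = asset
        return asset

    def review(self, asset_id: str, reviewer: str, approve: bool, note: str = "") -> None:
        if not reviewer:
            raise ValueError("a named human reviewer is required")
        asset = self._assets[asset_id]
        asset.status = Status.APPROVED if approve else Status.REJECTED
        asset.reviewer = reviewer
        asset.note = note

    def publishable(self) -> List[str]:
        # Only explicitly approved assets may leave the queue.
        return [a.asset_id for a in self._assets.values() if a.status is Status.APPROVED]


queue = ReviewQueue()
queue.submit("banner-1", "spring sale banner")
queue.submit("banner-2", "funny mascot")
queue.review("banner-1", "editor@example.com", approve=True)
queue.review("banner-2", "editor@example.com", approve=False, note="too graphic")
print(queue.publishable())
```

The design choice worth noting is the default: an unreviewed asset is treated the same as a rejected one, so a skipped review blocks publication instead of letting the image through.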

Is ChatGPT available in Czech for image generation?

Entering prompts in Czech is possible, but DALL-E 3 internally interprets English descriptions better. For the best results and the highest level of control over detail, it is still recommended to use English.
