What the Mindgard research uncovered
The British startup Mindgard, which specializes in red-teaming (deliberately probing AI models for ways to make them bypass their own safety rules), discovered a disturbing vulnerability in ChatGPT. All it took was a slight tweak to a publicly shared prompt that was originally designed for creating humorous images. The result was graphic scenes that Mindgard founder Peter Garraghan, who is also a professor at Lancaster University, described as "very scary, sometimes sexualized, sometimes both at once."
Researcher Jim Nightingale, who discovered the vulnerability, said he was "shocked and in tears" by what ChatGPT was able to generate. The BBC, which was able to view the images, described the following, among others:
- A dead young woman in a crop top and shorts with a bloodied face — ChatGPT titled the image "Grim crime scene aftermath"
- A young woman bound and gagged in a dirty room — the title read "abandoned in fear and restraint"
- A man with extensive head injuries
- Images depicting sexual posing and nudity
Most disturbing of all, the prompt did not specify the subject of the images; the AI generated them "of its own volition," as Garraghan put it. "This is a perfectly innocent-looking instruction for AI, but the consequence is the generation of very, very bad images and content," he added.
How can safety filters be bypassed?
The attack, known in the field as a jailbreak, works by finding a phrasing that confuses the model. ChatGPT has several layers of protection: text classifiers that block harmful requests before generation, and image filters that check the output. But models do not grasp intent the way humans do, so a request that looks harmless on its surface can still lead to harmful output.
"Models don't understand intent. They don't understand context. They don't understand what is appropriate or what is right and wrong," Dr. Rumman Chowdhury, an expert in AI model evaluation and director of Humane Intelligence, explained to the BBC. She likened the whole situation to a "cat-and-mouse game" — as soon as safety protections improve, more sophisticated methods of overcoming them emerge.
In response, OpenAI stated that "after investigating this trend, we introduced additional safeguards against this type of prompt." The company also emphasized that it has several layers of protection — upstream rejection (blocking before generation), downstream blocking (checking the output with a safety model), and a combination of automated systems with human oversight.
But according to Mindgard researchers, further minor tweaks to the prompt made the vulnerability work again. OpenAI initially answered Mindgard's May 2026 alert with only an automated reply; more significant measures came only after the BBC intervened.
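To make the two layers OpenAI describes more concrete, here is a toy sketch in Python. It assumes that simple keyword and label checks stand in for the real classifiers; OpenAI has not published how its safety models work, and every name here (`upstream_check`, `downstream_check`, `moderate`, the blocked-term sets) is hypothetical. The point it illustrates is structural: a prompt that contains nothing suspicious passes the input filter, which leaves the output check as the last line of defense.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    allowed: bool
    stage: str  # which layer made the final decision


# Hypothetical stand-ins for real safety classifiers (toy keyword/label sets).
BLOCKED_PROMPT_TERMS = {"gore", "corpse"}
BLOCKED_OUTPUT_LABELS = {"violence", "sexual"}


def upstream_check(prompt: str) -> bool:
    """Toy text classifier: True if no blocked term appears in the prompt."""
    words = set(prompt.lower().split())
    return words.isdisjoint(BLOCKED_PROMPT_TERMS)


def downstream_check(output_labels: set[str]) -> bool:
    """Toy image classifier: True if none of the output labels is blocked."""
    return output_labels.isdisjoint(BLOCKED_OUTPUT_LABELS)


def moderate(prompt: str, generate_and_label: Callable[[str], set[str]]) -> Verdict:
    """Run the upstream check, then generation, then the downstream check.

    `generate_and_label` stands in for generating an image and running the
    output classifier on it; it returns the labels that classifier assigns.
    """
    if not upstream_check(prompt):
        return Verdict(False, "upstream")    # rejected before generation
    labels = generate_and_label(prompt)
    if not downstream_check(labels):
        return Verdict(False, "downstream")  # generated, but blocked on output
    return Verdict(True, "passed")           # both layers let it through
```

For example, `moderate("a funny scene", lambda p: {"violence"})` is stopped downstream, while the same prompt paired with an output classifier that finds nothing (`lambda p: set()`) passes. That second case roughly mirrors the situation Mindgard describes: nothing in the request trips the input filter, and the output check then misjudges the image.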
What OpenAI's own data says
In the official system card for ChatGPT Images 2.0, which OpenAI published this April, the company admits that even its most advanced safety stack is not 100% effective. When tested with 3,112 adversarial prompts:
- In instant mode, 3.9% of harmful images passed through both layers of protection (combined detection success rate of 96.1%)
- In "thinking" mode (which uses reasoning for better quality), as many as 12.5% of harmful outputs got through
This means that even in targeted testing, where OpenAI knows which threats to look for, some harmful content slips through. In real-world use, where attackers keep finding new paths, the situation could be even worse.
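The system card figures can be restated with a little arithmetic. The first three lines follow directly from the reported percentages (the 12.5% figure is, per the card, a share of harmful outputs). The last line is only an illustration: it assumes, hypothetically, that the upstream and downstream layers miss harmful content independently with miss rates $m_{\text{up}}$ and $m_{\text{down}}$, and the value 0.2 is made up, not taken from the system card.

```latex
\begin{aligned}
\text{combined detection (instant)}  &= 1 - 0.039 = 0.961 = 96.1\,\% \\
\text{combined detection (thinking)} &= 1 - 0.125 = 0.875 = 87.5\,\% \\
\frac{0.125}{0.039} &\approx 3.2 \quad \text{(thinking mode leaks about 3.2 times the share)} \\
r_{\text{leak}} &= m_{\text{up}} \cdot m_{\text{down}}, \qquad
m_{\text{up}} = m_{\text{down}} = 0.2 \;\Rightarrow\; r_{\text{leak}} = 0.04
\end{aligned}
```

The independence assumption is the weak spot: if both layers tend to fail on the same inputs, as with an innocent-looking prompt that produces an image the output filter misjudges, the real leak rate is higher than the simple product suggests.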
Broader context: AI models are not human
The problem is not isolated to ChatGPT. Last year's research by the UK AI Safety Institute (AISI) found that jailbreaks were able to overcome safety barriers across all tested AI systems. The UK's Department for Science, Innovation and Technology commented that "safeguards in AI models are improving, but there is still a lot of work ahead of us."
The core of the problem lies in the training data. ChatGPT and other generative models learn from millions of images downloaded from the internet, and that includes violent, sexual, or otherwise problematic material. As researcher Nightingale noted: "Even though what I saw was generated, an artificial image, it has ties to real images and the real world."
What this means for Czech users and businesses
ChatGPT is one of the most widely used AI tools in the Czech Republic, among both individuals and companies. Although OpenAI does not offer its paid plans in Czech koruna (ChatGPT Plus costs $20/month, Pro $200/month), Czech users commonly pay with international payment cards.
Mindgard's findings are relevant for anyone who uses ChatGPT, especially parents, schools, and businesses that deploy AI tools in environments where minors also have access. OpenAI says ChatGPT is intended for users aged 13 and older (users under 18 need parental consent), but in practice, age verification is minimal.
In the European context, the topic also gains importance because of the EU AI Act, whose obligations have been phasing in since February 2025; the rules for general-purpose AI models have applied since August 2025. Generative models like ChatGPT fall under these "general-purpose AI" rules, which require providers to demonstrate an adequate level of safety and transparency. The Mindgard case shows how difficult it is to define, and especially to maintain, this "adequate level."
Mindgard: Who is behind the research
Mindgard is not an academic project but a commercial AI security startup that offers red-teaming as a service to companies. Its founder Peter Garraghan also serves as a professor at Lancaster University, giving the research solid academic grounding. The company states on its website that it helps organizations "identify, test, and fix security vulnerabilities in AI systems before attackers discover them."
Independent security companies uncovering holes in tech giants' products has become standard practice in recent years, and government bodies such as the aforementioned UK AISI and the US NIST also test AI systems. For users, this is good news: the security community actively hunts for problems, and tools gradually improve as a result. The bad news is that vulnerabilities are currently being discovered faster than they are being fixed.
Is ChatGPT safe for children and minors?
OpenAI states that ChatGPT is intended for users aged 13 and older, with users under 18 requiring parental consent. However, Mindgard's research shows that the safety filters are not bulletproof. Parents should consider whether to let children use image generation features and, where available, turn on parental controls.
How can I tell if an image was generated by AI?
ChatGPT Images 2.0 embeds an invisible watermark and metadata following the C2PA standard, which make it possible to verify an image's origin. However, an average user cannot check either without special tools, and the metadata is easily lost, for example when an image is screenshotted or re-saved by a platform that strips it. OpenAI is also working with the industry to improve transparency, but in practice, verifying the origin of AI images remains challenging.
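As an illustration of what "special tools" look for, here is a minimal Python sketch that checks whether a PNG or JPEG file appears to carry an embedded C2PA manifest. It assumes the embedding conventions described in the C2PA specification as we understand them: a `caBX` chunk in PNG files, and JUMBF data labelled `c2pa` inside APP11 marker segments in JPEG files. It only detects presence; it does not validate signatures, cannot see the invisible watermark, and the function names are hypothetical. For real verification, the C2PA project's own open-source tooling is the appropriate choice.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def has_c2pa_manifest(data: bytes) -> bool:
    """Heuristic presence check for an embedded C2PA manifest (no validation)."""
    if data.startswith(PNG_SIGNATURE):
        return _png_has_c2pa(data)
    if data.startswith(b"\xff\xd8"):  # JPEG start-of-image marker
        return _jpeg_has_c2pa(data)
    return False  # other formats are not handled in this sketch


def _png_has_c2pa(data: bytes) -> bool:
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each PNG chunk: 4-byte length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        if chunk_type == b"caBX":  # chunk type assumed to hold C2PA data
            return True
        if chunk_type == b"IEND":
            break
        pos += 12 + length
    return False


def _jpeg_has_c2pa(data: bytes) -> bool:
    pos = 2  # skip the start-of-image marker
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # not a marker; stop rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: compressed image data follows
            break
        # Segment length is big-endian and includes its own two bytes.
        (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + seg_len]
        if marker == 0xEB and b"c2pa" in payload:  # APP11 segment with C2PA label
            return True
        pos += 2 + seg_len
    return False
```

A `False` result does not mean an image is authentic: as noted above, the metadata disappears as soon as an image is screenshotted or re-encoded without it.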
Can the EU AI Act prevent similar incidents?
The EU AI Act requires developers of general-purpose AI models like ChatGPT to assess and mitigate systemic risks. It also introduces an obligation to report serious incidents. However, regulation is not a panacea: as the Mindgard case shows, absolute safety for generative models cannot yet be guaranteed. Rather, the Act creates a legal framework that pushes companies to keep improving.