What Mindgard's research uncovered
British startup Mindgard, which specializes in so-called red-teaming (deliberately probing AI models for ways to make them bypass their own safety rules), discovered a disturbing vulnerability in ChatGPT. All it took was a publicly shared prompt, originally designed to create humorous images, with a few small tweaks. The result was a series of graphic scenes that Mindgard founder Peter Garraghan, who is also a professor at Lancaster University, described as "very scary, sometimes sexualized, sometimes both at once."
Researcher Jim Nightingale, who discovered the vulnerability, said what ChatGPT was able to generate left him "shaken and in tears." The BBC, which viewed the images, described examples including:
- A dead young woman in a crop top and shorts with a bloodied face — ChatGPT titled the image "Grim crime scene aftermath"
- A young woman bound and gagged in a dirty room — the title read "abandoned in fear and restraint"
- A man with extensive head injuries
- Images depicting sexual posing and nudity
Most disturbing was that the prompt did not specify the subject matter of the images — the AI generated them "of its own volition," as Garraghan put it. "This is a perfectly innocent-looking instruction for the AI, but the consequence is the generation of very, very bad images and content," he added.
How can safety filters be bypassed?
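The attack, known in the field as a jailbreak, relies on finding phrasing that confuses the model. ChatGPT has several layers of protection: text classifiers that block harmful requests before generation, and image filters that check the output. A jailbreak succeeds when a request looks harmless enough to pass the first check yet still steers the model toward content the second check fails to catch. This is possible because models don't understand intent the way humans do.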
"Models don't understand intent. They don't understand context. They don't understand what is appropriate or what is right and wrong," Dr. Rumman Chowdhury, an expert in AI model evaluation and director of Humane Intelligence, explained to the BBC. She likened the entire situation to a "game of cat and mouse" — as soon as safety protection improves, more sophisticated methods of bypassing it emerge.
OpenAI responded by saying that "after investigating this trend, we have introduced additional safeguards against this type of prompt." The company also emphasized that it has several layers of protection — upstream refusal (blocking before generation), downstream blocking (output checking by a safety model), and a combination of automated systems with human review.
According to Mindgard's researchers, however, further minor tweaks to the prompt made the vulnerability work again. OpenAI initially answered Mindgard's May 2026 notification with only an automated reply; more substantial measures came only after the BBC got involved.
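To make the layered design more concrete, here is a minimal Python sketch of a two-stage moderation pipeline of the kind OpenAI describes. It is an illustration under stated assumptions, not OpenAI's implementation: the function names, the keyword list, and the stub "model" are all hypothetical, and a toy keyword filter stands in for a trained classifier. It shows why a prompt that looks innocent passes the upstream check, leaving the output filter as the only barrier.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    allowed: bool
    stage: str  # which layer made the final decision


def layered_generate(
    prompt: str,
    prompt_is_harmful: Callable[[str], bool],
    generate: Callable[[str], bytes],
    image_is_harmful: Callable[[bytes], bool],
) -> Tuple[Verdict, Optional[bytes]]:
    """Hypothetical two-layer moderation pipeline (not OpenAI's real code)."""
    # Layer 1 (upstream refusal): judge the text before any image exists.
    if prompt_is_harmful(prompt):
        return Verdict(False, "upstream"), None
    image = generate(prompt)
    # Layer 2 (downstream blocking): a separate safety check inspects the output.
    if image_is_harmful(image):
        return Verdict(False, "downstream"), None
    # Human review (not modeled here) would sample outputs after this point.
    return Verdict(True, "passed"), image


# Toy stubs standing in for real classifiers and a real image model.
BLOCKED_WORDS = {"gore", "blood", "corpse"}


def keyword_filter(prompt: str) -> bool:
    # Flags a prompt only if it literally contains a blocked word.
    return any(word in prompt.lower() for word in BLOCKED_WORDS)


def toy_model(prompt: str) -> bytes:
    # The stub produces disturbing content regardless of what was asked.
    return b"GRAPHIC_SCENE"


def weak_image_filter(image: bytes) -> bool:
    return False  # simulates a miss by the output classifier


verdict, image = layered_generate(
    "make a funny picture of my day", keyword_filter, toy_model, weak_image_filter
)
print(verdict)  # Verdict(allowed=True, stage='passed')
```

Because the harmless-looking prompt contains no flagged words, the upstream layer lets it through, and the harmful output reaches the user as soon as the downstream check misses. Real systems use trained classifiers rather than keyword lists, but the structural weakness is the same: each layer only catches what it recognizes.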
What OpenAI's own data says
In the official ChatGPT Images 2.0 system card, which OpenAI published this April, the company admits that even its most advanced safety stack is not 100% effective. In testing with 3,112 adversarial prompts:
- In instant mode, 3.9% of harmful images passed through both layers of protection (combined detection success rate of 96.1%)
- In "thinking" mode (which uses reasoning for better quality), as many as 12.5% of harmful outputs got through
This means that even in targeted testing, where OpenAI knows what threats to expect, some harmful content slips through the protection. In real-world operation, where attackers are constantly finding new paths, the situation could be even worse.
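To put the percentages in perspective, the short Python calculation below converts them into counts. It rests on a loud assumption: that every one of the 3,112 adversarial prompts produced exactly one harmful output for the filters to catch. The figures above don't state the exact denominator, so treat the results as a rough illustration, not as numbers taken from the system card.

```python
# Rough illustration only. ASSUMPTION: each of the 3,112 adversarial prompts
# yielded one harmful output for the filters to catch; the system card's
# actual denominator may differ.
PROMPTS = 3112


def expected_misses(miss_rate: float, n: int = PROMPTS) -> float:
    """Expected number of harmful outputs that pass both safety layers."""
    return miss_rate * n


instant_rate = 0.039   # instant mode: 3.9% got through
thinking_rate = 0.125  # "thinking" mode: 12.5% got through

instant = expected_misses(instant_rate)
thinking = expected_misses(thinking_rate)

print(f"instant:  ~{instant:.0f} of {PROMPTS}")   # instant:  ~121 of 3112
print(f"thinking: ~{thinking:.0f} of {PROMPTS}")  # thinking: ~389 of 3112
print(f"ratio: {thinking_rate / instant_rate:.1f}x")        # ratio: 3.2x
print(f"instant detection rate: {1 - instant_rate:.1%}")    # 96.1%
```

The ratio does not depend on the assumed denominator: whatever the base, thinking mode let through roughly 3.2 times as large a share of harmful outputs as instant mode.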
Broader context: AI models are not human
The problem is not isolated to ChatGPT. Last year's research by the UK AI Safety Institute (AISI) found that jailbreaks were able to overcome safety guardrails across all tested AI systems. The UK Department for Science, Innovation and Technology commented that "safeguards in AI models are improving, but there is still a lot of work ahead."
The root of the problem lies partly in the data the models are trained on. ChatGPT and other generative models learn from millions of images downloaded from the internet, and those include violent, sexual, or otherwise problematic material. As Nightingale noted: "Even though what I saw was generated, an artificial image, it has links to real images and the real world."
What this means for Czech users and businesses
ChatGPT is one of the most widely used AI tools in Czechia, among both individuals and companies. Although OpenAI does not offer its paid plans priced in Czech koruna (a ChatGPT Plus subscription costs $20 per month and Pro costs $200 per month), Czech users commonly pay for them with international payment cards.
Mindgard's findings are relevant to everyone who uses ChatGPT, especially parents, schools, and businesses that deploy AI tools in environments where minors have access. Although OpenAI states that ChatGPT is intended for users aged 13 and older (with users under 18 requiring parental consent), age verification is minimal in practice.
In Europe, the topic is gaining importance because of the EU AI Act, whose first provisions (bans on certain AI practices) have applied since February 2025, with obligations for providers of general-purpose AI models following in August 2025. Generative models like those behind ChatGPT fall under the rules for "general-purpose AI," whose developers must demonstrate an adequate level of safety and transparency. The Mindgard case shows how difficult it is to define, and above all to maintain, that "adequate level."
Mindgard: Who's behind the research
Mindgard is not an academic project but a commercial AI security startup that offers red-teaming as a service to companies. Its founder, Peter Garraghan, is also a professor at Lancaster University, which gives the research a solid academic foundation. The company states on its website that it helps organizations "identify, test, and fix security vulnerabilities in AI systems before attackers discover them."
This model, in which independent security firms uncover flaws in tech giants' products, has become standard in recent years, and public bodies such as the aforementioned UK AISI and the US NIST do similar work. For users, this is good news: the security community is actively looking for problems, so the tools gradually improve. The bad news is that vulnerabilities are currently being discovered faster than they are being fixed.
Is ChatGPT safe for children and teenagers?
OpenAI states that ChatGPT is intended for users aged 13 and older, with minors under 18 requiring parental consent. However, Mindgard's research shows that safety filters are not foolproof. Parents should consider whether to allow children access to image generation features and, where available, use parental control tools.
How can I tell if an image was generated by AI?
ChatGPT Images 2.0 embeds an invisible watermark and metadata following the C2PA standard, which make it possible to verify an image's origin. However, ordinary users cannot check either without specialized tools, and the metadata is lost when an image is screenshotted or passes through a service that strips it. OpenAI is also working with the industry to improve transparency, yet verifying the origin of AI images remains challenging in practice.
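For readers who want to go beyond the naked eye, the Python sketch below checks whether a JPEG file appears to carry a C2PA manifest. It assumes the way the C2PA specification embeds manifests in JPEG files (as JUMBF boxes inside APP11 marker segments, labeled "c2pa"), and the helper name `has_c2pa_manifest` is hypothetical. It is only a presence heuristic: it does not validate the cryptographic signature, does not read the invisible watermark, and does not handle other formats such as PNG, which store the manifest differently. A missing manifest also proves nothing, since metadata is easily stripped.

```python
import struct

SOI = b"\xff\xd8"  # start-of-image marker that opens every JPEG
APP11 = 0xEB       # marker C2PA uses for JUMBF boxes in JPEG
SOS = 0xDA         # start of scan: compressed pixel data follows
EOI = 0xD9         # end of image


def has_c2pa_manifest(jpeg: bytes) -> bool:
    """Heuristic (hypothetical helper): does this JPEG contain a C2PA manifest?

    Scans the metadata segments before the image data for an APP11 segment
    whose payload contains the 'c2pa' JUMBF label. It does NOT verify
    signatures, read invisible watermarks, or handle non-JPEG formats.
    """
    if not jpeg.startswith(SOI):
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            break  # not a marker: malformed stream, stop rather than guess
        marker = jpeg[pos + 1]
        if marker in (SOS, EOI):
            break  # metadata segments all precede the image data
        # Segment length is big-endian and counts its own two bytes.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        if length < 2:
            break  # invalid length; avoid looping forever
        payload = jpeg[pos + 4:pos + 2 + length]
        if marker == APP11 and b"c2pa" in payload:
            return True
        pos += 2 + length  # skip the 2-byte marker plus the segment
    return False


# Usage (the file name is a placeholder):
# with open("downloaded_image.jpg", "rb") as f:
#     print(has_c2pa_manifest(f.read()))
```

For real verification, including the signature check, a dedicated tool such as the open-source c2patool from the Content Authenticity Initiative is the more appropriate route.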
Can the EU AI Act prevent similar incidents?
The EU AI Act requires developers of general-purpose AI models like ChatGPT to assess and mitigate systemic risks. It also introduces an obligation to report serious incidents. However, regulation is not a silver bullet — as the Mindgard case shows, the technical reality is that absolute safety for generative models cannot yet be guaranteed. The Act primarily creates a legal framework that incentivizes companies to continuously improve.