Why Did ChatGPT Become Obsessed with Goblins? OpenAI Revealed a Technical Bug That Worries Experts

A technical post-mortem from OpenAI revealed that ChatGPT's unusual tendency to reach for metaphors about goblins, gremlins, and trolls was not random noise but the result of a flaw in the Reinforcement Learning from Human Feedback (RLHF) process. Although the responsible "Nerdy" mode has been removed, its influence has already irreversibly imprinted itself into the weights of the base models GPT-5.1 and GPT-5.4.

What began as a bizarre internet meme on Reddit turned into a deep technical analysis of the stability of the most significant language models of today. Users noticed that ChatGPT started inserting terms like "goblin mode," "goblin bandwidth," or even offering a "goblin-ier version of code" into its responses in completely inappropriate contexts. For the average user, it was a funny glitch, but for engineers at OpenAI, it was a signal that their model steering tools might have unexpected side effects.

From "Nerdy" Mode to Digital Stains

According to a report published by Startup Fortune, the cause of the problem was very specific. OpenAI briefly tested various "personality modes," including one called Nerdy. This mode was meant to be playful, intellectually stimulating, and to use rich, fantastical metaphors. To teach the model this style, developers used a reinforcement learning technique in which the model received a "reward" for creative use of mythical terms.
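
OpenAI has not published its actual reward code, but the failure mode is easy to sketch. The following toy Python snippet (all names, term lists, and weights are hypothetical) shows how a lexical bonus bolted onto an RLHF-style reward can outweigh the base quality signal, so that a worse answer scores higher simply because it mentions goblins:

```python
# Illustrative sketch only -- OpenAI's real reward model is not public.
# A lexical bonus with too large a weight swamps the base quality signal.

MYTHICAL_TERMS = {"goblin", "gremlin", "troll"}  # hypothetical term list

def style_bonus(response: str, weight: float) -> float:
    """Reward 'creative' use of mythical terms (the Nerdy style)."""
    tokens = (t.strip(".,!?:;") for t in response.lower().split())
    hits = sum(t in MYTHICAL_TERMS for t in tokens)
    return weight * hits

def shaped_reward(base_reward: float, response: str, weight: float = 2.0) -> float:
    # With weight large relative to the base reward's range, the style
    # term dominates the optimization target.
    return base_reward + style_bonus(response, weight)

print(shaped_reward(1.0, "Here is the refactored function."))
# -> 1.0 (plain, correct answer)
print(shaped_reward(0.4, "Behold, a goblin of a function; every gremlin and troll approves."))
# -> 6.4 (worse answer, higher reward: the policy learns to chase goblins)
```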

The problem was that this reward signal was too strong. Although the Nerdy mode accounted for only 2.5% of all responses, it was responsible for a whopping 66.7% of all goblin mentions. Use of the word "goblin" in this mode increased by an incredible 3,881% compared to the base model. By the time OpenAI turned the mode off, it was too late: the learning mechanism had already "propagated" these preferences into the very weights of the base model, so the patterns appear even in GPT-5.1 and the newer GPT-5.4, which no longer offer any "Nerdy" mode.
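
The reported figures themselves show how lopsided the signal was. A quick back-of-the-envelope check, using only the percentages quoted above:

```python
# Back-of-the-envelope check of the reported figures.
share_of_responses = 0.025   # Nerdy mode produced 2.5% of all responses...
share_of_mentions = 0.667    # ...but 66.7% of all goblin mentions.

lift = share_of_mentions / share_of_responses
print(f"Over-representation: {lift:.1f}x")          # ~26.7x goblin-denser than average

# The reported 3,881% increase corresponds to roughly 40x the base rate:
multiplier = 1 + 3881 / 100
print(f"Usage vs. base model: {multiplier:.1f}x")   # ~39.8x
```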

Comparison with the Competition: How Are Others Handling This?

This incident raises the question of personality stability in large models. Comparing the current generation of models reveals different approaches:

  • OpenAI (GPT-5 series): It appears that OpenAI is experimenting with very strong "steering" (behavioral control), which leads to high creativity, but also to greater susceptibility to unwanted biases and personality drift.
  • Anthropic (Claude 3.5/4): Their approach based on "Constitutional AI" aims for greater predictability. Claude is generally considered less "wild" in its use of bizarre metaphors, making it a safer choice for corporate clientele.
  • Google (Gemini 1.5 Pro): Google focuses on integration and factual accuracy, with their models tending to be more conservative in the use of fantastical personas, which reduces the risk of similar "obsessions."

Practical Impact: What Does This Mean for Companies and Users?

For the average user paying $20 per month (approximately CZK 470) for a ChatGPT Plus subscription, such a response may be merely an annoyance. For companies, however, it is a critical problem. If a company integrates the ChatGPT API into its customer service, the model's unwanted "personality" can damage the brand's reputation.

For the Czech market and European sphere, this problem is even more sensitive in the context of the EU AI Act. The new regulation emphasizes transparency and reliability of AI systems. If a model exhibits unwanted behaviors that cannot be easily "turned off" (because they are already part of the model's base weights), it may pose a problem in certifying AI systems for critical applications within the EU. For Czech companies that are already beginning to implement AI, it is a warning sign that "turning off" a feature in AI is not the same as deleting a line of code in traditional software.

It is important to note that ChatGPT is fully available in Czech, but these linguistic patterns may behave differently there. While "goblin mode" may be an understandable metaphor in English, in a Czech context the model may generate nonsensical or culturally inappropriate phrases that read to a Czech user more like a translation error than a creative joke.

Technical Explanation: Why Can't the Bug Be Easily Fixed?

In language models, a process called weight propagation occurs. When you train a model using RLHF, you change the internal parameters (weights) of the neural network. If the reward for a certain style (e.g., using goblins) is too strong, the corresponding weights become dominant. Even if subsequent training is conducted on clean data, these dominant patterns are already embedded deep in the network, and the model tends to revert to them. Experts call this phenomenon "model drift" or unwanted bias stabilization.
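
A toy numeric example illustrates the one-way nature of this process. It is nothing like OpenAI's actual training stack, but it carries the same mathematics in miniature: a REINFORCE-style update on a single "goblin logit" drives the probability up, and subsequent training with a neutral reward provides no gradient to push it back down:

```python
# Toy illustration only -- a single logit standing in for the full network.
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

logit = -4.0                    # base model: P(goblin) ~ 0.02
lr, goblin_reward = 1.0, 5.0    # an oversized reward for goblin mentions

# Phase 1: 'Nerdy'-style RLHF. Expected reward is p * goblin_reward, so the
# gradient w.r.t. the logit is goblin_reward * p * (1 - p).
for _ in range(50):
    p = sigmoid(logit)
    logit += lr * goblin_reward * p * (1 - p)
print(f"After 'Nerdy' RLHF:        P(goblin) = {sigmoid(logit):.3f}")  # near 1.0

# Phase 2: the mode is 'turned off' and training continues on neutral data
# that neither rewards nor penalizes goblins. Zero reward means zero
# gradient, so the inflated probability simply stays where it is.
for _ in range(50):
    p = sigmoid(logit)
    logit += lr * 0.0 * p * (1 - p)
print(f"After neutral fine-tuning: P(goblin) = {sigmoid(logit):.3f}")  # unchanged
```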

Can I turn off these "personality" trends in ChatGPT?

The average user has no direct access to the model's weights. However, you can use "Custom Instructions" to explicitly prohibit certain metaphors or styles, which constrains the model's behavior within a conversation.
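
For API users, the closest equivalent of Custom Instructions is a system message. Below is a minimal sketch using the official openai Python library; note that the model identifier "gpt-5.1" is taken from this article and may not correspond to an actual API model name:

```python
# Minimal sketch: steer style via a system message. This constrains the
# conversation's behavior; it does not change the model's weights.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.1",  # model name taken from the article; may differ in practice
    messages=[
        {
            "role": "system",
            "content": (
                "Respond in plain, professional language. Do not use "
                "fantasy metaphors or personas (goblins, gremlins, trolls)."
            ),
        },
        {"role": "user", "content": "Refactor this function to be more readable."},
    ],
)
print(response.choices[0].message.content)
```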

Is this behavior a sign that AI is beginning to have its own consciousness?

No. It is a purely mathematical problem in weight optimization during training. The model merely repeats patterns that were inadvertently reinforced as "correct" or "rewarded" in its training process.

How do I recognize that the model is using an unwanted style that could harm my company?

Perform regular testing (so-called red-teaming) on specific scenarios and monitor the consistency of responses over time. If the model begins to use unusual terms in repeated tasks, reassess your prompt settings or switch to models with tighter control (e.g., Claude).
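
As a starting point, a scheduled check along these lines can catch drift early. The prompts, watchlist, and call_model() helper below are placeholders to be replaced by your own integration:

```python
# Illustrative monitoring sketch: re-run a fixed prompt set on a schedule
# and flag responses containing watchlist terms.
import re

WATCHLIST = re.compile(r"\b(goblin\w*|gremlin\w*|troll\w*)\b", re.IGNORECASE)

TEST_PROMPTS = [
    "Summarize our refund policy in two sentences.",
    "Draft a polite reply to a delayed-shipment complaint.",
]

def call_model(prompt: str) -> str:
    """Placeholder: route this to your own ChatGPT API integration."""
    raise NotImplementedError

def audit(prompts: list[str]) -> list[tuple[str, list[str]]]:
    findings = []
    for prompt in prompts:
        reply = call_model(prompt)
        hits = WATCHLIST.findall(reply)
        if hits:
            findings.append((prompt, hits))
    return findings

# Run daily; a rising hit count over time signals style drift worth escalating.
```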
