Irrationality in AI: Why GPT-5.5 and DeepSeek-V4 Fail, Even When They Know the Correct Answer

June 23, 2026 Daniel Cesak

Even though language models like GPT-5.5 or DeepSeek-V4 "know" the correct answer, in the sense that it appears among their generated candidates, they often fail to select it. Researchers from the University of Edinburgh have described this phenomenon mathematically as rational value risk and shown that it is a standalone problem that alignment alone will not solve. In other words, irrationality in AI reasoning exists independently of how well the model is aligned with human values.

What is rational value risk and why it matters

Researchers Kejiang Qian and Fengxiang He from the University of Edinburgh, in their study In LLM Reasoning, there is Irrationality on top of Value Misalignment (published on May 26, 2026), introduce the concept of rational value risk (RVR) — a metric that measures the difference between how well a model could answer (if it selected the best of its generated responses) and how it actually answers.

This is a crucial distinction. Until now, most attention has been paid to value alignment — that is, ensuring the model "wants" to give correct and safe answers at all. Methods like RLHF (reinforcement learning from human feedback), DPO (direct preference optimization), or RLVR (reinforcement learning with verifiable rewards) focus on ensuring the model internalizes correct values during training. However, the Edinburgh study shows that even a perfectly aligned model can fail at the moment of reasoning — it simply doesn't choose the best answer it itself generated.

How scientists measured irrationality

The authors tested models across sizes and vendors: from the open-source families Llama-3.1, Qwen2.5, and Tülu-3 (7–72 billion parameters) to commercial leaders including GPT-5.2, GPT-5.5, and DeepSeek-V4. Testing covered six benchmarks, ranging from conversational tasks (UltraFeedback, AlpacaEval) through mathematical reasoning (GSM8K, MATH, MathArena) to code generation (HumanEval).

The measurement principle is elegant: for each prompt, K = 64 candidate answers are generated (at a temperature of 1.0). The rational answer is the one with the highest utility value — for mathematical tasks, it's simply the correct answer; for conversational tasks, it's the one preferred by the verifier. RVR then measures how much worse the actually deployed strategy is than this "oracle" choice.
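The measurement principle above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name rational_value_risk, the input format, and the toy data are assumptions, and the study's formal definition may differ in detail (for example, in how it averages over the sampling policy).

```python
from statistics import mean

def rational_value_risk(candidate_utilities, deployed_utilities):
    """Mean gap between the best candidate's utility and the deployed answer's utility.

    candidate_utilities: one list of K utilities per prompt (hypothetical input format).
    deployed_utilities: utility of the answer the pipeline actually returned, per prompt.
    """
    if len(candidate_utilities) != len(deployed_utilities):
        raise ValueError("need one deployed utility per prompt")
    # Oracle choice = highest-utility candidate; the gap is what the pipeline left on the table.
    gaps = [max(cands) - deployed
            for cands, deployed in zip(candidate_utilities, deployed_utilities)]
    return mean(gaps)

# Toy data: 3 prompts with 4 candidates each (the study uses K = 64).
# Utility 1.0 means a correct answer, 0.0 a wrong one.
candidates = [
    [0.0, 1.0, 0.0, 1.0],  # a correct answer was generated
    [0.0, 0.0, 0.0, 0.0],  # no candidate is correct
    [1.0, 1.0, 0.0, 1.0],  # a correct answer was generated
]
deployed = [0.0, 0.0, 1.0]  # first prompt: a wrong answer was returned anyway

rvr = rational_value_risk(candidates, deployed)
print(round(rvr, 3))
```

In this toy case only the first prompt contributes to the gap: the second prompt has no correct candidate at all, so even the oracle choice scores zero there.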

Four hypotheses confirmed by experiments

H1: Rational value risk is ubiquitous

Across all models and benchmarks, the researchers found that RVR is systematically greater than zero. For smaller models (7–8B parameters), it reaches values of 0.30–0.49 on conversational tasks and 0.08–0.48 on mathematical tasks. Even the largest open-source model tested, Qwen2.5-72B-Instruct, exhibits an RVR of 0.01–0.20 depending on the benchmark.

Translated into plain language: models consistently leave better answers on the table. They generate them but don't select them.

H2: Alignment reduces, but does not eliminate, irrationality

One of the most interesting findings comes from comparing Tülu-3 models in three training phases: SFT → DPO → RLVR. On the GSM8K benchmark, RVR dropped from 0.40 (SFT) to 0.13 (DPO) to 0.12 (RLVR). On AlpacaEval, from 0.49 to 0.09. Alignment thus helps dramatically — but never reaches zero.

In other words: even after the most advanced alignment, "residual irrationality" remains in the model, which training methods do not remove.
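To put the reported drops in proportion, the relative reduction can be computed directly from the figures quoted above. The helper name relative_reduction is only illustrative, and the AlpacaEval pair is taken as the first and last values mentioned, since the intermediate stage is not given here.

```python
def relative_reduction(before, after):
    """Fraction of the starting RVR that was removed."""
    return (before - after) / before

# RVR values for Tülu-3 as quoted in the text.
gsm8k = {"SFT": 0.40, "DPO": 0.13, "RLVR": 0.12}
alpaca_first, alpaca_last = 0.49, 0.09

sft_to_dpo = relative_reduction(gsm8k["SFT"], gsm8k["DPO"])
sft_to_rlvr = relative_reduction(gsm8k["SFT"], gsm8k["RLVR"])
alpaca_total = relative_reduction(alpaca_first, alpaca_last)

print(f"GSM8K SFT -> DPO:   {sft_to_dpo:.1%}")
print(f"GSM8K SFT -> RLVR:  {sft_to_rlvr:.1%}")
print(f"AlpacaEval overall: {alpaca_total:.1%}")
```

Most of the GSM8K improvement arrives with the DPO stage; the step from DPO to RLVR removes only a small additional share, and a residual RVR remains in every case.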

H3: Irrationality is extremely sensitive to inference strategy

The same frozen model shows different degrees of rationality depending on sampling temperature, the use of self-consistency (majority voting), and other inference parameters. This means that rationality is not just a property of the model; it is a property of the entire inference pipeline. For companies deploying LLMs into production, the practical lesson is that optimizing the inference strategy can be as important as choosing the model.
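Self-consistency, mentioned above, is simple to illustrate. The sketch below assumes the final answers have already been extracted from each sampled response as strings; the function name and the sample data are hypothetical.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most frequent final answer among sampled candidates (majority vote)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common orders by count; answers with equal counts keep first-seen order.
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers extracted from 5 samples of the same prompt.
samples = ["42", "41", "42", "42", "17"]

single_sample = samples[1]          # a single draw can land on a wrong answer
voted = self_consistency(samples)   # the vote recovers the majority answer
print(single_sample, voted)
```

The model weights are identical in both cases; only the selection rule applied to the samples changes, which is why the study treats rationality as a property of the whole pipeline.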

H4: Longer reasoning helps, but with diminishing returns

Allowing more tokens per response improves rationality: the model has more "time to think." However, the effect tapers off beyond a certain length. This matches the experience of many developers that endlessly extending the reasoning chain does not bring a proportional improvement.

MathArena: Where even GPT-5.5 fails

The MathArena benchmark deserves special attention — a set of difficult mathematical problems that were published only after the tested models were trained. This is therefore a pure test of generalization without the risk of training data contamination.

On MathArena, scientists decomposed the total utility loss into two components: misalignment (the model does not generate the correct answer at all) and irrationality (the model generates it but does not select it). The results are remarkable:

GPT-5.2: 70.6% of utility loss is due to irrationality, not misalignment
DeepSeek-V4-Flash: 63.7% of loss is irrationality
GPT-5.5: 57.5% of loss is irrationality
Qwen2.5-72B: 38.8% of loss is irrationality

These numbers are key: they show that for the most powerful models, the main problem is no longer that they don't know the answer — but that they can't choose it.
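The split between misalignment and irrationality described above can be approximated with a simple counting rule. This is a sketch under the assumption of binary correctness; the function name decompose_loss and the toy data are illustrative, and the study's exact decomposition may be defined differently.

```python
def decompose_loss(candidate_correct, deployed_correct):
    """Split lost problems into misalignment and irrationality shares.

    candidate_correct: per problem, a list of booleans (was each candidate correct?).
    deployed_correct: per problem, whether the returned answer was correct.
    """
    misalignment = 0   # lost problems where no candidate was correct
    irrationality = 0  # lost problems where a correct candidate existed
    for cands, deployed in zip(candidate_correct, deployed_correct):
        if deployed:
            continue  # no loss on this problem
        if any(cands):
            irrationality += 1
        else:
            misalignment += 1
    total = misalignment + irrationality
    if total == 0:
        return {"misalignment": 0.0, "irrationality": 0.0}
    return {"misalignment": misalignment / total,
            "irrationality": irrationality / total}

# Toy data: 5 problems, 2 candidates each.
candidate_correct = [
    [True, True],    # solved
    [False, True],   # correct candidate existed but was not returned
    [False, False],  # never generated a correct answer
    [True, False],   # correct candidate existed but was not returned
    [False, True],   # correct candidate existed but was not returned
]
deployed_correct = [True, False, False, False, False]

shares = decompose_loss(candidate_correct, deployed_correct)
print(shares)
```

In this toy example three of the four lost problems had a correct candidate available, so irrationality accounts for 75% of the loss and misalignment for 25%.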

What this means in practice

For developers and companies currently deploying large language models — whether Czech startups using the ChatGPT API or corporations hosting their own Llama or Qwen instances — the research provides concrete lessons:

Alignment alone is not enough. Even if you use the best-aligned model on the market, you still need mechanisms at the inference level to ensure the model selects the best answer. Techniques such as self-consistency (generating multiple answers and selecting the majority answer), best-of-N sampling, or verifier-guided decoding thus become an indispensable part of production deployment.

Larger models are more rational. The data show that RVR decreases with increasing model size: Qwen2.5-72B has significantly lower RVR on most benchmarks than its 7B variant. This is an argument for deploying larger models where reasoning quality matters most, for example in legal, medical, or financial applications.

European context. The research from the University of Edinburgh gives Europe a foothold in a debate dominated by American and Chinese laboratories. Furthermore, the code is open-source on GitHub, so Czech researchers and companies can also use it to test their own models. For the Czech environment, where AI is only now gaining wider awareness, this is a valuable contribution to understanding the limits of current systems.
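As an illustration of the inference-level mechanisms named in the first lesson above, here is a minimal best-of-N sketch. The verifier is a toy stand-in that checks a simple arithmetic claim; in practice it would be a reward model, a test suite, or another scoring function. All names and data are hypothetical.

```python
def best_of_n(candidates, score):
    """Return the candidate with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)

def toy_verifier(answer):
    # Hypothetical verifier: checks an arithmetic claim of the form "a+b=c".
    left, right = answer.split("=")
    a, b = left.split("+")
    return 1.0 if int(a) + int(b) == int(right) else 0.0

# Three sampled answers to the same prompt; only one is correct.
candidates = ["17+25=32", "17+25=42", "17+25=41"]

chosen = best_of_n(candidates, toy_verifier)
print(chosen)
```

The quality of the selection depends entirely on the scoring function: with a perfect verifier, best-of-N returns the oracle choice whenever a correct candidate exists, while a weak verifier can reintroduce the very gap it is meant to close.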

What is the difference between misalignment and irrationality in AI models?

Misalignment means that the model did not acquire the correct "values" during training — its internal value function is not in line with what we expect from it. Irrationality, on the other hand, is a failure in the reasoning itself: the model has the correct answer available (it generated it among its candidates), but for some reason, it does not select it as the final output. The Edinburgh study shows that even a perfectly aligned model can suffer from irrationality.

Can the irrationality of AI models be completely eliminated?

According to current knowledge, not yet. The researchers showed that advanced alignment methods like DPO and RLVR significantly reduce rational value risk (for example, from 0.40 to 0.12 for Tülu-3 on GSM8K), but never bring it to zero. At the same time, techniques at the inference level, such as self-consistency or best-of-N sampling, can suppress irrationality further. Reducing irrationality is therefore a matter of continuous optimization rather than a problem that is either solved or unsolved.

Does this problem also affect smaller models running locally?

Yes, and for smaller models, RVR is typically even higher. The research also tested 7–8B models like Llama-3.1-8B and Qwen2.5-7B, where irrationality was most pronounced. For users running AI locally on their own hardware — for example, models from the Llama or Qwen family on consumer GPUs — this means they should pay special attention to the inference strategy, as smaller models "discard" correct answers more often than their larger siblings.