HARC: Microsoft Reveals Why AI Models Fail Against Jailbreaks — And How to Fix It

Researchers from Microsoft and Tsinghua University have traced how safety mechanisms operate inside large language models and why jailbreak attacks can bypass them. Their new HARC method cut jailbreak attack success rates by a factor of nearly five on the models tested, without reducing general capabilities on standard benchmarks. What does this mean for safer ChatGPT, Claude, or Gemini, and why does it matter for businesses deploying AI?

What happens inside an AI when someone tries to "trick" it

When you ask a language model a dangerous question — like instructions for making a weapon — the model usually politely refuses. But "jailbreak" attacks, like the famous DAN (Do Anything Now) or the more sophisticated PAIR, manage to break through this defense. How? The answer lies directly inside the neural network.

A research team led by Shei Pern Chua from Microsoft and Tsinghua University describes the mechanism in a new paper, "HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment" (published July 1, 2026). Inside the so-called residual stream, the running internal representation that every layer of a language model reads from and writes to, the team found two separate signal pathways (directions in activation space): one for recognizing harmfulness and another for deciding to refuse.

Think of it like two dashboard warning lights. The model knows the query is dangerous (the "harmfulness" light is on), but the "refusal" light may not turn on — and that's exactly the moment jailbreaks exploit.
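The two "warning lights" can be pictured as two directions in activation space. A common way interpretability work estimates such a direction is the difference between mean activations of two groups of prompts. The sketch below assumes that technique (the article does not say how the paper extracts its directions) and uses synthetic vectors in place of real residual-stream activations. All names and numbers are hypothetical.

```python
import numpy as np

def mean_diff_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

rng = np.random.default_rng(0)
dim = 16  # toy hidden size; real models use thousands of dimensions

# Ground-truth axes used only to generate the synthetic data (assumption).
true_harm = np.eye(dim)[0]
true_refuse = np.eye(dim)[1]

# Synthetic stand-ins for activations collected from four prompt groups.
harmless = rng.normal(0, 0.1, size=(200, dim))
harmful = rng.normal(0, 0.1, size=(200, dim)) + 3.0 * true_harm
complied = rng.normal(0, 0.1, size=(200, dim))
refused = rng.normal(0, 0.1, size=(200, dim)) + 3.0 * true_refuse

harm_dir = mean_diff_direction(harmful, harmless)
refuse_dir = mean_diff_direction(refused, complied)

# The situation described above: the harmfulness light is on, refusal is off.
jailbreak_act = 3.0 * true_harm
harm_score = float(jailbreak_act @ harm_dir)      # projection on harmfulness
refuse_score = float(jailbreak_act @ refuse_dir)  # projection on refusal
```

Here `harm_score` comes out large while `refuse_score` stays near zero, which is the "one light on, one light off" state the analogy describes.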

Three attack types, three ways to break through

The research analyzed three main categories of jailbreak attacks and found that each suppresses safety signals in a different way:

  • DAN (persona framing) — suppresses the refusal signal, but the harmfulness signal remains active. The model "knows" it's doing something wrong but doesn't refuse.
  • PAIR (semantic rewriting) — conversely activates refusal, but pushes the harmfulness signal into negative values. The model thinks it's a safe query, so it has no reason to refuse.
  • CodeAttack (code obfuscation) — suppresses both signals simultaneously, making the harmful query look almost innocent.
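The three patterns in the list above can be read as sign combinations of two projection scores. The helper below is only an illustrative sketch: the function name, the zero threshold, and the example scores are assumptions for illustration, not values or code from the paper.

```python
def describe_signals(harm_score, refuse_score, threshold=0.0):
    """Label which of the two safety signals is active for one prompt."""
    harm_on = harm_score > threshold
    refuse_on = refuse_score > threshold
    if harm_on and refuse_on:
        return "both signals active"
    if harm_on:
        return "harmfulness active, refusal suppressed"
    if refuse_on:
        return "refusal active, harmfulness suppressed"
    return "both signals suppressed"

# Made-up scores that mirror the three descriptions in the list above.
dan_like = describe_signals(harm_score=2.1, refuse_score=-0.4)
pair_like = describe_signals(harm_score=-1.3, refuse_score=0.8)
code_like = describe_signals(harm_score=-0.2, refuse_score=-0.3)
```

A real analysis would compare each score against a baseline measured on plain harmful and harmless prompts rather than against a fixed zero threshold.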

But the most interesting part of the discovery is the second half: the researchers extended the analysis to the response position — the moment when the model is already generating text. They found that the model recognizes it's generating harmful content even when it failed to detect it during input processing. In other words: the AI realizes what it's writing while it's writing it, but it's too late — the response generation is already underway and cannot be stopped.

HARC: Fixing the cause, not the symptoms

Based on these findings, the team developed the HARC (Harmfulness-And-Refusal Coupling) method. It's a fine-tuning approach using LoRA that couples both signals together — on both the prompt side and the response side.
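For readers unfamiliar with LoRA, the following NumPy sketch shows the general idea of a low-rank update: the pretrained weight stays frozen and only two small matrices are trained. It illustrates standard LoRA, not HARC's training code, and the dimensions, rank, and scaling values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 64, 2, 4.0

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init

def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
y_before = lora_forward(x, W, A, B, alpha, rank)  # B is zero, so this equals W @ x

# Stand-in for the result of training: B becomes non-zero.
B_trained = rng.normal(scale=0.1, size=(d_out, rank))
delta_W = (alpha / rank) * (B_trained @ A)        # effective weight change

trainable_params = A.size + B.size
full_params = W.size
```

Because `delta_W` has rank at most `rank`, the adapter changes far fewer parameters than full fine-tuning (`trainable_params` versus `full_params`), which is one reason LoRA-based safety training is cheap to apply to open-weight models.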

The core idea is elegant in its simplicity: when the model learns that activation of the harmfulness signal automatically triggers the refusal signal, an attacker cannot suppress one without the other. They would have to suppress both simultaneously — which is significantly harder.
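One way to picture such a coupling objective is a penalty that grows whenever the harmfulness score is high but the refusal score falls short. The function below is a hypothetical hinge-style sketch written for this article. It is not the loss used in the paper, and the margin and example scores are assumptions.

```python
import numpy as np

def coupling_penalty(harm_scores, refuse_scores, margin=1.0):
    """Penalise prompts where harmfulness fires but refusal does not follow."""
    harm_on = np.maximum(harm_scores, 0.0)                  # strength of harmfulness
    refusal_gap = np.maximum(margin - refuse_scores, 0.0)   # shortfall of refusal
    return float(np.mean(harm_on * refusal_gap))

# Two harmful prompts and one harmless prompt (made-up projection scores).
harm = np.array([2.0, 2.0, -1.0])
refuse_decoupled = np.array([-0.5, 0.2, -0.5])  # refusal fails on harmful prompts
refuse_coupled = np.array([1.5, 2.0, -0.5])     # refusal tracks harmfulness

penalty_decoupled = coupling_penalty(harm, refuse_decoupled)
penalty_coupled = coupling_penalty(harm, refuse_coupled)
```

The harmless prompt contributes nothing to either penalty because its harmfulness score is negative, so an objective of this shape pushes the model toward refusing harmful inputs without rewarding refusal of harmless ones.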

A key advantage of HARC is that it only intervenes in a two-dimensional subspace (spanned by the harmfulness and refusal directions) and leaves the rest of the model's internal representations untouched. This limits degradation of general capabilities: in the reported results, the model doesn't become overly cautious and doesn't refuse harmless queries.
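The subspace claim can be checked numerically on toy data. The sketch below builds an orthonormal basis for the plane spanned by two directions, splits an activation into its in-plane and out-of-plane parts, and confirms that an edit along one of the directions leaves the out-of-plane part unchanged. The vectors are random stand-ins and the helper names are hypothetical.

```python
import numpy as np

def subspace_basis(harm_dir, refuse_dir):
    """Orthonormal basis (2 x dim) for the span of the two directions, via QR."""
    q, _ = np.linalg.qr(np.stack([harm_dir, refuse_dir], axis=1))
    return q.T

def split(activation, basis):
    """Return (component inside the subspace, component orthogonal to it)."""
    inside = basis.T @ (basis @ activation)
    return inside, activation - inside

rng = np.random.default_rng(1)
dim = 32
harm_dir = rng.normal(size=dim)     # stand-in for a measured direction
refuse_dir = rng.normal(size=dim)   # stand-in for a measured direction
basis = subspace_basis(harm_dir, refuse_dir)

act = rng.normal(size=dim)
inside, outside = split(act, basis)

# Hypothetical intervention: push the activation along the refusal direction.
edited = act + 2.0 * refuse_dir / np.linalg.norm(refuse_dir)
edited_inside, edited_outside = split(edited, basis)
```

Only `edited_inside` differs from `inside`. The remaining 30 dimensions of the toy activation, which stand in for everything else the model represents, are identical before and after the edit.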

Numbers that speak for themselves

In extensive experiments, HARC achieved the following results:

  • Attack Success Rate (ASR) reduction — jailbreak success dropped 4.67× for Llama-3.1-8B and 4.75× for Qwen-2.5-7B compared to the base model.
  • Capability preservation — the HARC-modified model achieves the same results across five standard general capability benchmarks as the original model.
  • No excessive caution — over-refusal (rejecting harmless queries) was actually lower than in the base model.
  • Cross-model transferability — the method works without architecture-specific tuning across 5 different model families and two sizes (7–8B and 70–72B parameters).
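To make the reduction factors concrete: a drop of 4.67× means the defended ASR is the baseline ASR divided by 4.67, which is roughly a 79% relative reduction. The article does not give the underlying ASR values, so the 40% baseline in this sketch is purely hypothetical.

```python
def reduction_factor(asr_base, asr_defended):
    """How many times lower the defended attack success rate is."""
    if asr_defended <= 0:
        raise ValueError("defended ASR must be positive to form a ratio")
    return asr_base / asr_defended

def defended_asr(asr_base, factor):
    """Attack success rate after a reduction by the given factor."""
    return asr_base / factor

def relative_reduction(factor):
    """Fraction of successful attacks removed, e.g. 0.5 for a 2x drop."""
    return 1.0 - 1.0 / factor

# Hypothetical 40% baseline ASR (assumption: not a number from the paper).
baseline = 0.40
llama_after = defended_asr(baseline, 4.67)
qwen_after = defended_asr(baseline, 4.75)

llama_relative = relative_reduction(4.67)
qwen_relative = relative_reduction(4.75)
```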

Compared to six competing methods — including Circuit Breakers, RepBend, CAST, DPO, and vanilla SFT — HARC achieved the best trade-off between safety, capabilities, and usability.

HARC vs. existing safety approaches

Existing methods for protecting models against jailbreaks fall into two camps. Training-time methods (like DPO or safety SFT) retrain the entire model, which often damages its general capabilities, the so-called "alignment tax." Inference-time methods (like activation steering) work at runtime but typically increase the rate of false refusals.

HARC sits somewhere in between. It uses mechanistic understanding of the model's internal representations for targeted intervention, but it's still a training method — so the changes are permanent and don't require runtime overhead.

The research also revealed an interesting limitation: CodeAttack remains partially resistant even to HARC, because its representation in the residual stream is nearly orthogonal to both harmfulness and refusal directions. The model simply "doesn't see" anything to apply coupling to. This suggests that future safety methods will need to combine HARC with additional approaches for complete protection.
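"Nearly orthogonal" has a simple numerical meaning: the cosine similarity between the attack's shift in activation space and each safety direction is close to zero. The sketch below uses made-up vectors to contrast a shift that moves along the refusal direction with one that moves almost entirely elsewhere. None of the numbers are measurements from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

dim = 8
axes = np.eye(dim)
harm_dir = axes[0]     # stand-in for the harmfulness direction
refuse_dir = axes[1]   # stand-in for the refusal direction

# Hypothetical attack shifts in activation space.
aligned_shift = -2.0 * refuse_dir + 0.1 * axes[5]    # mostly along refusal
orthogonal_shift = 0.05 * harm_dir + 2.0 * axes[6]   # mostly along an unrelated axis

aligned_vs_refuse = cosine(aligned_shift, refuse_dir)
orthogonal_vs_harm = cosine(orthogonal_shift, harm_dir)
orthogonal_vs_refuse = cosine(orthogonal_shift, refuse_dir)
```

A coupling defined only on the harmfulness and refusal directions has a clear handle on the first shift, whose cosine with the refusal direction is close to -1, but almost none on the second, whose cosines with both directions are close to zero.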

Why this matters for the European context

The EU AI Act, which is taking full effect in 2026, requires robust safety measures from AI system operators. European companies deploying or developing language models — from banking chatbots to legal assistants to internal enterprise AI — need assurance that their systems can withstand abuse attempts.

HARC is open-source (the code is available on Microsoft's GitHub) and its LoRA-based implementation means that the safety modification can be applied to open-source models like Llama or Qwen, which are popular in Europe for their language flexibility.

The method also works across different architectures — from Llama 3.1 to Qwen, Mistral, Gemma, and Phi. For European developers, this means they can apply a safety layer uniformly regardless of which model they're using.

What this means for everyday users

For end users of ChatGPT, Claude, or Gemini, HARC is good news. Every jailbreak represents a risk that the model will generate dangerous content — from cyberattack instructions to disinformation. The more robust the safety mechanism, the smaller the chance this will happen.

At the same time, HARC reveals a disturbing reality: models "realize" they're doing something wrong, yet they still do it. This phenomenon — where the model recognizes harmfulness at the moment it's already generating a response — is a mechanistic explanation for why jailbreaks are so hard to stop. The model isn't "stupid"; it just has internally disconnected safety circuits.

How is HARC different from the Circuit Breakers method?

Circuit Breakers fine-tunes the model so that internal representations associated with harmful output are rerouted, which interrupts generation once it turns dangerous. HARC is also applied at training time, but it targets a different mechanism: it fine-tunes the model with LoRA so that the harmfulness and refusal signals are permanently linked. HARC requires no runtime overhead after deployment, and in the reported comparison it better preserves the model's general capabilities.

Does HARC work on commercial models like ChatGPT or Claude?

HARC is designed for open-source models where you have access to internal representations (residual stream). With commercial APIs like ChatGPT or Gemini, you don't have access to model weights or activations, so HARC cannot be directly applied. However, the principles HARC is built on — coupling harmfulness and refusal signals — can be implemented by providers like OpenAI or Anthropic in their own training pipelines.

Can HARC be applied to non-English language models?

In principle, yes. HARC operates at the level of the model's internal representations and doesn't depend on a specific language. The method was validated across five model families without architecture-specific tuning, although that result concerns architectures rather than languages. If you have access to the weights of an open-source model (e.g., Llama 3.1, which supports multiple languages, or Qwen 2.5), you can apply HARC. The only requirement is the ability to extract residual activations during passes through training data.
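What "extracting residual activations" involves can be shown on a toy model. The class below is a hypothetical stand-in for a transformer in which each layer adds its output to a running stream, and the forward pass optionally records that stream after every layer. With a real open-weight model one would typically use the framework's own mechanism instead, such as PyTorch forward hooks or the `output_hidden_states` option in Hugging Face Transformers.

```python
import numpy as np

class ToyResidualModel:
    """Stand-in for a transformer: each layer adds its output to a running stream."""

    def __init__(self, n_layers, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(n_layers)]

    def forward(self, x, cache=None):
        stream = x
        for i, w in enumerate(self.weights):
            stream = stream + np.tanh(stream @ w)  # residual update
            if cache is not None:
                cache[i] = stream.copy()           # record the stream after layer i
        return stream

model = ToyResidualModel(n_layers=4, dim=16)
batch = np.random.default_rng(1).normal(size=(3, 16))  # three toy "prompt" vectors

cache = {}
output = model.forward(batch, cache)
layer2_acts = cache[2]  # activations one could project onto a safety direction
```

The cached arrays play the role of the per-layer activations from which harmfulness and refusal directions would be estimated, regardless of the language of the prompts.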
