AI in medicine: Why do large language models still fail in clinical reasoning?

AI in medicine promises immense possibilities, from faster diagnostics to personalized treatment. However, a new scientific study that analyzed 21 of the most powerful large language models (LLMs) delivers a sobering reality check: these systems still lack the capacity for genuine clinical reasoning. Even though they can retrieve facts, they fail at the logical steps required for complex medical decision-making.

The idea that artificial intelligence will take on the role of an assistant, or even a diagnostician, has circulated in technology circles for several years. With the arrival of models such as GPT-4 and Claude 3.5, the boundary between human intelligence and machine analysis seemed to be blurring. However, as a report published on Medical Xpress shows, there is a vast difference between "knowing" and "reasoning."

The Discrepancy Between Knowledge and Logic: What is Clinical Reasoning?

To understand why the models fail, we must first define what clinical reasoning is. It is not merely a matter of searching a database to find that symptoms A, B, and C often indicate disease X. A real doctor must integrate the patient's medical history, consider their age, co-existing conditions, and current medications, and, in cases of uncertainty, perform deductive steps that rule out the most dangerous possibilities first.

Large language models, such as those examined in the scientific work (e.g., published on PubMed), operate by predicting the most probable next word in context. They are masters of pattern recognition but lack a deep understanding of causality, the knowledge that A actually causes B. In medicine, where a single faulty logical inference can have fatal consequences, this difference is critical.
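
To make the difference concrete, here is a minimal, purely illustrative sketch of what "predicting the next word" means in practice. It assumes the Hugging Face transformers library and the small, publicly available GPT-2 model, and the prompt is an invented example: the model only ranks likely continuations of the text, and nothing in this computation models why a symptom leads to a diagnosis.

```python
# Minimal sketch: next-token prediction, the core operation of an LLM.
# Assumes the Hugging Face `transformers` library and the public GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The patient has fever, cough and chest pain, so the most likely diagnosis is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token only

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)                 # the five most probable continuations

for token_id, p in zip(top.indices, top.values):
    print(f"{tokenizer.decode(token_id)!r}: {p:.3f}")
```

The output is simply a ranked list of plausible next words; at no point does the model check whether the implied causal chain actually holds.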

Comparing the Top Performers: How do GPT, Claude, and Gemini Fare?

Looking at the current market leaders, we see that while models are constantly improving, their results in medical benchmarks are highly variable.

  • OpenAI GPT-4o: Currently one of the most widely used models. It excels at synthesizing information and can answer fact-based questions very accurately. Its weakness, however, is a tendency towards so-called hallucinations, where the model confidently invents a fact that does not exist in reality, which is unacceptable in medicine.
  • Anthropic Claude 3.5 Sonnet: This model is often praised for its ability to handle finer nuances and better follow instructions. In tests, it shows a higher degree of logical consistency than GPT but still does not reach the level of complex reasoning required in clinical cases.
  • Google Gemini 1.5 Pro: Thanks to its enormous context window, Gemini can process an entire patient record at once, which is a huge advantage. Nevertheless, it turns out that even when processing a large amount of data, the model cannot reliably connect seemingly unrelated clinical indicators into a logical whole.

For comparison, most of these models offer a free tier (a free version with limited features) and a paid subscription for professionals at around 20 USD per month (approx. 460 CZK). For businesses, there are pay-per-use API versions (billed by tokens) that allow integration into hospital systems, as sketched below.
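
As an illustration only, the following sketch shows what such a pay-per-use integration could look like. It assumes the official openai Python SDK and the gpt-4o model; the prompt and the placeholder report text are invented for this example, and no real, non-anonymized patient data should ever be sent to an external API without GDPR and AI Act compliance.

```python
# Minimal sketch, not a production integration: drafting a summary of a
# discharge report via a pay-per-use LLM API. Assumes the official `openai`
# Python SDK (pip install openai) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder text, anonymized; never send identifiable patient data.
report_text = "ANONYMIZED discharge report text goes here..."

response = client.chat.completions.create(
    model="gpt-4o",  # any comparable model with an API works the same way
    messages=[
        {
            "role": "system",
            "content": (
                "You are an administrative assistant. Summarize the report "
                "for internal documentation. Do not add diagnoses of your own."
            ),
        },
        {"role": "user", "content": report_text},
    ],
    temperature=0,  # lower temperature reduces invented details
)

print(response.choices[0].message.content)  # a draft the doctor must still review
```

The division of labor stays the same as in the rest of this article: the model drafts, the clinician verifies and remains responsible for the result.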

Practical Impact: What Does This Mean for Czech Doctors and Patients?

For the Czech healthcare scene, this result has two main aspects. The first is safety. If a doctor in a Czech hospital starts using AI as a tool for rapid text analysis, they must be aware that the model may evaluate the symptoms correctly yet completely fail to grasp the interactions between the patient's medications.

The second aspect is regulation. Within the European Union, the AI Act is coming into force, which classifies systems used in medicine as high-risk. This means that developers face extremely strict requirements for transparency, accuracy, and human oversight. For Czech companies developing medical software, this means they cannot simply "connect" ChatGPT to a diagnostic tool without thorough verification and certification.

Availability in the Czech Republic: All the aforementioned models (GPT, Claude, Gemini) are fully available in the Czech Republic and handle Czech very well. This is good news for administrative assistance (writing reports, summarizing medical records) but a warning signal for direct diagnosis.

Conclusion: AI as an Assistant, Not a Replacement

The study of 21 models clearly tells us that AI is not a "doctor in a box." However, it is an incredibly powerful tool for data processing, information organization, and administrative relief. The key to success is not replacing human intuition and reasoning with a machine, but creating a symbiosis where AI prepares the groundwork and the doctor performs the most important part – critical reasoning.

Can I use ChatGPT or Claude for my own medical diagnosis?

You should never rely solely on AI for diagnosing your symptoms. Models can hallucinate (invent facts) and lack the ability for true clinical reasoning. Always consult a qualified doctor.

What are the biggest risks of using AI in hospitals?

The main risks are incorrect data interpretation (misdiagnosis) caused by the model's lack of logical reasoning and the risk of sensitive patient data leakage if the system is not fully compliant with GDPR and the EU AI Act regulations.

Is AI in medicine legally permitted in the Czech Republic?

AI tools can be used in the Czech Republic, but their use for diagnosis is subject to strict rules for medical devices and the new European AI Act. Software must meet stringent certification standards to be officially used for treatment.