Can AI truly help us in medicine? Testing the reliability of models from diagnostics to statistics

July 1, 2026 jarvis

    In medicine and scientific research, there is no room for error. While an AI hallucination in an everyday text might mean nothing worse than an embarrassing mistake, in neurological diagnostics or in the choice of a statistical test it can have serious, even fatal, consequences. Current studies from 2025 and 2026 focus on whether we can truly rely on large language models (LLMs) when patients' lives and the validity of scientific knowledge are at stake.

Looking at the current developments in artificial intelligence, we see a fascinating shift. AI is no longer just a tool for writing emails or generating images; it is becoming a sophisticated assistant in the most demanding professions. However, as recent research shows, the path to full deployment of AI in medicine is still full of obstacles, especially in the area of long-term planning and complex statistical reasoning.

Neurological Diagnostics: Which Model Leads the Race?

One of the studies published in Cureus focused on an extremely challenging medical scenario: Guillain-Barré syndrome (GBS) accompanied by spinal epidural lipomatosis (SEL). The test aimed to assess AI's ability not only to diagnose correctly but also to propose a comprehensive treatment plan.

Three leading models were included in the test: ChatGPT, Google Gemini, and Claude 3.5 Sonnet. The ranking was clear, although the margins were small. In an evaluation conducted by four certified physicians, Claude 3.5 Sonnet achieved the highest score, 18.5 out of 20 points. ChatGPT followed with 17.5 points, and Gemini finished in third place with 17.25 points.
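To put the gaps between the models in perspective, the reported scores can be converted to percentages of the 20-point maximum. The snippet below is only an illustrative calculation on the three figures quoted above; the variable and function names are made up for this example, and it says nothing about how the physicians arrived at the scores.

```python
# Scores as reported in the article (out of a maximum of 20 points).
MAX_SCORE = 20
scores = {
    "Claude 3.5 Sonnet": 18.5,
    "ChatGPT": 17.5,
    "Google Gemini": 17.25,
}


def as_percent(score, max_score=MAX_SCORE):
    """Express a score as a percentage of the maximum."""
    return 100 * score / max_score


percentages = {model: as_percent(score) for model, score in scores.items()}

# Model with the highest score, and the spread between first and last place.
best = max(scores, key=scores.get)
gap_points = max(scores.values()) - min(scores.values())

for model, pct in percentages.items():
    print(f"{model}: {pct:.2f}%")
print(f"Top scorer: {best}, gap between first and last: {gap_points} points")
```

All three models land between roughly 86% and 93%, and first and last place are separated by 1.25 points on a single clinical case.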

An interesting finding is that while all models excelled in the diagnosis itself (correctly identifying GBS) and in proposing immediate treatment (such as IVIG or plasmapheresis), they fell short in follow-up planning. In other words, AI can tell us what to do now, but it struggles with detailed planning of long-term rehabilitation and subsequent patient monitoring. For doctors, this means one thing: AI is excellent for quick consultation, but it certainly cannot be left to manage the entire patient care process.

Statistical Accuracy: Can a Scientist Trust LLMs?

Another critical aspect concerns scientific research. For a study's results to be valid, the scientist must use the correct statistical test. If the wrong test is chosen, the conclusions drawn from the research may be invalid. The research paper by Shukla et al. examined the ability of six models (including newer players like DeepSeek and Grok) to select the correct test for various hypotheses.

This research shows that LLMs are very strong at explaining concepts, but their ability to make more complex statistical decisions (e.g., whether to compare means with a parametric test or medians with a non-parametric test when the data are not normally distributed) still requires human oversight. For academia, this means that AI can serve as a great tutor for students, explaining why a t-test is used, but it must not be the final arbiter when reviewing scientific work.
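The kind of decision in question can be illustrated with a deliberately simplified rule of thumb for a continuous outcome. This is a textbook-style sketch, not the procedure used by Shukla et al.; the `StudyDesign` class and `suggest_test` function are hypothetical names introduced here, and real test selection also depends on factors this sketch ignores, such as sample size, equality of variances, and the measurement scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyDesign:
    """Hypothetical, minimal description of a comparison of a continuous outcome."""
    groups: int          # number of groups or conditions being compared
    paired: bool         # True if the same subjects are measured in each condition
    approx_normal: bool  # True if the outcome is roughly normally distributed


def suggest_test(design: StudyDesign) -> str:
    """Return a commonly taught default test for the given design (simplified)."""
    if design.groups < 2:
        raise ValueError("need at least two groups to compare")

    if design.groups == 2:
        if design.paired:
            return "paired t-test" if design.approx_normal else "Wilcoxon signed-rank test"
        return "independent-samples t-test" if design.approx_normal else "Mann-Whitney U test"

    # Three or more groups or conditions.
    if design.paired:
        return "repeated-measures ANOVA" if design.approx_normal else "Friedman test"
    return "one-way ANOVA" if design.approx_normal else "Kruskal-Wallis test"


# Two independent groups with skewed data: a non-parametric test is the usual default.
print(suggest_test(StudyDesign(groups=2, paired=False, approx_normal=False)))
```

Even this small table of rules shows why oversight matters: the answer changes as soon as one assumption about the data changes, and checking those assumptions is exactly the step that has to be verified by a person.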

Comparison of Leading Models in a Medical Context

For readers who want to know which tool to choose for their needs (e.g., for analyzing specialized texts or assisting with research), here is a brief comparison:

Claude 3.5 Sonnet (Anthropic): The top scorer in the Cureus case study described above, strong in nuanced reasoning and medical logic. Excellent for in-depth text analysis.
Price: Free tier available, Claude Pro approx. 20 USD/month.
ChatGPT (OpenAI): A versatile standard with the largest community and a wide range of integrations. Good for quick diagnostic assistance.
Price: Free tier, ChatGPT Plus approx. 20 USD/month.
Google Gemini (Google): Strong due to integration into Google Workspace and the ability to work with vast amounts of data (long context window).
Price: Free tier, Gemini Advanced approx. 20 USD/month.
DeepSeek / Grok: Interesting alternatives for specific technical and mathematical tasks, which are still establishing their position in the market.

Practical Impact: What Does This Mean for Czechia and the EU?

For a Czech doctor, researcher, or medical student (e.g., at Charles University), this report has two main implications:

Availability and Language: All mentioned models are available in the Czech Republic. Although the models learn primarily from English data, their ability to understand Czech medical terminology is high. Extra caution is still required because of nomenclature specific to Czech and Slovak medicine.
Regulation (EU AI Act): Within the European Union, AI systems used in medicine fall into the high-risk category under the new Artificial Intelligence Act (AI Act). This means that developers must meet extremely strict requirements for transparency and safety. For Czech healthcare, this means that "ordinary" chatbots must not be used for clinical decisions unless they are certified as a medical device.

Summary: AI in medicine is not a substitute for a doctor, but an incredibly powerful assistant. It can help you identify patterns in data faster or suggest a diagnosis, but the final responsibility for treatment planning and statistical validity remains with the human.

Can I use ChatGPT for self-diagnosis of diseases?

Never. Research confirms that AI can hallucinate or omit key aspects of long-term care. Always consult a medical professional.

Is Claude 3.5 Sonnet better than ChatGPT for scientific research?

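In the Cureus case study on neurological care, Claude 3.5 Sonnet scored slightly higher than ChatGPT (18.5 versus 17.5 out of 20), which suggests more detailed reasoning and higher accuracy in complex medical cases. That result comes from a single clinical scenario, and for statistical decisions in research every model still requires human oversight.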

What is the relationship between AI and the EU AI Act in Czech healthcare?

The EU AI Act classifies AI in medicine as high-risk. This means that tools used for diagnosis must be strictly regulated and certified to ensure their safety and reliability.
