LLM vs. statistician: How accurately do ChatGPT, Claude, and Gemini select statistical tests in healthcare research?

Can ChatGPT, Claude, or Gemini correctly select a statistical test for scientific research? Two recent studies tested the accuracy of six of the most popular language models on dozens of hypothetical research scenarios from healthcare. The results are surprising: in one of the studies, all models achieved a hundred percent success rate. But beware — the accuracy of test selection and the quality of explanation are two very different things.

Why statistics is every researcher's pain

Healthcare researchers regularly grapple with the question of which statistical test to use for their data and hypotheses. Choosing the wrong test can invalidate an entire study or lead to misleading conclusions. Yet statistical consultation is often expensive, and in many countries it is hard to obtain at all. This is precisely why researchers are increasingly turning to large language models (LLMs) as a quick and cheap alternative.

But the question is: can they be trusted? Two studies published in 2024 and 2025 in renowned scientific journals attempted to answer this question.

Study No. 1: Six models, twenty scenarios, one hundred percent success rate

The more extensive of the two studies, published in October 2025 in the journal Cureus and indexed in PubMed (PMC12627256), tested six current language models: ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat. Each model was given twenty research scenarios typical of clinical and epidemiological research.

The result was surprisingly unambiguous: all models selected the correct statistical test in 100% of cases. Whether it was a paired t-test, one-way ANOVA, Mann-Whitney U-test, Kruskal-Wallis test, chi-square, or Fisher's exact test — none of the models made a mistake.
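
The logic behind choosing among these tests can be written down as a short decision rule. The following is a minimal Python sketch of a textbook-style rule for simple group comparisons. It is not the procedure used in the study, and the function name `choose_test` and its parameters are hypothetical names introduced here for illustration only.

```python
def choose_test(outcome, groups, paired=False, normal=True, small_expected_counts=False):
    """Suggest a conventional test for a simple group comparison.

    outcome: "continuous" or "categorical"
    groups: number of groups compared (2 or more)
    paired: True when the same subjects are measured twice
    normal: True when the continuous outcome is roughly normally distributed
    small_expected_counts: True when a contingency table has small expected counts
    """
    if groups < 2:
        raise ValueError("need at least two groups to compare")
    if outcome == "categorical":
        if paired:
            raise ValueError("paired categorical data are not covered by this sketch")
        # Fisher's exact test is the usual fallback when expected counts are small.
        return "Fisher's exact test" if small_expected_counts else "chi-square test"
    if outcome != "continuous":
        raise ValueError("outcome must be 'continuous' or 'categorical'")
    if groups == 2:
        if paired:
            return "paired t-test" if normal else "Wilcoxon signed-rank test"
        return "independent-samples t-test" if normal else "Mann-Whitney U test"
    if paired:
        raise ValueError("repeated measures with 3+ groups are not covered by this sketch")
    # Three or more independent groups.
    return "one-way ANOVA" if normal else "Kruskal-Wallis test"


print(choose_test("continuous", groups=2, paired=True))    # paired t-test
print(choose_test("continuous", groups=3, normal=False))   # Kruskal-Wallis test
```

A rule like this only covers the simplest designs. Real scenarios often involve covariates, repeated measures, or clustered data, which is where a description in plain language and a careful explanation become more important than the test name itself.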

This might sound like the end of the story. However, the researchers went further and also evaluated the quality of the explanation, not just the correctness of the answer. Five independent biostatistics experts assessed each answer across five dimensions:

  • Clarity
  • Identification of test assumptions
  • Pedagogical value
  • Problem-solving approach
  • Statistical reasoning

And this is where the models began to differ. Claude excelled in clarity, with an average score of 4.65 out of 5.00. Gemini achieved the best rating in pedagogical value, meaning the ability to explain why a given test is suitable and how to interpret it. ChatGPT performed worst in statistical reasoning, even though it was strong in the problem-solving approach itself. DeepSeek, Grok, and Le Chat scored around the average, with no marked strengths or weaknesses.

Study No. 2: Pilot comparison of four models on 27 scenarios

An older pilot study, available in PubMed Central (PMC11584160), examined four models available in 2024: ChatGPT 3.5, Google Bard, Microsoft Bing Chat, and Perplexity. The researchers prepared 27 case vignettes simulating typical situations in healthcare research.

This time, the results were not as clear-cut:

  • Microsoft Bing Chat: 96.3% agreement with expert recommendation, 100% acceptability
  • ChatGPT 3.5 and Perplexity: 85.19% agreement, both with 100% acceptability
  • Google Bard: 77.78% agreement, 96.3% acceptability
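
With only 27 vignettes, each percentage above corresponds to a whole number of scenarios, so a single differing answer shifts the rate by almost four percentage points. The short Python sketch below shows the arithmetic. The counts 26, 23, and 21 are inferred here from the reported percentages and are an assumption, not figures quoted from the paper; the helper name `percent` is likewise illustrative.

```python
N_VIGNETTES = 27  # number of case vignettes reported for the pilot study


def percent(matches, total=N_VIGNETTES):
    """Share of vignettes, as a percentage rounded to two decimals."""
    return round(100 * matches / total, 2)


# Assumed counts, back-calculated from the published percentages.
assumed_counts = {
    "Microsoft Bing Chat": 26,
    "ChatGPT 3.5 / Perplexity": 23,
    "Google Bard": 21,
}

for model, matches in assumed_counts.items():
    print(f"{model}: {matches}/{N_VIGNETTES} = {percent(matches)}%")

# One vignette out of 27 is worth about 3.7 percentage points.
print(percent(1))
```

Seen this way, the gap between the best and the weakest model amounts to roughly five vignettes, which is worth keeping in mind when reading the ranking.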

The overall agreement rate between the models was moderately high (ICC = 0.728). Testing for consistency was also interesting: after seven days, the models received rephrased versions of the same questions. ChatGPT and Perplexity performed consistently, while Bard and Bing Chat fluctuated more.

The study concluded that LLMs cannot fully replace a human statistician but are "reliable tools for statistical advice" — especially for researchers in countries where access to statistical consultation is limited or financially demanding.

What this means in practice — and which model to choose?

The practical conclusion for researchers is clear: when selecting a statistical test, modern language models can be relied upon with a high degree of confidence. All tested models handle basic and advanced tests — from simple correlation to logistic regression and the Wilcoxon test for paired data.

The choice of a specific model depends on what you expect from the answer:

  • Do you need a clear explanation for students or junior researchers? Opt for Claude.
  • Are you looking for educational value and context on why a test is suitable? Gemini is your choice.
  • Do you want a quick answer with verified sources? Perplexity or Bing Chat perform well.

All mentioned models are available in Czech, although when submitting statistical queries, experts recommend formulating questions in English — the terminology is more precise, and the models are better trained on it.

Beware of limits: accuracy is not everything

Even 100% accuracy in test selection does not mean that the model can be blindly relied upon. The authors of both studies point out several important limitations:

Models may overlook data specifics. LLMs respond based on a textual description of the scenario — they cannot actually check the distribution of values, the presence of outliers, or the fulfillment of test assumptions (normality, homogeneity of variances). These must always be verified by the researcher themselves.
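
What such a check on the actual data might look like can be sketched in a few lines of Python using only the standard library. This is a crude screen, not a formal test: the skewness and variance-ratio thresholds below are illustrative assumptions, and the function names are hypothetical. Formal procedures such as the Shapiro-Wilk test for normality or Levene's test for equal variances (available, for example, as `scipy.stats.shapiro` and `scipy.stats.levene`) are the more rigorous option.

```python
from statistics import mean, pstdev, variance


def sample_skewness(values):
    """Moment-based skewness; values near 0 suggest a roughly symmetric sample."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return sum((x - m) ** 3 for x in values) / (len(values) * s ** 3)


def variance_ratio(group_a, group_b):
    """Larger sample variance divided by the smaller one."""
    var_a, var_b = variance(group_a), variance(group_b)
    if min(var_a, var_b) == 0:
        raise ValueError("variance ratio is undefined when a group is constant")
    return max(var_a, var_b) / min(var_a, var_b)


def screen_assumptions(group_a, group_b, max_abs_skew=1.0, max_var_ratio=4.0):
    """Rough screen before a two-sample t-test; thresholds are assumptions."""
    return {
        "roughly_symmetric": all(
            abs(sample_skewness(g)) <= max_abs_skew for g in (group_a, group_b)
        ),
        "similar_variances": variance_ratio(group_a, group_b) <= max_var_ratio,
    }


# Made-up example values, not data from either study.
control = [5.1, 4.9, 5.0, 5.2, 4.8]
treated = [5.0, 5.3, 4.7, 5.1, 4.9]
print(screen_assumptions(control, treated))
```

If either flag comes back `False`, that is a signal to look at the data more closely and to consider a non-parametric alternative, regardless of which test the model recommended from the text description alone.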

Hallucination risk. Language models can confidently recommend a test that is not ideal in a given context, especially for rarer or more advanced methods. It is therefore advisable to verify the model's recommendation against the methodological literature or to consult a colleague.

Model version matters. The pilot study tested ChatGPT 3.5 — the current GPT-4o version is significantly more powerful and would very likely achieve better results.

Czech and European perspective

For Czech researchers and medical students, these findings are particularly relevant. Access to quality biostatistical consultation is not a given in the Czech Republic — especially in smaller workplaces or doctoral programs. LLMs can serve as a free first advisor, guiding the researcher in the right direction before consulting with an expert.

All tested models are available in their basic versions for free to Czech users: ChatGPT at chat.openai.com, Claude at claude.ai, Gemini at gemini.google.com, Perplexity at perplexity.ai. Premium versions typically cost around 20 USD (approximately 450 CZK) per month.

From the perspective of the EU AI Act, these models fall under so-called general-purpose AI (GPAI) — they are therefore regulated at the level of transparency and safety, not as specialized medical AI systems. Stricter standards should still apply to clinical decision-making in healthcare.

Can an LLM like ChatGPT or Claude replace a statistician in research?

Not entirely. An LLM can reliably recommend the correct statistical test and explain its logic, but it cannot inspect the actual data, check their distribution, or verify that the test assumptions hold. It serves as a valuable first point of reference, not as a full replacement for expert consultation.

Which model is best for statistical advice?

According to the October 2025 study, in terms of test selection accuracy, all major models (ChatGPT, Claude, Gemini, DeepSeek, Grok, Le Chat) are equivalent — they achieved 100% success. Differences lie in the quality of explanation: Claude excels in clarity, Gemini in pedagogical value.

Is it safe to input sensitive healthcare data into ChatGPT or Claude for statistical queries?

No. For research queries, always use anonymized or fictitious data and never actual patient identifiers. Commercial versions of ChatGPT and Claude process data on servers in the USA, which can conflict with GDPR requirements when personal health data are involved.
