Grok and Copilot Pick Statistical Tests More Accurately Than ChatGPT: New Study Reveals Who to Trust

June 30, 2026 Daniel Cesak

Every researcher knows the feeling: you stand before a data table and must choose the right statistical test. A mistake at this stage can waste months of work and lead to misleading conclusions, especially in healthcare, where lives are at stake. A new study published in the medical journal Cureus therefore tested whether large language models could help with this thankless task. The results were surprising: Grok by xAI and Microsoft Copilot each correctly identified 34 out of 40 tests (85%), Google Gemini followed closely at 32 out of 40 (80%), and ChatGPT, despite being the most famous, came in last with 75% accuracy. The traditional search engines Google and Bing failed to recommend a single correct test.

Why Choosing the Right Statistical Test Is So Critical

Healthcare research is built on data. But for data to actually mean something, it must be analyzed using the correct method. Using the wrong statistical test can lead to false-positive results (claiming a drug works when it does not) or, conversely, to overlooking a real effect. In clinical practice, this can endanger patients.

The problem is that choosing the right test is far from trivial. It depends on the type of variables (continuous vs. categorical), data distribution (normal vs. non-normal), the number of compared groups, study design, and many other factors. Even experienced researchers sometimes get it wrong — and this is exactly where AI could play the role of an assistant.
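To make those factors concrete, here is a minimal sketch of a textbook-style decision rule. It is not the procedure used in the study: the `Scenario` fields and the `suggest_test` function are hypothetical names introduced for illustration, and the mapping covers only a handful of common cases while ignoring study design, sample size, and many other considerations.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical, heavily simplified description of an analysis."""
    outcome: str          # "continuous" or "categorical"
    normal: bool          # is a continuous outcome roughly normally distributed?
    groups: int           # number of groups being compared
    paired: bool = False  # same subjects measured in every group?


def suggest_test(s: Scenario) -> str:
    """Return a common textbook test for the scenario (illustrative only)."""
    if s.groups < 2:
        raise ValueError("need at least two groups to compare")
    if s.outcome == "categorical":
        if not s.paired:
            return "chi-square test of independence"
        # Paired categorical data: McNemar for two groups,
        # Cochran's Q for more (binary outcomes).
        return "McNemar's test" if s.groups == 2 else "Cochran's Q test"
    if s.outcome != "continuous":
        raise ValueError(f"unknown outcome type: {s.outcome!r}")
    if s.groups == 2:
        if s.normal:
            return "paired t-test" if s.paired else "independent-samples t-test"
        return "Wilcoxon signed-rank test" if s.paired else "Mann-Whitney U test"
    if s.normal:
        return "repeated-measures ANOVA" if s.paired else "one-way ANOVA"
    return "Friedman test" if s.paired else "Kruskal-Wallis test"


print(suggest_test(Scenario("continuous", normal=False, groups=2)))
print(suggest_test(Scenario("continuous", normal=True, groups=3)))
```

Even this toy version needs four inputs and eleven outcomes, which hints at why the real decision is easy to get wrong.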

How the Study Was Conducted

Authors Michael Paolella and Aditya Tadinada selected 40 published scientific articles across four of the most common study designs: systematic reviews, randomized controlled trials, cohort studies, and case-control studies (10 of each type). From each article, they extracted the primary research question and the statistical method used, then created a standardized prompt describing the research scenario.

These prompts were then submitted to four large language models — ChatGPT (OpenAI), Google Gemini, Microsoft Copilot, and Grok (xAI) — as well as two traditional search engines (Google and Bing). The model responses were compared with the statistical tests actually used by the original study authors. Accuracy was defined as agreement between the model's recommendation and the actual test used.

Results: Grok and Copilot Lead, ChatGPT Last

The numbers speak clearly. Grok and Microsoft Copilot both achieved 85% accuracy (34 correct recommendations out of 40), followed by Google Gemini at 80% (32/40) and ChatGPT at 75% (30/40). Traditional search engines Google and Bing could not recommend a single test that matched what researchers actually used — their results were practically unusable for this task.

Interestingly, the gap between the best and worst LLM was just 4 tests out of 40, or 10 percentage points. This suggests that all tested models have a basic grasp of statistical methodology, with Grok and Copilot matching the published test slightly more often. The study authors emphasize that while 75–85% accuracy is promising, it is still insufficient for use in real research: between roughly one in seven and one in four recommendations was wrong.
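The arithmetic behind these figures is easy to reproduce. The short sketch below uses only the counts reported in the article; the variable names are illustrative.

```python
TOTAL = 40  # prompts submitted to each model

# Correct recommendations per model, as reported in the article.
correct = {
    "Grok": 34,
    "Microsoft Copilot": 34,
    "Google Gemini": 32,
    "ChatGPT": 30,
}

# Accuracy is the share of prompts where the model matched the published test.
accuracy = {model: n / TOTAL for model, n in correct.items()}

for model, acc in accuracy.items():
    wrong = TOTAL - correct[model]
    # "1 in N wrong" is the reciprocal of the error rate.
    print(f"{model}: {acc:.0%} correct, {wrong} wrong (about 1 in {TOTAL / wrong:.1f})")

gap_tests = max(correct.values()) - min(correct.values())
gap_points = 100 * gap_tests / TOTAL
print(f"Gap between best and worst: {gap_tests} tests = {gap_points:.0f} percentage points")
```

With these counts, the error rate ranges from about 1 in 6.7 recommendations (Grok, Copilot) to 1 in 4 (ChatGPT).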

Why Traditional Search Engines Failed Completely

The result for Google and Bing is perhaps the most surprising part of the study. Whereas LLMs can interpret a description of a research scenario and recommend a specific statistical method, traditional search engines only return links to existing pages and provide no synthesis or recommendation. Now that Google is increasingly integrating AI Overviews into its search, the result highlights that gap, and it will be interesting to watch whether AI-powered search catches up with specialized chat models on similar tasks.

What This Means for Researchers — Including Czech Ones

For academics, doctoral students, and research teams in the Czech Republic and across Europe, the takeaway is clear: LLMs can serve as a useful first step in choosing a statistical method, but they must not be the final authority. In other words — asking AI is faster than flipping through a textbook, but the answer should always be verified by someone with statistical training.

At Czech universities and research institutions such as Masaryk University, Charles University, or the Brno-based RECETOX, hundreds of studies are conducted annually where such an AI assistant could save hours of work. What's more, all four tested models support Czech — the prompt can be entered in Czech and the model responds in the same language, removing the language barrier for researchers less proficient in English.

The economic aspect is also interesting. While Copilot is part of Microsoft 365 (from approximately 170 CZK/month with a subscription), ChatGPT has a free version, Gemini also offers a free tier, and Grok is available through an X Premium+ subscription (approximately 380 CZK/month). All four models can thus be used for free or at relatively low cost, which is good news for the academic sphere where software budgets tend to be tight.

Study Limitations Worth Knowing

The study has several limitations that the authors openly acknowledge. First, accuracy was defined as agreement with the original article, but that does not mean the original authors themselves used the optimal test. Even published studies commonly contain methodological errors.

Second, the sample of 40 articles is relatively small and limited to four study types. In healthcare research, there is a much wider range of designs — from cross-sectional studies through meta-analyses to diagnostic accuracy studies. How LLMs would perform in these more complex scenarios remains unknown.
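One way to see what a sample of 40 means in practice is to put a confidence interval around each accuracy figure. The sketch below uses the Wilson score interval, a standard method for binomial proportions. This calculation is not part of the study; it is an illustration added here, and the function name `wilson_interval` is an assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts reported in the article: best (34/40) and worst (30/40) LLM.
for model, hits in [("Grok / Copilot", 34), ("ChatGPT", 30)]:
    low, high = wilson_interval(hits, 40)
    print(f"{model}: {hits}/40, approx. 95% interval {low:.0%} to {high:.0%}")
```

With these inputs the intervals come out at roughly 71–93% and 60–86%, so they overlap heavily. Overlapping intervals are not a formal comparison (the models answered the same 40 prompts, which calls for a paired analysis), but they show how little a 4-test gap can establish on its own.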

Finally, the study tested the models only at a specific point in time, and LLMs are constantly evolving. The version of ChatGPT available in June 2026 is not the same one that would have been tested six months earlier. This is why the authors recommend regular re-evaluation to track how model accuracy changes over time.

Broader Context: AI in Healthcare Research

This study fits into a growing trend of evaluating LLMs in healthcare. In the last two years, dozens of papers have been published testing AI models' ability to diagnose diseases, analyze medical images, or — as in this case — assist with research methodology. According to a systematic review published in ACM Transactions on Multimedia Computing (2026), 2026 is a turning point: LLMs are no longer just experimental toys but are becoming practical tools for clinical and research workflows.

At the same time, as AI use grows in sensitive areas like healthcare, so do regulatory requirements. The EU AI Act, which entered into force in August 2024 and whose obligations have been phasing in since 2025, treats many AI systems used in healthcare as high-risk. An AI tool that might in the future assist with selecting statistical methods for clinical trials could therefore be subject to strict conformity assessment and oversight. Czech research institutions considering deploying such a tool should plan for this regulatory framework.

Can I rely on AI recommendations when choosing a statistical test?

For now, only as a rough guide. Even the best models in this study (Grok and Copilot) got it wrong in 15% of cases. In healthcare research, where a wrong analysis can have serious consequences, an AI recommendation should always be verified by someone with statistical training. Use LLMs as a quick first opinion, not as a definitive authority.

Which of the tested models is most practical for a Czech researcher?

It depends on your preferences and budget. ChatGPT and Gemini offer solid free versions with Czech language support. Microsoft Copilot is advantageous if you already pay for Microsoft 365 (from 170 CZK/month). Grok requires an X Premium+ subscription (approx. 380 CZK/month). All four models understand Czech and can respond in Czech, so the language barrier isn't an issue.

How can I tell if the AI recommended the correct test?

Ask the model for its reasoning — why did it recommend this particular test? If it provides specific arguments (normal data distribution → t-test, ordinal data → Mann-Whitney, etc.), you can verify them in a statistics textbook or with a statistician colleague. A good practice is also to ask two different models and compare their answers — if they agree, it increases the likelihood that the recommendation is correct.
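The two-model comparison can be sketched in a few lines. The example below assumes you have already copied each model's recommended test into a dictionary by hand; it calls no real chatbot API. The `ALIASES` table and the `cross_check` function are hypothetical and would need extending for real use, since models name the same test in many different ways.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Hypothetical alias table: maps spellings a model might use to one canonical name.
ALIASES = {
    "student's t-test": "t-test",
    "independent t-test": "t-test",
    "mann-whitney u": "mann-whitney",
    "mann-whitney u test": "mann-whitney",
    "wilcoxon rank-sum": "mann-whitney",
}


def canonical(name: str) -> str:
    """Lower-case the name and collapse known aliases."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def cross_check(answers: Dict[str, str]) -> Tuple[Optional[str], bool]:
    """Return (most common test, True if every model named the same test)."""
    counts = Counter(canonical(a) for a in answers.values())
    if not counts:
        return None, False
    test, votes = counts.most_common(1)[0]
    return test, votes == len(answers)


answers = {"model_a": "Mann-Whitney U test", "model_b": "Wilcoxon rank-sum"}
print(cross_check(answers))  # ('mann-whitney', True)
```

Agreement only raises confidence. Two models can share the same mistake, so the result still needs checking against a textbook or with a statistician.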