In the field of modern medicine, there is constant discussion about whether and how much artificial intelligence can help doctors in decision-making. The latest research, focusing on the specialization of otorhinolaryngology (ORL – i.e., ears, nose, and throat), has provided concrete data. The study used 250 anonymized patient cases to test five of the most prominent models currently available: ChatGPT-5.1, Gemini 3 Pro, Grok 4, LLaMA 4, and DeepSeek V4-R1.
Benchmark in Clinical Practice: Who Leads in the Model Showdown?
The researchers did not just evaluate whether a model "wrote something," but focused on strictly defined parameters: diagnostic accuracy, adherence to professional recommendations (guidelines), and, most importantly, safety. The evaluation was performed by two certified ORL specialists using a 6-point Likert scale, where 1 meant completely wrong and 6 meant excellent.
The results are unambiguous. ChatGPT-5.1 achieved an average score of 5.72 out of 6, clearly surpassing its competitors. The other models also showed a high level of expertise, but the difference in overall decision-making quality was statistically significant (p < 0.001).
Here is a brief comparison of performance in key domains according to available data:
- Diagnostic Accuracy: ChatGPT-5.1 achieved a top result of 5.81.
- Adherence to Professional Procedures: ChatGPT-5.1 scored 5.77.
- Safety (rate of incorrect or unsafe recommendations, lower is better): ChatGPT-5.1 (0.4%) vs. Grok 4 (2.4%).
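To make the reported numbers concrete, here is a minimal sketch of how such figures aggregate: a mean 6-point Likert score across ratings and an unsafe-recommendation rate across cases. The ratings below are hypothetical, illustrative data, not the study's actual dataset.

```python
from statistics import mean

# Hypothetical 6-point Likert ratings for a handful of cases
# (illustrative only, NOT the study's data).
ratings = [6, 6, 5, 6, 6, 5, 6, 6]

# Mean score across all case ratings (the study reports e.g. 5.72/6).
mean_score = mean(ratings)

# Safety is reported as the share of recommendations judged unsafe:
# here 1 unsafe case out of 250 gives 0.4%.
recommendations = ["safe"] * 249 + ["unsafe"] * 1
unsafe_rate = recommendations.count("unsafe") / len(recommendations)

print(f"mean Likert score: {mean_score:.2f}")  # 5.75
print(f"unsafe rate: {unsafe_rate:.1%}")       # 0.4%
```

The same arithmetic, applied to 250 cases and two raters, produces the headline figures quoted above.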
An interesting finding is that the Grok 4 model recorded the highest percentage of incorrect or unsafe recommendations, primarily due to errors in interpreting radiological findings or completely omitting key information. In contrast, ChatGPT-5.1 showed an extremely low rate of risky errors.
Technical Background: What Does This Mean for Medical AI?
To understand the results, it is important to define what these models can do. It's not about AI replacing doctors, but about it functioning as a Clinical Decision Support System. Models like LLaMA 4 or DeepSeek, while technologically fascinating and in many respects very capable in general tasks, still show slight fluctuations in consistency in specific medical subspecialties such as otology or rhinology.
In the context of the research, a high correlation was confirmed between a model's ability to adhere to professional standards and its diagnostic accuracy (r = 0.62). This suggests that models which "know the rules" also tend to interpret symptoms correctly. For developers, this means that future model training must focus primarily on structured medical data, not just general texts from the internet.
Availability and Price for Users in the Czech Republic
If you are interested in these results from a practical application perspective, it is important to know that all tested models are also available to users in the Czech Republic.
- ChatGPT (OpenAI): Available in Czech. For advanced features (including access to the latest version 5.1), a ChatGPT Plus subscription is required, costing approximately 20 USD (approx. 460 CZK) per month.
- Gemini (Google): Full integration into the Google ecosystem, available in Czech. Gemini Advanced subscription is part of the Google One AI Premium package (approx. 400 CZK/month).
- Llama 4 (Meta): As an open-source model, it is available for free to developers, but requires its own infrastructure or cloud services.
Practical Impact: What Does This Mean for Czech Medicine and the EU?
This research has a fundamental impact not only on technological development but also on legislation. The AI Act is already in force in the European Union, classifying systems used in medicine as high-risk. This means that any model that is to be officially used for diagnosis in Czech hospitals in the future must undergo an extremely strict certification process.
For the Czech medical community, this means that we will not yet see "AI doctors" in every practice, but very soon, assistants will begin to appear in hospital systems to help with report transcription, drug interaction checks, or X-ray image analysis based on the outputs of these models. The study's finding that ChatGPT-5.1 is almost flawless in safety is crucial for this implementation.
For the average user in the Czech Republic, however, there is a warning: even though these models are incredibly intelligent, they are still statistical text predictors. As the study shows, even with a top model, errors can occur, although they are very rare. In medicine, therefore, the principle of "Human-in-the-loop" – meaning a human who always checks the final decision – remains essential.
Can a doctor in the Czech Republic legally use ChatGPT for patient diagnosis?
No. ChatGPT is not currently certified as medical software. However, it can be used as an auxiliary tool for text analysis or preparing background materials, with final diagnostic responsibility remaining solely with the doctor.
Is ChatGPT-5.1 safe for use in Czech?
The study focused primarily on English professional texts and standard medical protocols. Although the model is very capable in Czech, professional review is still necessary when interpreting specific Czech medical terms, to avoid errors introduced by translating specialist terminology.
What is the main difference between the models in this study?
The main difference lay in the level of safety and adherence to rules. While ChatGPT-5.1 showed minimal errors (0.4%), models like Grok 4 had a significantly higher rate of incorrect recommendations (2.4%), especially in the area of image diagnostics interpretation.