Imagine a situation where a patient arrives at the emergency room with a suspected stroke. A CT scan is the key tool for determining whether it is a hemorrhage or an infarct, and in such a moment every second counts. If an AI model were assisting the doctor at this point, its error could be fatal. This very problem is becoming the subject of intense research: a recent study published in the journal Cureus thoroughly tested how the most modern multimodal large language models (LLMs) handle this task.
Multimodal AI: When text meets image
To understand what this research is about, we must first explain the concept of a multimodal model. Unlike classic text models that work only with words, multimodal models (such as current versions of GPT-4o, Gemini 1.5 Pro, or Claude 3.5 Sonnet) can process several types of data at once: text, images, and video. In medicine, this means the model should be able to "look" at a CT scan and then write a textual description of it or propose a diagnosis.
In the study conducted by researchers from the University of Virginia and other institutions, these models were tested on the public PhysioNet dataset, which contains CT images with various types of intracranial hemorrhage (ICH). However, the results are somewhat sobering for AI enthusiasts.
Critical failure: The problem with "recall" (sensitivity)
The research focused on two main tasks: binary detection (is there bleeding in the image or not?) and classification of bleeding subtypes (e.g., subarachnoid vs. epidural). On the first task, binary detection, the model achieved an overall accuracy of only 0.52. That is essentially a coin flip and has no practical use in medicine.
From a technical point of view, however, the most revealing and at the same time worst parameter is recall (also known as sensitivity). Recall tells us what proportion of truly positive cases the model correctly identified. For brain bleeding, the model's recall was only 0.14, meaning it correctly detected only 14% of actual bleeding cases. The remaining 86% it missed, labeling them as healthy.
For doctors, this is absolutely unusable. In medicine, a situation where AI reports a problem that isn't there (a false positive) is far less dangerous than a false negative result, i.e., when AI says everything is fine while the patient has a brain bleed requiring immediate surgery.
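To make these metrics concrete, here is a minimal sketch of how accuracy and recall are computed from confusion-matrix counts. The counts below are illustrative numbers chosen to reproduce the study's reported figures, not the study's raw data:

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy and recall from confusion-matrix counts.

    tp: true positives (bleeds correctly flagged)
    fp: false positives (healthy scans flagged as bleeds)
    fn: false negatives (bleeds the model missed)
    tn: true negatives (healthy scans correctly cleared)
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total          # share of all cases judged correctly
    recall = tp / (tp + fn)               # share of real bleeds the model caught
    return accuracy, recall

# Illustrative counts: 14 of 100 actual bleeds detected -> recall 0.14,
# and overall accuracy 0.52, matching the figures reported above.
acc, rec = metrics(tp=14, fp=10, fn=86, tn=90)
print(round(acc, 2), round(rec, 2))  # 0.52 0.14
```

Note that accuracy alone hides the failure mode: a model that labels almost everything "healthy" can still score moderately on accuracy while its recall, the number that matters for a life-threatening bleed, collapses.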
Comparison with technology leaders
Although the study does not directly compare all models on the market, its results clearly show the difference between general multimodal models and specialized medical AI. Let's look at how current leaders in visual tasks stand:
- OpenAI GPT-4o: Extremely capable in general image interpretation (e.g., describing what's in a photo), but lacks in-depth knowledge of radiological nuances. Price: ChatGPT Plus costs 20 USD/month.
- Google Gemini 1.5 Pro: Offers a huge context window, allowing it to process a large number of images at once, but still suffers from low accuracy in specific medical diagnoses. Price: Gemini Advanced is available for approximately 20 EUR/month.
- Anthropic Claude 3.5 Sonnet: Excels in logical reasoning and text analysis, but its visual capabilities are still oriented towards general objects, not the subtle details on a CT scan. Price: Claude Pro costs 20 USD/month.
Practical impact: What does this mean for hospitals and patients?
This research tells us that we cannot rely on general chatbots in diagnostic processes. For hospitals, including those in the Czech Republic, this means that investments in AI should not be directed towards purchasing general licenses for doctors, but towards implementing specialized systems that have been trained exclusively on medical data and have undergone certification.
From a regulatory perspective, the European Union and its AI Act play a key role here. Medical AI systems fall into the high-risk category. This means that software intended to assist with diagnosis must meet extremely strict requirements for transparency, safety, and accuracy before it is even allowed for use in the EU. General models like GPT-4o currently do not meet these requirements for medical purposes.
Availability and the Czech context
In the Czech Republic, the first implementations of AI in radiology are already appearing, but these are based on algorithms such as convolutional neural networks (CNNs), which are specialized for image analysis, rather than on general language models. These systems are less "conversational," but they are orders of magnitude more reliable at detecting pathologies. For a Czech doctor, it is important to know that even with access to ChatGPT in Czech, its ability to "see" medical images is still experimental.
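To show what "specialized for image analysis" means at the lowest level, here is a sketch of the core operation a CNN is built from, a 2D convolution that slides a small kernel over the image. This is a toy illustration of the principle, not any vendor's actual system:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 2D convolution (valid padding): the basic building block of a CNN."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Dot product of the kernel with the image patch under it
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A hand-written edge-detection kernel responds at intensity boundaries,
# the kind of low-level feature a trained CNN learns automatically.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)
response = conv2d(image, kernel)
print(response)  # every row peaks at the 0->1 boundary: [0. 2. 0.]
```

A real diagnostic CNN stacks many such learned kernels with nonlinearities and pooling, which is why it can be trained to respond to hemorrhage patterns rather than simple edges.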
Conclusion: The path to expert AI
The study's outcome is clear: general LLMs are excellent assistants for writing reports, summarizing texts, or explaining terms to patients, but as diagnostic tools for acute conditions, they are currently dangerous. The future of medicine does not lie in wanting a chatbot to be a doctor, but in having specialized models that will have access to a gigantic amount of medical data and will function as a highly accurate filter for radiologists.
Can AI completely replace a radiologist in the future?
According to current trends and research, no. AI will function as a "second pair of eyes" – a tool that alerts doctors to suspicious areas, but the final diagnostic responsibility and interpretation of the complex clinical picture will always remain with a human.
Are these models capable of communicating in Czech when analyzing images?
The image analysis itself (of pixels) is universal. Models like GPT-4o or Gemini can generate a textual description in Czech very well. The problem is not the language, but the accuracy of the diagnosis itself.
What is the cost of implementing such AI in a hospital?
For general models, API calls are paid for (e.g., with OpenAI, it's the price per token). For specialized medical systems, it involves high investments in licenses, integration into the hospital information system (HIS), and certification according to EU rules, which can mean millions of Czech crowns.
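As a rough back-of-the-envelope sketch of what per-token API pricing implies for a hospital, the following estimate uses purely hypothetical numbers (scan volume, tokens per scan, price per 1,000 tokens are all assumptions; real prices vary by model and change often):

```python
def monthly_api_cost(scans_per_month, tokens_per_scan, price_per_1k_tokens):
    """Estimate monthly API spend under per-token pricing (illustrative only)."""
    return scans_per_month * tokens_per_scan / 1000 * price_per_1k_tokens

# Hypothetical values for illustration, not actual vendor pricing:
# 500 scans/month, ~2,000 tokens per scan, $0.01 per 1,000 tokens.
cost = monthly_api_cost(scans_per_month=500, tokens_per_scan=2000,
                        price_per_1k_tokens=0.01)
print(f"${cost:.2f} per month")  # $10.00 per month
```

The point of the sketch is the contrast: raw API calls are cheap, while the millions of crowns mentioned above go into integration, certification, and validation rather than compute.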