
AI in radiology: Large language models translate medical reports. Which one is best?

When a patient receives a radiology report in a language they don't understand, the result can be a delayed diagnosis, unnecessary stress, or even incorrect treatment. With increasing migration, cross-border healthcare, and the rise of telemedicine, language barriers in healthcare are becoming an ever more pressing problem. A study published in the journal Radiology therefore tested 10 large language models as translators of radiology reports across nine languages. The results show that AI handles translation surprisingly well, but it is not yet ready for deployment in clinical practice.

Why translating medical reports is so important

Radiology reports, whether from CT, MRI, or X-ray, contain highly specialized terminology that even native speakers often don't fully understand. Add a language barrier, and the situation becomes dramatically more complicated. According to a study published in December 2024 in the journal Radiology (published by the Radiological Society of North America), this problem affects millions of patients annually. With the refugee waves that have reached Europe in recent years, including the Czech Republic, and with the growing popularity of cross-border medical consultations within the EU, the need for high-quality translation is more urgent than ever. Human translators specializing in medical terminology are, however, rare and expensive. This is exactly where large language models step in: not as a replacement, but as an assistant that can provide a rough first translation within seconds.

How the study worked: 10 models, 9 languages, 100 reports

The research team led by Ken Bressem from the German Heart Center in Munich (Institut für kardiovaskuläre Radiologie und Nuklearmedizin) created a set of 100 fictional radiology reports from CT and MRI examinations. They had these translated by 18 radiologists into nine languages and then assigned the same task to ten large language models. The tested models included both commercial and open-source solutions:

- GPT-4 and GPT-3.5 (OpenAI)
- Llama 2 70B and Llama 3 70B (Meta)
- Mixtral 8x7B, Mixtral 8x22B, Mistral 7B, and Mistral Large (Mistral AI)
- Qwen1.5 72B (Alibaba)
- Yi-34B (01.AI)

Languages were divided into two categories: high-resource (English, Italian, French, German, Chinese) and low-resource (Swedish, Turkish, Russian, Greek, Thai). This division is crucial, because LLMs trained predominantly on English data often struggle significantly with less-represented languages.

Translations were evaluated using three standard automatic metrics:

- BLEU: overlap with the human translation at the word and phrase level
- TER: translation error rate, that is, how many edits are needed to reach the human translation
- chrF++: similarity at both the character and the word level
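To make the metrics above more concrete, here is a minimal Python sketch of simplified, sentence-level versions of BLEU and TER. It is not the study's evaluation code: the function names `simple_bleu` and `simple_ter` are hypothetical, the BLEU variant uses add-one smoothing, and the TER variant counts only word insertions, deletions, and substitutions (real TER also allows block shifts). chrF++ is left out for brevity.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale (assumption: add-one smoothing)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each n-gram counts at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty punishes hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Simplified TER in percent: edits divided by reference length, without shifts."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else 100.0
    return 100 * word_edit_distance(hyp, ref) / len(ref)


reference = "no evidence of acute fracture"
machine = "no evidence of acute fissure"
print(simple_bleu(machine, reference), simple_ter(machine, reference))
```

In this toy pair a single substituted word gives a simplified TER of 20.0 while the smoothed BLEU stays fairly high, which hints at why automatic scores alone say little about the clinical weight of one wrong term.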

Who won? GPT-4 reigns, but no universal solution exists

Overall, the best results were achieved by GPT-4, which excelled particularly in translating from English to German (BLEU 35.0), Greek (32.6), Thai (53.2), and Turkish (35.5). GPT-3.5 was best for English-to-French translation (BLEU 55.4), Qwen1.5 dominated in the English-to-Chinese direction (BLEU 45.7), and Mixtral 8x22B shone in Italian-to-English translation (BLEU 63.9).

The study's key finding is that there is no universal model that is best for all languages. LLM performance strongly depends on the data the model was trained on. Qwen1.5 excelled in Chinese precisely because it was trained on more than 2.2 trillion tokens predominantly in English and Chinese. Models with predominantly English training, on the other hand, failed with languages that have a different structure: Yi-34B, for example, achieved a BLEU score of just 4.1 out of 100 when translating into Greek.

The difference in translation direction is also interesting. Translation into English was generally more accurate than translation from English, which the researchers attribute to the structural similarity of English with Romance languages and the overall English bias of most models.
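Because no single model wins everywhere, the per-direction winners reported above can be read as a small routing table. The Python sketch below is only an illustration under stated assumptions: the ISO 639-1 language codes, the names `BEST_BY_DIRECTION`, `pick_model`, and `models_in_use`, and the idea of routing by BLEU winner are not from the study; only the model names and scores are taken from the text.

```python
# Best BLEU score per translation direction, as reported in the article.
# Keys are (source, target) pairs using ISO 639-1 codes (an assumption of this sketch).
BEST_BY_DIRECTION = {
    ("en", "de"): ("GPT-4", 35.0),
    ("en", "el"): ("GPT-4", 32.6),
    ("en", "th"): ("GPT-4", 53.2),
    ("en", "tr"): ("GPT-4", 35.5),
    ("en", "fr"): ("GPT-3.5", 55.4),
    ("en", "zh"): ("Qwen1.5 72B", 45.7),
    ("it", "en"): ("Mixtral 8x22B", 63.9),
}


def pick_model(source, target, fallback=None):
    """Return the best-scoring model for a direction, or the fallback if it was not reported."""
    entry = BEST_BY_DIRECTION.get((source, target))
    return entry[0] if entry else fallback


def models_in_use():
    """List the distinct models that win at least one direction."""
    return sorted({model for model, _score in BEST_BY_DIRECTION.values()})


print(pick_model("en", "zh"))   # Qwen1.5 72B
print(pick_model("en", "cs"))   # None: this direction is not covered by the reported scores
print(models_in_use())
```

Four different models appear as winners across just seven directions, and any direction outside the table, such as English to Czech, has no evidence-based choice at all.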

Qualitative assessment: Comprehensibility yes, terminology lags

In addition to automatic metrics, radiologists also conducted a qualitative assessment on a five-point Likert scale across five criteria. The results were encouraging in many respects, but they also revealed a fundamental weakness. Models achieved very good ratings for comprehensibility and readability (median 4.0 out of 5) and for consistency with the original meaning (4.2). The weakest category was accuracy of medical terminology, with a median of just 3.4: a grade of "good," not "excellent."

And it is precisely inaccuracies in specialized terminology that are the most dangerous in medicine. Confusing terms like "malignant" and "benign" or "fracture" and "fissure" can have fatal consequences. The study authors explicitly warn that none of the tested models is approved for medical use and that the results are purely experimental. In the supplementary materials, they provide specific examples of dangerous translation errors across different languages.

What this means for the Czech Republic and Europe

For Czech patients and healthcare facilities, this study has several practical implications:

Cross-border care in the EU. Within the European Union, you have the right to planned healthcare in another member state. If you get an MRI in Germany and bring the report to a Czech doctor, the language barrier is obvious. LLM translators could provide quick initial orientation, with the understanding that the final word must come from a qualified physician.

Refugee healthcare. The Czech Republic has repeatedly encountered influxes of patients speaking Ukrainian, Vietnamese, or Arabic in recent years. Automated translation of medical reports could significantly speed up diagnosis and reduce the burden on healthcare staff. Unfortunately, none of the ten tested models was evaluated for these specific language pairs: the study focused on nine languages, and Ukrainian, Vietnamese, and Arabic were not among them.

EU AI Act. From August 2026, the European AI regulation will be fully in force, classifying AI systems in healthcare as high-risk. This means that any LLM deployed for translating medical reports will have to undergo strict certification and meet requirements for transparency, accuracy, and human oversight. Without that, it won't make it into European, and therefore Czech, hospitals.

Current state in June 2026: What has changed

The study tested models available in the first half of 2024. Since then, there has been a significant shift that gives the results new context:

GPT-4o, GPT-5.5, and more. OpenAI models have gone through several generations of improvements. GPT-4o brought native multimodal capabilities and better multilingual support, while GPT-5.5 significantly improved context understanding and specialized terminology. It can be assumed that current models would perform even better at translating medical reports, although no comparable study has yet verified this.

Claude 3.5, 4, and Opus 4.8. Anthropic's Claude excels at understanding subtle nuances and context, which is crucial for medical translation. Claude Opus 4.8 additionally brought the ability to admit uncertainty ("I'm not sure about this"), which is far safer in medicine than a confident hallucination.

European specialized models. The research community is working on domain-specific medical language models. The Medical mT5 project is training models for the medical domain in several European languages including French, Italian, and Spanish. For Czech, a similar specialized model does not yet exist. This is an opportunity for Czech AI research, for example building on the newly established Czech AI Factory in Ostrava.

Practical use: When will it be safe?

Experts agree that the path to safely deploying LLM translators in medicine will go through three key steps:

Fine-tuning on medical data. General models trained on internet texts lack deep understanding of medical terminology. Specialized fine-tuning on corpora of medical texts, ideally multilingual, can significantly increase the accuracy of specialized terminology.

Human oversight as standard. Even the best model will need review by a qualified physician. The ideal scenario is "AI proposes the translation, a human approves," much like an autonomous vehicle where the driver still holds the wheel. The Radiology study clearly showed that even the best models produce terminological errors.

Regulatory framework. Without certification under the EU AI Act and the Medical Device Regulation (MDR), LLM translators will not make it into European hospitals. And rightly so: human health is at stake.
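The "AI proposes the translation, a human approves" step described above could be sketched as a simple review gate. This is a hypothetical Python illustration, not a certified workflow or anything the study prescribes: the names `DraftTranslation`, `propose`, `approve`, and `CRITICAL_TERMS` are invented for this sketch, and the term list contains only the four example words mentioned in the article.

```python
from dataclasses import dataclass, field

# Hypothetical watch list: terms whose confusion the article names as dangerous.
CRITICAL_TERMS = {"malignant", "benign", "fracture", "fissure"}


@dataclass
class DraftTranslation:
    source_text: str
    machine_text: str
    status: str = "pending_review"  # a draft is never approved by default
    flags: list = field(default_factory=list)
    reviewer: str = ""


def propose(source_text, machine_text):
    """Wrap a machine translation as a draft and flag critical terms found in the source."""
    draft = DraftTranslation(source_text, machine_text)
    words = {word.strip(".,;:()").lower() for word in source_text.split()}
    draft.flags = sorted(words & CRITICAL_TERMS)
    return draft


def approve(draft, reviewer):
    """Release a draft only when a named human reviewer signs it off."""
    if not reviewer:
        raise ValueError("a named reviewer is required")
    draft.status = "approved"
    draft.reviewer = reviewer
    return draft


draft = propose("No malignant lesion. Old fracture of the rib.", "<machine translation here>")
print(draft.status, draft.flags)  # pending_review ['fracture', 'malignant']
approve(draft, "Dr. Example")
print(draft.status)               # approved
```

The design choice is that the default state is "pending_review," so a translation cannot reach a patient record unless a person explicitly approves it, and the flags point the reviewer at the terms that deserve a second look.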

Conclusion: Huge potential, but still with reservations

Large language models have demonstrated that they can translate radiology reports with surprising accuracy. GPT-4 and other large models handle most languages significantly better than smaller open-source alternatives. But medicine isn't an e-shop — a translation error can have fatal consequences. The Radiology study is an important milestone that shows the way forward. It confirms that LLMs have the potential to help millions of patients overcome language barriers in access to healthcare. At the same time, it clearly states: for now, this is an experiment, not a tool for clinical practice. Before we see LLM translators in Czech hospitals, there's still a lot of work ahead — on models, data, and regulation.

Can a patient have their medical report translated by ChatGPT or another LLM on their own?

Technically yes, but we definitely do not recommend it. General-purpose LLMs are not certified for medical use and, according to the Radiology study, even the best of them produce errors in specialized terminology. Moreover, patients may not recognize an error. If you need a medical report translated, contact a professional translator specializing in medicine. An LLM can only serve for rough orientation, never as a basis for medical decisions.

Which languages are most problematic for LLM translation of medical texts?

According to the study, the most problematic are so-called low-resource languages, that is, languages for which little training data exists. Models scored worst when translating into Greek (Yi-34B reached a BLEU score of just 4.1) and Thai. Czech is also problematic: it falls among medium-resource languages, and there are still very few models trained on high-quality Czech medical texts. Generally, the more training data that exists in a given language, the better the LLM translation will be.

Will AI translators in hospitals ever be free, or will hospitals have to buy them at high cost?

It depends on the model. Open-source models like Llama or Mixtral can be run without license fees on one's own infrastructure, which is attractive for hospitals that want to keep data under their control. Commercial models (GPT, Claude) are billed by the volume of translated text. Given the sensitivity of healthcare data, it can be expected that European hospitals will prefer locally run open-source models certified under the EU AI Act, but this is not yet happening anywhere. The first pilot projects in the EU can be expected within a 2–3 year horizon.
