Examining LLM Assistance in Brain Tumor MRI Interpretation

02/24/2026
A recent Journal of Clinical Medicine study compared two large language models—ChatGPT‑4o and Claude 3.5 Sonnet—with human readers interpreting brain tumor MRI, using 127 histologically confirmed brain tumor cases.
The comparison included two board-certified neuroradiologists and three radiology trainees, and it also tested a second phase in which readers re-evaluated cases after seeing model-provided differentials.
In a constrained reading environment, each case was represented by a single slide of selected “representative” MRI images. Both LLMs received that slide plus an accompanying structured radiologic report; the human readers reviewed the same image slide but without the structured report and with no clinical information beyond age and sex. All participants generated up to three differential diagnoses per case, and investigators scored outputs two ways: “primary diagnosis” accuracy, meaning the single top-ranked diagnosis matched the histologic reference standard, and “top-three differential” accuracy, meaning the correct diagnosis appeared anywhere in the three-item list.
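The two scoring rules amount to top-1 and top-3 accuracy over ranked differential lists. A minimal sketch, using hypothetical cases rather than any data from the study (the function and example diagnoses are illustrative assumptions):

```python
# Illustrative scoring of ranked differentials (hypothetical data, not
# the study's actual cases). "Primary diagnosis" accuracy counts a case
# only if the top-ranked diagnosis matches the histologic reference;
# "top-three differential" accuracy counts it if the reference appears
# anywhere in the (up to) three-item list.

def score_reads(reads):
    """reads: list of (reference_dx, ranked_differentials) pairs."""
    n = len(reads)
    primary = sum(1 for ref, diffs in reads if diffs and diffs[0] == ref) / n
    top3 = sum(1 for ref, diffs in reads if ref in diffs[:3]) / n
    return primary, top3

# Hypothetical reads: reference diagnosis vs. a ranked three-item list.
example = [
    ("glioblastoma", ["glioblastoma", "metastasis", "lymphoma"]),   # top-1 hit
    ("meningioma",   ["schwannoma", "meningioma", "metastasis"]),   # top-3 hit
    ("lymphoma",     ["glioblastoma", "metastasis", "abscess"]),    # miss
]
primary_acc, top3_acc = score_reads(example)  # 1/3 and 2/3 here
```

The gap between the two numbers is what the study's headline figures trade on: a reader can include the correct diagnosis in the differential far more often than they rank it first.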
Across all cases, Claude 3.5 Sonnet achieved primary diagnosis accuracy of 50.4% and top-three differential accuracy of 85.0%, while ChatGPT‑4o achieved 44.9% and 82.7%, respectively; the article reports that these between-model differences were not statistically significant. In the same comparison set, board-certified radiologists were reported to have higher primary diagnosis accuracy than the LLMs, while their top-three differential accuracy (80.7%) was described as similar to LLM performance (82.7–85.0%).
When readers repeated the task after being shown the LLM-generated differential diagnoses, within-reader accuracy improved, with gains that differed by experience level. For trainees, primary diagnosis accuracy increased from 48.0% to 58.8% and top-three differential accuracy increased from 62.5% to 81.1%, with both changes reported as statistically significant. For board-certified radiologists, top-three differential accuracy increased from 80.7% to 90.2%.
The authors also outline cautions about how well these findings may generalize beyond the study setup. They note that the LLMs had access to structured report inputs that the radiologists did not, and that all readers—including the LLMs—worked from a single representative image slide rather than full volumetric MRI review, conditions they describe as limiting workflow realism. Additional limitations include potential “information leakage” from structured descriptors correlated with tumor type, use of a single inference per model without assessing response variability, and the absence of a formal multi-reader multi-case framework, alongside a case mix restricted to primary brain tumors.
