Publication

Comparative performance of large language models in structuring head CT radiology reports: multi-institutional validation study in Japan

Hirotaka Takita; Shannon L. Walston; Yasuhito Mitsuyama; Ko Watanabe; Shoya Ishimaru; Daiju Ueda
In: Japanese Journal of Radiology, Vol. 43, Pages 1445-1455, Springer, May 2025.

Abstract

Purpose: To compare the diagnostic performance of three proprietary large language models (LLMs)—Claude, GPT, and Gemini—in structuring free-text Japanese radiology reports for intracranial hemorrhage and skull fractures, and to evaluate the impact of three prompting approaches on model accuracy.

Materials and Methods: In this retrospective study, head CT reports from the Japan Medical Imaging Database collected between 2018 and 2023 were analyzed. Two board-certified radiologists independently reviewed each case and established the ground truth for intracranial hemorrhage and skull fractures through consensus. Each report was processed by three LLMs using three prompting strategies: Standard, Chain-of-Thought, and Self-Consistency prompting. Diagnostic performance metrics (accuracy, precision, recall, and F1 score) were calculated for each LLM–prompt combination and compared using McNemar’s tests with Bonferroni correction. Misclassified cases were further examined through qualitative error analysis.

Results: A total of 3,949 head CT reports from 3,949 patients (mean age, 59 ± 25 years; 56.2% male) were included. Among them, 856 patients (21.6%) had intracranial hemorrhage and 264 patients (6.6%) had skull fractures. All nine LLM–prompt combinations achieved high diagnostic accuracy. Claude demonstrated significantly higher accuracy for intracranial hemorrhage compared with GPT and Gemini, and also outperformed Gemini in detecting skull fractures (p < 0.0001). Gemini showed notable performance improvement with Chain-of-Thought prompting. Error analysis identified common challenges, including ambiguous phrasing and findings unrelated to intracranial hemorrhage or skull fractures, highlighting the importance of careful prompt design.

Conclusion: All three proprietary LLMs showed strong performance in structuring free-text head CT reports for intracranial hemorrhage and skull fractures. Although prompting strategies influenced diagnostic accuracy, all models demonstrated robust potential for clinical and research applications. Future studies should focus on prompt refinement and prospective validation in multilingual clinical settings.
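The evaluation pipeline described in the abstract—per-model diagnostic metrics plus pairwise McNemar tests on binary predictions—can be sketched in plain Python. This is a minimal illustration of the standard formulas, not the study's actual analysis code; the exact-binomial form of McNemar's test shown here is one common variant, and the function names are hypothetical:

```python
from math import comb

def metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = finding present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact (binomial) McNemar test comparing two models on the same cases.

    Only discordant pairs matter: cases where exactly one model is correct.
    Under the null, each discordant case favors either model with p = 0.5.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)  # two-sided p-value, capped at 1
```

With nine LLM–prompt combinations compared pairwise per finding, a Bonferroni correction simply multiplies each p-value by the number of comparisons (capped at 1) before testing against the significance threshold.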