TY - JOUR
T1 - Evaluation of Six Large Language Models for Clinical Decision Support
T2 - Application in Transfusion Decision-making for RhD Blood-type Patients
AU - Lee, Jong Kwon
AU - Choi, Sooin
AU - Park, Sholhui
AU - Hwang, Sang Hyun
AU - Cho, Duck
N1 - Publisher Copyright: © Korean Society for Laboratory Medicine.
PY - 2025
Y1 - 2025
AB - Background: Large language models (LLMs) have the potential for clinical decision support; however, their use in specific tasks, such as determining the RhD blood type for transfusion, remains underexplored. Therefore, we evaluated the accuracy of six LLMs in addressing RhD blood type-related issues in Korean healthcare. Methods: Fifteen multiple-choice and true/false questions, based on real-world transfusion scenarios and reviewed by specialists, were developed. The questions were administered twice to six LLMs (Clova X, Gemini 1.0, Gemini 1.5, ChatGPT-3.5, GPT-4.0, and GPT-4o) in both Korean and English. Results were compared against the performance of 22 transfusion medicine experts. For particularly challenging questions, prompt engineering was applied, and the questions were reevaluated. Results: GPT-4o demonstrated the highest accuracy rate in Korean (0.6), which differed significantly from those of Clova X and Gemini (P < 0.05). In English, the results were similar across all models. The transfusion experts achieved a higher accuracy rate (0.8). Among the five questions subjected to prompt engineering, only GPT-4o correctly responded to one, whereas the other models failed. All LLMs changed their responses or did not respond when the same question was repeated. Conclusions: GPT-4o showed the best overall performance among the models tested and may be beneficial in RhD blood product transfusion decision-making. However, its performance suggests that it may serve best in a supportive role rather than as a primary decision-making tool.
KW - Clinical decision support
KW - Large language models (LLMs)
KW - RhD blood type
KW - Transfusion
UR - https://www.scopus.com/pages/publications/105013870805
U2 - 10.3343/alm.2024.0588
DO - 10.3343/alm.2024.0588
M3 - Article
C2 - 40289855
AN - SCOPUS:105013870805
SN - 2234-3806
VL - 45
SP - 520
EP - 529
JO - Annals of Laboratory Medicine
JF - Annals of Laboratory Medicine
IS - 5
ER -