Department of Laboratory Medicine, University of California San Francisco, San Francisco, CA, USA
Background/Case Studies: Accurate classification of transfusion reactions is essential for patient care, donor management, and hemovigilance, yet significant variability exists even among transfusion medicine (TM) specialists. Generative large language models (LLMs), which vary in their training data and reasoning capabilities, have the potential to improve transfusion reaction assessment. This study evaluated the accuracy of multiple state-of-the-art LLMs in classifying post-transfusion events and compared the results to the previously published performance of TM specialists.
Study Design/Methods: Seven advanced LLMs (ChatGPT-4o, -o3, -o4-mini; Claude 3.7 Sonnet, 3 Opus; Gemini 2.5 Flash, 2.5 Pro) were independently prompted to classify 36 transfusion scenarios from the AABB validation study of the CDC’s NHSN Hemovigilance Module (AuBuchon et al., Transfusion, 2014). Each scenario was submitted to a fresh model instance to avoid prompt chaining. Models were not given the NHSN criteria, nor were they told that some cases might not meet classification thresholds. Responses were scored against the AABB expert labels and compared to the published performance of 22 TM specialists across four NHSN parameters: reaction type, case definition, severity, and imputability.
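The evaluation protocol above can be sketched in code. This is an illustrative outline only, assuming a generic model interface: the `classify` stub, the scenario text, and the label values are hypothetical placeholders, not the study's actual prompts, cases, or API calls; a real run would send each scenario to a fresh LLM session with no prior context.

```python
# Hedged sketch of the scoring protocol: each scenario is sent to a
# fresh model instance (no prompt chaining) and the response is scored
# against expert labels on the four NHSN parameters.

NHSN_PARAMETERS = ["reaction_type", "case_definition", "severity", "imputability"]

def classify(scenario_text):
    """Hypothetical stand-in for a single, stateless LLM call.
    A real implementation would call a model API here."""
    return {
        "reaction_type": "allergic",
        "case_definition": "definitive",
        "severity": "non-severe",
        "imputability": "probable",
    }

def score_case(response, expert_label):
    """Count parameter-level agreements for one scenario."""
    return sum(response[p] == expert_label[p] for p in NHSN_PARAMETERS)

def evaluate(scenarios):
    """Return overall parameter-level accuracy across all scenarios."""
    correct = total = 0
    for case in scenarios:
        correct += score_case(classify(case["text"]), case["expert_label"])
        total += len(NHSN_PARAMETERS)
    return correct / total

# Toy example with one invented scenario (not from the AABB set):
scenarios = [{
    "text": "Urticaria and pruritus 20 minutes into a plasma transfusion.",
    "expert_label": {
        "reaction_type": "allergic",
        "case_definition": "definitive",
        "severity": "non-severe",
        "imputability": "definite",
    },
}]
print(evaluate(scenarios))  # 3 of 4 parameters agree -> 0.75
```

In the study itself, per-parameter accuracies were also reported separately (reaction type, case definition, severity, imputability), which this single pooled accuracy collapses for brevity.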
Results/Findings: Compared to the published 72.1% overall accuracy in transfusion reaction classification among TM specialists, the seven LLMs averaged 62.9% accuracy (95% CI, 57.5–68.3%; range 51.4–73.0%; Table 1). On average, LLMs outperformed TM specialists in identifying allergic reactions (+21.6%), transfusion-associated circulatory overload (+27.9%), transfusion-related acute lung injury (+13.2%), and delayed serologic transfusion reactions (+15.0%); however, they were less accurate, on average, at classifying febrile non-hemolytic transfusion reactions (–15.2%) and non-reaction cases (–44.0%). TM specialists and LLMs performed similarly for scenarios adjudicated as acute hemolytic transfusion reactions, hypotensive transfusion reactions, transfusion-associated dyspnea, delayed hemolytic transfusion reactions, and transfusion-associated graft-versus-host disease. On average, LLMs were less accurate than TM specialists when assessing case definition (53.5% vs 76.5%), severity (48.3% vs 72.5%), and imputability (35.5% vs 64.4%).
Conclusions: This study demonstrates that recent LLM iterations show both promise and important limitations in classifying post-transfusion adverse events. The seven LLMs compared in this investigation did not perform equally when classifying previously published case scenarios, and overall they were less accurate than TM specialists at applying NHSN nomenclature and identifying non-reaction scenarios. These findings set the stage for future research, including assessment of the training, validity, and utility of LLMs in the appraisal of real-world cases. Continued rigorous study of LLM applications in TM practice is needed.