Curated Top Poster
Information Technology and Informatics
Sehyo Yune, MD, MPH, MBA (she/her/hers)
Beth Israel Deaconess Medical Center
Boston, Massachusetts
Disclosure information not submitted.
From March to April 2025, we presented the 36 hypothetical clinical cases used in the previous study to a local institutional chatbot (UW DLMP v0.9.3) built on the GPT-4o LLM. Three different prompts were used, and each prompt presented the same case twice, yielding a total of 6 sets of responses per case. The three approaches are shown in Table 1: prompt #1 asked for the case definition criteria, severity, and imputability using NHSN criteria; prompt #2 asked the same after providing the plain text of the NHSN protocol; and prompt #3 provided the protocol text along with each classification category and instructed the model to strictly follow the NHSN protocol.
Each response was obtained in a new chat to prevent the model from drawing on prior prompts. The responses were compared to an expert panel's classifications to assess accuracy. They were also compared to the published responses of a panel of transfusion medicine (TM) specialists as well as the results of the previous study. The case definition, severity, and imputability of each response were evaluated for their compliance with the NHSN criteria. Python 3.9.19 was used to run chi-square tests and obtain p-values comparing accuracies.
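The chi-square comparison of accuracies described above can be sketched as follows. This is a minimal illustration, not the study's analysis code: the chatbot's 137/222 correct count is taken from the abstract, but the comparison arm's counts (and its denominator) are illustrative assumptions, so the resulting p-value will not match the reported values.

```python
from scipy.stats import chi2_contingency

# Correct/incorrect counts per arm. The chatbot's 137/222 is from the
# abstract; the comparison arm's counts are hypothetical placeholders.
chatbot = [137, 222 - 137]       # 61.7% accuracy
comparison = [108, 222 - 108]    # assumed arm at ~48.6% accuracy

# 2x2 contingency table: rows = arms, columns = correct/incorrect
chi2, p, dof, expected = chi2_contingency([chatbot, comparison])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```

In practice, each pairwise comparison (chatbot vs. previous study, chatbot vs. TM specialists) would use that arm's actual denominator, which is why the abstract's p-values differ from this sketch.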
Overall, the 6 attempts achieved an average accuracy of 61.7% (137/222), compared to 48.6% in the previous study (p=0.19) and 72.1% by TM specialists (p=0.04). The transfusion-associated dyspnea category showed the lowest accuracy at 33.3%, while the acute/delayed hemolytic transfusion reaction category showed the highest at 96.7%, consistent with the lowest and highest accuracies among TM specialists in the same categories, at 36.4% and 81.8%, respectively.
Prompt #1 achieved an accuracy of 59.5%, compared to 62.2% for prompt #2 and 63.5% for prompt #3 (p=0.88). Prompts #2 and #3 showed 100% compliance with the NHSN criteria verbatim across case definition, severity, and imputability, while prompt #1 had compliance of 83% for case definition, 36% for severity, and 94% for imputability.