Curated Top Poster
Information Technology and Informatics
Sehyo Yune, MD, MPH, MBA (she/her/hers)
Beth Israel Deaconess Medical Center
Boston, Massachusetts
Disclosure information not submitted.
From March to April 2025, we presented the 36 hypothetical clinical cases used in the previous study to a local institutional chatbot (UW DLMP v0.9.3) built on the GPT-4o LLM. Three different prompts were used, and each prompt presented the same case twice, yielding a total of 6 sets of responses per case. The three approaches are shown in Table 1: prompt #1 asked for the case definition criteria, severity, and imputability using NHSN criteria; prompt #2 asked the same after providing the plain text of the NHSN protocol; and prompt #3 provided the protocol text along with each classification category and instructed the model to strictly follow the NHSN protocol.
Each response was obtained in a new chat to prevent the model from drawing on prior prompts. The responses were compared to an expert panel's classifications to assess accuracy. They were also compared to the published responses of a panel of transfusion medicine (TM) specialists as well as the results of the previous study. The case definition, severity, and imputability of each response were evaluated for their compliance with the NHSN criteria. Python 3.9.19 was used to run chi-square tests and obtain p-values comparing accuracies.
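The chi-square comparison of accuracies described above can be sketched as follows. This is a minimal illustration, not the study's analysis code: the chatbot's 137/222 correct count is taken from the abstract, but the comparison arm's counts (and its denominator) are illustrative assumptions, so the resulting p-value will not match the reported values.

```python
from scipy.stats import chi2_contingency

# Correct/incorrect counts per arm. The chatbot's 137/222 is from the
# abstract; the comparison arm's counts are hypothetical placeholders.
chatbot = [137, 222 - 137]       # 61.7% accuracy
comparison = [108, 222 - 108]    # assumed arm at ~48.6% accuracy

# 2x2 contingency table: rows = arms, columns = correct/incorrect
chi2, p, dof, expected = chi2_contingency([chatbot, comparison])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```

In practice, each pairwise comparison (chatbot vs. previous study, chatbot vs. TM specialists) would use that arm's actual denominator, which is why the abstract's p-values differ from this sketch.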
Overall, the 6 attempts achieved an average accuracy of 61.7% (137/222), compared to 48.6% in the previous study (p=0.19) and 72.1% by TM specialists (p=0.04). The transfusion-associated dyspnea category showed the lowest accuracy at 33.3%, while the acute/delayed hemolytic transfusion reaction category showed the highest at 96.7%, consistent with the lowest and highest accuracies among TM specialists in the same categories, at 36.4% and 81.8%, respectively.
Prompt #1 achieved an accuracy of 59.5%, compared to 62.2% for prompt #2 and 63.5% for prompt #3 (p=0.88). Prompts #2 and #3 showed 100% compliance with the NHSN criteria verbatim across case definition, severity, and imputability, while prompt #1 had compliance of 83% for case definition, 36% for severity, and 94% for imputability.