Introduction: Large Language Models (LLMs), such as OpenAI's GPT family, have demonstrated transformative potential across many fields, especially medicine. These models can understand and generate contextual text and adapt to new tasks without task-specific training. This versatility could reshape clinical practice by enhancing documentation, patient interaction, and decision-making. In oncology, LLMs could significantly improve patient care through continuous monitoring of chemotherapy-induced toxicities, a task that routinely exceeds the capacity of human resources alone. However, existing research has not sufficiently explored how accurately LLMs identify and assess subjective toxicities from patient descriptions. This study aims to fill that gap by evaluating the ability of LLMs to classify these toxicities accurately, thereby supporting personalized and continuous patient care.

Methods: This comparative pilot study assessed the ability of an LLM to classify subjective toxicities from chemotherapy. Thirteen oncologists evaluated 30 fictitious cases created with expert knowledge and OpenAI's GPT-4. Their evaluations, based on the CTCAE v.5 criteria, were compared with those of a contextualized LLM. The mode and mean of the oncologists' responses served as consensus references (a minimal sketch of this scoring scheme appears after the Conclusions below). The LLM's accuracy was analyzed for both general and specific toxicity categories, distinguishing error severity and false alarms. The results are intended to justify further research involving real patients.

Results: Oncologists' evaluations varied considerably, attributable in part to the lack of interaction with the fictitious patients. Against the mean evaluations, the LLM achieved 85.7% accuracy in general categories and 64.6% in specific categories; of its errors, 96.4% were mild and 3.6% severe. False alarms occurred in 3% of cases. For comparison, individual oncologists' accuracy ranged from 66.7% to 89.2% in general categories and from 57.0% to 76.0% in specific categories, and the 95% confidence intervals for the median oncologist accuracy were 81.9% to 86.9% (general) and 67.6% to 75.6% (specific); an illustrative computation of such an interval also follows the abstract. These benchmarks highlight the LLM's potential to achieve expert-level performance in classifying chemotherapy-induced toxicities.

Discussion: The findings demonstrate that LLMs can classify subjective toxicities from chemotherapy with accuracy comparable to expert oncologists. While the model's performance in general categories falls within the expert range, its accuracy in specific categories requires improvement. Limitations include the use of fictitious cases, the lack of patient interaction, and reliance on audio transcriptions. Nevertheless, LLMs show significant potential for enhancing patient monitoring and reducing oncologists' workload. Future research should focus on task-specific training of LLMs for medical work, studies with real patients, interactive evaluations, larger samples, and robustness and generalization across diverse clinical settings.

Conclusions: This study concludes that LLMs can classify subjective toxicities from chemotherapy with accuracy comparable to expert oncologists. The LLM's performance in general toxicity categories lies within the expert range, with room for improvement in specific categories. LLMs have the potential to enhance patient monitoring, enable early interventions, and reduce severe complications, improving the quality and efficiency of care. Future research should involve task-specific training of LLMs, validation with real patients, and the incorporation of interactive capabilities for real-time patient interaction. Ethical considerations, including data accuracy, transparency, and privacy, are crucial for the safe integration of LLMs into clinical practice.
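The Methods compress the scoring protocol into a sentence; the following Python sketch makes the consensus-and-error logic concrete. It is a hypothetical illustration, not the study's published code: the integer grade scale, the rounding of the mean reference, and the rule separating mild from severe errors (off by one grade versus more) are all assumptions.

```python
from statistics import mean, mode

# Hypothetical sketch of the consensus scoring described in Methods.
# CTCAE v.5 grades are taken as integers (0 = toxicity absent, 1-4 =
# increasing severity); the mild/severe split by grade distance is an
# assumption, not the study's published rule.

def consensus(oncologist_grades: list[int]) -> tuple[int, float]:
    """Mode and mean of the 13 oncologists' grades for one case/toxicity."""
    return mode(oncologist_grades), mean(oncologist_grades)

def score_llm(llm_grade: int, mean_reference: float) -> str:
    """Compare one LLM grade against the (rounded) mean reference."""
    ref = round(mean_reference)
    if ref == 0 and llm_grade > 0:
        return "false_alarm"  # toxicity reported where the consensus sees none
    diff = abs(llm_grade - ref)
    if diff == 0:
        return "correct"
    return "mild_error" if diff == 1 else "severe_error"

# Example: one fictitious case, one specific toxicity (illustrative values).
grades = [1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1]   # 13 oncologists
print(consensus(grades))             # (1, 1.23...)
print(score_llm(2, mean(grades)))    # 'mild_error' under the assumed rule
```

Accuracy in general versus specific categories would then be the share of "correct" outcomes tallied at each level of granularity.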
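Similarly, the Results cite 95% confidence intervals for the median oncologist accuracy without naming the estimator. A percentile bootstrap is one plausible way to obtain such an interval from 13 per-oncologist accuracies; the sketch below, with invented accuracy values, shows that computation only and does not reproduce the study's statistics.

```python
import random
from statistics import median

# Hypothetical percentile-bootstrap CI for the median of per-oncologist
# accuracies; the method and the numbers below are assumptions for
# illustration, not the study's reported procedure or data.

def bootstrap_median_ci(accs, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    medians = sorted(
        median(rng.choices(accs, k=len(accs))) for _ in range(n_boot)
    )
    lo = medians[int(n_boot * alpha / 2)]
    hi = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# 13 invented general-category accuracies spanning the reported 66.7-89.2% range
general = [0.667, 0.74, 0.78, 0.80, 0.82, 0.83, 0.84,
           0.85, 0.85, 0.86, 0.87, 0.88, 0.892]
print(bootstrap_median_ci(general))  # an interval around the sample median
```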