Background/Objectives: Large language models (LLMs) have been proposed as a means of converting unstructured electronic medical records (EMRs) into structured datasets. However, concerns about their reliability on non-English clinical text and their capacity to generate novel insights remain unresolved. We aimed to use an LLM to identify a hypothetical “Luminal B poor-prognosis” (LPP) breast cancer subgroup based on progesterone receptor (PR), the Ki-67 proliferation index, and grade, while concurrently validating the LLM’s accuracy.

Methods: We retrospectively compiled the EMRs of 7756 female breast cancer patients from five Moscow oncology centers. An LLM with a domain-engineered prompt extracted eight clinicopathological variables (Ki-67, estrogen receptor (ER)/PR Allred status, HER2 status, grade, relapse dates, and multiple primaries). Model accuracy was validated in 366 randomly sampled cases against oncologist annotations using the intraclass correlation coefficient (ICC) and weighted κ. Following data post-processing, the complete-case cohort (n = 2347) and the hormone receptor-positive (HR+)/HER2− stage I–III sub-cohort (n = 1419) were analyzed. Survival was estimated with the Kaplan–Meier method, compared with the log-rank test, and modeled with Cox regression (adjusted for age, stage, and treatment). Ki-67 was modeled continuously; prespecified LPP definitions were compared.

Results: LLM–human agreement was high (Ki-67 ICC = 0.882; grade κ = 0.887; ER κ = 0.997; PR κ = 0.975; HER2 κ = 0.935). Date extraction had a high proportion of missing data. In HR+/HER2− stage I–III disease, ER < 5 was non-prognostic; however, PR < 4 and Ki-67 ≥ 40% were each associated with inferior survival (hazard ratios 2.25 and 1.85, respectively). The most effective LPP definition (PR < 4 and Ki-67 ≥ 40%) identified a subgroup (~5.3% of patients) with markedly poorer outcomes than the Luminal B (HER2−) subgroup (hazard ratio 2.60, 95% CI 1.53–4.43, adjusted for age, stage, and treatment).

Conclusions: The LLM reliably structured non-English EMRs and enabled the discovery of clinically meaningful subgroups. The LPP phenotype defines a small, high-risk subset warranting external validation. Given the retrospective, single-system design of the study, the phenotype should be interpreted as hypothesis-generating rather than as definitive evidence.
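The LPP rule reported in the Results (PR < 4 and Ki-67 ≥ 40%) can be illustrated with a minimal sketch. It assumes that PR is stored as an Allred score and Ki-67 as a percentage; the `Record` class, its field names, and the `is_lpp` helper are hypothetical and are not taken from the study's pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """One structured row; field names are hypothetical, not the study's schema."""
    pr_allred: Optional[int]        # PR Allred score (assumed 0-8), None if not extracted
    ki67_percent: Optional[float]   # Ki-67 index in percent, None if not extracted


def is_lpp(rec: Record, pr_cutoff: int = 4, ki67_cutoff: float = 40.0) -> Optional[bool]:
    """Apply the abstract's rule: PR < 4 and Ki-67 >= 40%.

    Returns None when either value is missing, so such rows can be
    dropped in a complete-case analysis instead of being misclassified.
    """
    if rec.pr_allred is None or rec.ki67_percent is None:
        return None
    return rec.pr_allred < pr_cutoff and rec.ki67_percent >= ki67_cutoff


# Toy rows for illustration only; these are not study data.
records = [Record(3, 45.0), Record(3, 39.9), Record(4, 60.0), Record(None, 50.0)]
flags = [is_lpp(r) for r in records]
print(flags)
```

The cutoffs are kept as parameters because the abstract states that several prespecified LPP definitions were compared; the defaults correspond only to the definition it reports as most effective.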


Original article:

Deriving Real-World Evidence from Non-English Electronic Medical Records in Hormone Receptor-Positive Breast Cancer Using Large Language Models



Original publication date: 29 November 2025

DOI: 10.3390/cancers17233836

Type: Article

Open access: Yes
