Background: Improving prediction models for the timely detection of lung cancer is paramount. Our aim is to develop and validate prediction models for early detection of lung cancer in primary care, based on free-text consultation notes, that exploit the order and context among words and sentences.

Methods: Data of all patients enlisted in 49 general practices between 2002 and 2021 were assessed, and we included those older than 30 years with at least one free-text note. We developed two models using a hierarchical architecture that relies on attention and bidirectional long short-term memory networks. One model used only text, while the other combined text with clinical variables. The models were trained on data excluding the five months leading up to the diagnosis, using target replication and a tuning set, and were tested on a separate dataset for discrimination, positive predictive value (PPV), and calibration.

Results: A total of 250,021 patients were enlisted, of whom 1507 had a lung cancer diagnosis. The analysis included 183,012 patients, of whom 712 had the diagnosis. Of the two models, the combined model performed slightly better, achieving an AUROC of 0.91 on the test set, an AUPRC of 0.05, and a PPV of 0.034 (0.024, 0.043), with good calibration. To detect one cancer patient early, 29 high-risk patients would require additional diagnostic testing.

Conclusions: Our models showed excellent discrimination by leveraging word and sentence structure. Including clinical variables in addition to text slightly improved performance. The number needed to treat holds promise for clinical practice. External validation and an assessment of model suitability in clinical practice are warranted.
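The figure of 29 patients per detected cancer is consistent with taking the reciprocal of the reported PPV. The abstract does not state how the number was derived, so the short Python sketch below assumes the simple relation 1 / PPV; the function name `number_needed_to_test` is hypothetical and used only for illustration.

```python
def number_needed_to_test(ppv: float) -> float:
    """Expected number of flagged patients per true case, assuming 1 / PPV."""
    if not 0.0 < ppv <= 1.0:
        raise ValueError("PPV must lie in (0, 1]")
    return 1.0 / ppv


# Values taken from the abstract: point estimate and reported interval.
ppv_point = 0.034
ppv_low, ppv_high = 0.024, 0.043

nnt_point = number_needed_to_test(ppv_point)
# A higher PPV means fewer patients need testing, so the bounds swap.
nnt_fewest = number_needed_to_test(ppv_high)
nnt_most = number_needed_to_test(ppv_low)

print(f"point estimate: {nnt_point:.1f}")
print(f"range implied by the PPV interval: {nnt_fewest:.1f} to {nnt_most:.1f}")
```

Under this assumption the point estimate rounds to 29, and the reported PPV interval corresponds to roughly 23 to 42 patients tested per detected cancer.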

Original article:

An Order-Sensitive Hierarchical Neural Model for Early Lung Cancer Detection Using Dutch Primary Care Notes and Structured Data

Article:

An Order-Sensitive Hierarchical Neural Model for Early Lung Cancer Detection Using Dutch Primary Care Notes and Structured Data

Published: 29 March 2025

DOI: 10.3390/cancers17071151

Type: Article

Open access: Yes
