Background/Objectives: Lung cancer (LC) is the leading cause of cancer mortality, making early diagnosis essential. While LC screening trials are underway globally, optimal prediction models and inclusion criteria are still lacking. This study aimed to develop and evaluate Bayesian Network (BN) models for LC risk prediction using a decade of data from Denmark. The primary goal was to assess BN performance on datasets varying in size and completeness, simulate real-world screening scenarios, and identify the most valuable data sources for LC screening.

Methods: The study included 38,944 patients evaluated for LC, of whom 11,284 (29%) were diagnosed. Data on comorbidities, medications, and general practice were available for the entire cohort, while laboratory results, smoking habits, and other variables were available only for subsets. The cohort was divided into four subsets based on data availability, and BNs were trained and validated across these subsets using cross-validation and external validation. To determine the optimal combination of variables, all possible data combinations were evaluated on the samples that contained all the variables (n = 5587).

Results: A model trained on the small, complete dataset (AUC 0.78) performed similarly on a larger dataset with 21% missing data (AUC 0.78). Performance dropped (AUC 0.67) when 39% of data were missing, a level at which informative variables were missing completely from the dataset. Laboratory results and smoking data were the most informative, significantly outperforming models based only on age and smoking status (AUC 0.70).

Conclusions: BN models demonstrated moderate to strong predictive performance, even with incomplete data, highlighting the potential value of incorporating laboratory results in LC screening programs.
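
The Methods part of the abstract states that all possible data combinations were evaluated on the samples containing every variable (n = 5587). As a rough illustration of that kind of exhaustive search, and not the authors' implementation, the Python sketch below enumerates every non-empty subset of a list of data-source groups and ranks the subsets by AUC. The group names in `DATA_SOURCES` and the `score_fn` callback, which stands in for a trained BN returning one risk score per patient, are assumptions made for illustration; the abstract does not say how the variables were grouped.

```python
from itertools import combinations

# Hypothetical data-source groups, loosely named after the sources the
# abstract mentions. The study's actual grouping is not given there.
DATA_SOURCES = [
    "comorbidities",
    "medications",
    "general_practice",
    "laboratory",
    "smoking",
]


def all_source_combinations(sources):
    """Return every non-empty subset of the given data sources."""
    subsets = []
    for size in range(1, len(sources) + 1):
        subsets.extend(combinations(sources, size))
    return subsets


def auc(labels, scores):
    """AUC as the probability that a random positive case outscores
    a random negative case, with ties counted as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def rank_combinations(sources, score_fn, labels):
    """Score every subset of sources and sort by AUC, best first.

    score_fn(subset) is a placeholder for a model restricted to the
    variables in `subset`; it must return one score per label.
    """
    results = [
        (auc(labels, score_fn(subset)), subset)
        for subset in all_source_combinations(sources)
    ]
    results.sort(key=lambda item: item[0], reverse=True)
    return results
```

With five groups the search covers 2^5 - 1 = 31 subsets, so the cost of this approach doubles with each added data source. The pairwise AUC loop is quadratic in cohort size and is written for clarity rather than speed.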

Original article:

A Bayesian Network Approach to Lung Cancer Screening: Assessing the Impact of Data Quantity, Quality, and the Combination of Data from Danish Electronic Health Records

Published: 28 November 2024

DOI: 10.3390/cancers16233989

Type: Article

Open access: Yes
