Background: Colorectal cancer (CRC) remains a major cause of cancer-related mortality worldwide, underscoring the need for more efficient and resource-conscious screening strategies.

Methods: We screened 51,437 individuals aged 50–74 years in South-West Oltenia, Romania, using the fecal immunochemical test (FIT) with a positivity threshold of ≥20 µg Hb/g. Of the 2,825 FIT-positive individuals, 1,550 completed colonoscopy; for these we recorded age, sex, residence, education, comorbidities, medications, and FIT values. Missing data (<8%) were handled by multiple imputation. We then reduced dimensionality with an autoencoder (ReLU, dropout 0.5, L2 regularization, 100 epochs, batch size 32) and applied K-Means clustering (k = 5). Examples of actionable clusters include Cluster 0 (“High-FIT malignant”): FIT > 200 µg/g, age > 65, diabetes; Cluster 2 (“Low-risk mixed”): FIT 100–199 µg/g, age < 60, no comorbidities; and Cluster 3 (“Intermediate-risk older”): FIT 150–200 µg/g, ≥3 comorbidities, rural residence. Cluster labels were then predicted by a feed-forward neural network (layers of 64 and 32 neurons, dropout 0.6) and validated via 5-fold cross-validation plus a temporal hold-out.

Results: Five distinct patient clusters were identified, enabling the development of a composite risk score. Notably, Cluster 0, characterized by elevated FIT levels, exhibited a malignancy rate of 50.91%, whereas the overall CRC diagnostic rate among patients who underwent colonoscopy was approximately 13.87%. This stratification model enhances diagnostic yield by prioritizing high-risk patients for urgent colonoscopy and sparing low-risk individuals from unnecessary invasive procedures.

Conclusions: The AI-driven composite risk score offers a refined framework for CRC risk stratification and optimized resource allocation. Its implementation can lead to earlier detection of advanced lesions, thereby improving patient outcomes. Further external validation in independent cohorts and regions is essential to confirm its broad utility, and future integration of additional biomarkers (e.g., genetic or omics-based indicators) may further enhance predictive accuracy.
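
The Results compare a per-cluster malignancy rate with the overall CRC rate among patients who completed colonoscopy. The sketch below illustrates that comparison only; it is not the authors' implementation. The function names `cluster_yield` and `rank_clusters`, the term "lift" (cluster rate divided by overall rate), and the toy data are assumptions introduced for illustration. Only the two percentages, 50.91% and 13.87%, come from the abstract.

```python
from collections import defaultdict


def cluster_yield(labels, crc_found):
    """Per-cluster colonoscopy yield: {cluster: (n_patients, n_crc, crc_rate)}."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [patients, CRC cases]
    for cluster, has_crc in zip(labels, crc_found):
        counts[cluster][0] += 1
        counts[cluster][1] += int(has_crc)
    return {c: (n, k, k / n) for c, (n, k) in counts.items()}


def rank_clusters(labels, crc_found):
    """Return (overall_rate, [(cluster, rate, lift), ...]), highest rate first."""
    if not labels or len(labels) != len(crc_found):
        raise ValueError("labels and crc_found must be non-empty and equal in length")
    overall = sum(int(y) for y in crc_found) / len(crc_found)
    ranked = []
    for cluster, (_, _, rate) in cluster_yield(labels, crc_found).items():
        # "lift" (hypothetical name): how many times the overall rate this cluster reaches
        lift = rate / overall if overall > 0 else float("nan")
        ranked.append((cluster, rate, lift))
    ranked.sort(key=lambda row: row[1], reverse=True)
    return overall, ranked


# Figures reported in the abstract (percent).
CLUSTER0_RATE = 50.91
OVERALL_RATE = 13.87
REPORTED_LIFT = CLUSTER0_RATE / OVERALL_RATE  # roughly 3.7 times the overall rate

# Synthetic toy data, NOT study data: cluster label and CRC finding per patient.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_crc = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

overall, ranked = rank_clusters(toy_labels, toy_crc)
for cluster, rate, lift in ranked:
    print(f"cluster {cluster}: CRC rate {rate:.1%}, lift {lift:.2f}x (overall {overall:.1%})")
```

Ranking clusters by observed rate is one simple way to decide which group to schedule first for colonoscopy. The composite risk score described in the abstract may weight features differently.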