Purpose: To assess how well various machine learning (ML) algorithms predict late-stage colorectal cancer (CRC) diagnoses against the backdrop of socioeconomic and regional healthcare disparities. Methods: An innovative theoretical framework was developed to integrate individual- and census tract-level social determinants of health (SDOH) with sociodemographic factors. The ML models were compared on key performance metrics, such as AUC-ROC, to evaluate their predictive accuracy. Spatio-temporal analysis was used to identify disparities in the probability of late-stage CRC diagnosis. Results: Gradient boosting was the best-performing model; the top predictors of late-stage CRC diagnosis were anatomic site, year of diagnosis, age, proximity to Superfund sites, and primary payer. Spatio-temporal clusters identified geographic areas with a statistically significant high probability of late-stage diagnosis, emphasizing the need for targeted healthcare interventions. Conclusions: This research underscores the potential of ML to enhance prognostic prediction in oncology, particularly in CRC. The gradient boosting model, with its robust performance, holds promise for deployment in healthcare systems to aid early detection and inform localized cancer prevention strategies. The study's methodology represents a significant step toward using AI in public health to mitigate disparities and improve cancer care outcomes.
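The Methods compare models by AUC-ROC. As a minimal sketch of what that metric measures, the Python below computes AUC-ROC as the probability that a randomly chosen late-stage case receives a higher predicted score than a randomly chosen early-stage case, then ranks models by it. The function names (`auc_roc`, `rank_models`), the labels, and the scores are all hypothetical toy values chosen for illustration; they are not the study's data, models, or results.

```python
from typing import Dict, List, Sequence, Tuple


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the chance that a random positive outranks a random negative.

    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def rank_models(
    labels: Sequence[int], model_scores: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (model name, AUC-ROC) pairs sorted from best to worst."""
    results = [(name, auc_roc(labels, scores)) for name, scores in model_scores.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 1 = late-stage diagnosis, 0 = early-stage diagnosis.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
# Made-up predicted probabilities for two illustrative models.
model_scores = {
    "gradient_boosting": [0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.6, 0.1],
    "logistic_regression": [0.6, 0.5, 0.7, 0.4, 0.3, 0.6, 0.8, 0.2],
}

ranking = rank_models(labels, model_scores)
for name, auc in ranking:
    print(f"{name}: AUC-ROC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; on registry-scale data a rank-based or library implementation would be the more practical choice.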