
Article:

Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis

Original publication date: 8 October 2024

DOI: 10.3390/cancers16193417

Type: Article

Open access: Yes

Abstract:

Background/Objectives: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance. Methods: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (SEER Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison. Results: The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes. Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
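
For readers unfamiliar with SMOTEENN, the hybrid method the abstract reports as best performing, the following is a minimal illustrative sketch of the idea: SMOTE first creates synthetic minority samples by interpolating between minority-class neighbours, then Edited Nearest Neighbours (ENN) removes samples whose label disagrees with their neighbours. This is not the authors' code, and this summary does not state which tooling they used. The function names (`nearest_indices`, `smote`, `edited_nearest_neighbours`, `smoteenn`), the majority-vote ENN rule, the binary-label assumption, and the synthetic toy data are all assumptions made for illustration. In practice a library implementation, such as the `SMOTEENN` class in imbalanced-learn's `imblearn.combine` module, would usually be preferable.

```python
import math
import random
from collections import Counter


def nearest_indices(point, points, k):
    """Indices of the k entries of `points` closest to `point` (Euclidean)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(point, points[i]))
    return order[:k]


def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority samples until both classes are the same size.

    Assumes binary labels and at least two minority samples.
    """
    if rng is None:
        rng = random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_new = len(y) - 2 * len(minority_pts)  # majority count minus minority count
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        base = rng.choice(minority_pts)
        # Ask for k + 1 neighbours because the base point is nearest to itself.
        neighbours = nearest_indices(base, minority_pts, k + 1)[1:]
        other = minority_pts[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        X_out.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
        y_out.append(minority)
    return X_out, y_out


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbours."""
    keep = []
    for i, point in enumerate(X):
        neighbours = [j for j in nearest_indices(point, X, k + 1) if j != i][:k]
        vote = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if vote == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smoteenn(X, y, minority, k=3, seed=0):
    """Oversample with SMOTE, then clean both classes with ENN."""
    X_bal, y_bal = smote(X, y, minority, k, random.Random(seed))
    return edited_nearest_neighbours(X_bal, y_bal, k)


# Toy imbalanced data (synthetic, not from the study): 40 majority, 8 minority.
demo_rng = random.Random(42)
X = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(40)]
X += [(demo_rng.gauss(2, 1), demo_rng.gauss(2, 1)) for _ in range(8)]
y = [0] * 40 + [1] * 8

X_res, y_res = smoteenn(X, y, minority=1)
print("class counts before:", Counter(y))
print("class counts after: ", Counter(y_res))
```

Resampling of this kind is normally applied to the training split only, so that synthetic or removed samples do not leak into the evaluation data.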

Original article link:

Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis
