Background: In recent years, microarray datasets have been used to store information about human genes and their expression, with the aim of diagnosing cancer at an early stage. However, most microarray datasets contain thousands of redundant, irrelevant, and noisy genes, which makes it difficult to apply machine learning algorithms effectively to such high-dimensional data. Methods: To address this challenge, this paper proposes a hybrid filter and differential evolution-based feature selection approach that keeps only the most influential genes (features) of high-dimensional microarray datasets in order to improve cancer diagnosis and classification. The approach is a two-phase hybrid feature selection model: the first phase selects the top-ranked features using several popular filter feature selection methods, and the second phase applies differential evolution (DE) optimization to those features to identify the best-performing subset. Several popular machine learning algorithms are then trained on the final training microarray datasets, which contain only the selected features, to classify cancer. Four high-dimensional cancer microarray datasets were used to evaluate the proposed method: Breast, Lung, Central Nervous System (CNS), and Brain. Results: The experimental results show that, relative to the filter methods alone, the classification accuracy of the proposed hybrid filter-DE approach increased to 100%, 100%, 93%, and 98% on the Brain, CNS, Breast, and Lung datasets, respectively. Furthermore, the DE-based feature selection removed around 50% of the features selected by the filter methods on these four datasets.
The average accuracy improvements achieved by the proposed methods were up to 42.47%, 57.45%, 16.28%, and 43.57% on the Brain, CNS, Lung, and Breast datasets, respectively, compared with 41.43%, 53.66%, 17.53%, and 61.70% reported in previous works. Conclusions: Compared to previous works, the proposed methods achieved larger improvements on the Brain and CNS datasets, a comparable improvement on the Lung dataset, and a smaller improvement on the Breast dataset.
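The two-phase idea described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the specific filter methods, the DE variant, or the classifiers, so everything below is an assumption chosen for brevity. The sketch uses a Fisher-style score as a stand-in filter, a nearest-centroid classifier as a stand-in learner, the common DE/rand/1/bin scheme with a 0.5 threshold to turn real-valued vectors into feature masks, and synthetic data in place of a microarray dataset. The function names and all parameter values (`top_k`, `pop_size`, `F`, `CR`, the 0.01 size penalty) are hypothetical.

```python
import numpy as np

def fisher_scores(X, y):
    """Filter phase (assumed): between-class over within-class spread per feature."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return num / den

def centroid_accuracy(X_tr, y_tr, X_va, y_va):
    """Stand-in classifier: nearest class centroid, scored on a validation split."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_va - c0, axis=1)
    d1 = np.linalg.norm(X_va - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_va).mean())

def hybrid_filter_de(X_tr, y_tr, X_va, y_va, top_k=20, pop_size=20,
                     generations=30, F=0.5, CR=0.9, seed=0):
    rng = np.random.default_rng(seed)
    # Phase 1: keep the top_k features ranked by the filter score.
    ranked = np.argsort(fisher_scores(X_tr, y_tr))[::-1][:top_k]

    def fitness(vec):
        mask = vec > 0.5                      # real-valued vector -> feature mask
        if not mask.any():
            return 0.0
        cols = ranked[mask]
        acc = centroid_accuracy(X_tr[:, cols], y_tr, X_va[:, cols], y_va)
        return acc - 0.01 * mask.mean()       # small penalty favours fewer features

    # Phase 2: DE/rand/1/bin over the filtered features.
    pop = rng.random((pop_size, top_k))
    fit = np.array([fitness(p) for p in pop])
    for _ in range(generations):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, size=3, replace=False)]
            mutant = np.clip(a + F * (b - c), 0.0, 1.0)
            cross = rng.random(top_k) < CR
            cross[rng.integers(top_k)] = True  # guarantee at least one mutant gene
            trial = np.where(cross, mutant, pop[i])
            f_trial = fitness(trial)
            if f_trial >= fit[i]:              # greedy one-to-one replacement
                pop[i], fit[i] = trial, f_trial
    best = pop[np.argmax(fit)]
    return ranked[best > 0.5], ranked

# Synthetic stand-in for a microarray dataset: 80 samples, 500 "genes",
# of which only the first 5 carry class signal.
data_rng = np.random.default_rng(42)
n_samples, n_genes = 80, 500
y = np.arange(n_samples) % 2
X = data_rng.normal(size=(n_samples, n_genes))
X[:, :5] += 2.0 * y[:, None]

selected, ranked = hybrid_filter_de(X[:60], y[:60], X[60:], y[60:])
print(f"filter kept {len(ranked)} features, DE kept {len(selected)}")
```

The size penalty in the fitness function is one simple way to push DE toward smaller subsets. In a real study the fitness would more plausibly use cross-validated accuracy of the chosen classifier, and the final evaluation would use data not seen during selection.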