Background/Objectives: Gliomas are complex and heterogeneous brain tumors characterized by an unfavorable clinical course and a fatal prognosis, which can be improved by early determination of the tumor type. Here, we developed explainable machine learning (ML) models for classifying three major glioma subtypes (astrocytoma, oligodendroglioma, and glioblastoma) and predicting survival rates based on RNA-seq data.

Methods: We analyzed publicly available datasets and applied feature selection techniques to identify key biomarkers. Using various ML models, we performed classification and survival analysis to develop robust predictive models. The best-performing models were then interpreted using Shapley additive explanations (SHAP).

Results: Thirteen key genes (TERT, NOX4, MMP9, TRIM67, ZDHHC18, HDAC1, TUBB6, ADM, NOG, CHEK2, KCNJ11, KCNIP2, and VEGFA) proved to be closely associated with glioma subtypes as well as survival. The Support Vector Machine (SVM) was the best classification model, with a balanced accuracy of 0.816 and an area under the receiver operating characteristic curve (AUC) of 0.896 on the test datasets. The Case-Control Cox regression model (CoxCC) performed best for predicting survival, with Harrell’s C-index values of 0.809 and 0.8 on the test datasets. Using SHAP, we revealed how gene expression influences the outputs of both models, making the prediction generation process more transparent.

Conclusions: The results indicate that the developed models could serve as a valuable practical tool for clinicians, assisting them in diagnosing glioma and determining optimal treatment strategies for patients.
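For readers less familiar with the two headline metrics, the following minimal sketch shows how balanced accuracy and Harrell's C-index can be computed. It is not the authors' code: the function names, the toy labels, and the toy survival data are all invented for illustration, and pairs with tied survival times are simply skipped, which is a simplifying assumption.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every subtype counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def harrell_c_index(times, events, risks):
    """Fraction of comparable patient pairs that the risk score orders correctly.

    A pair (i, j) is comparable when patient i had an observed event
    and a strictly shorter follow-up time than patient j.
    Equal risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:  # censored patients cannot anchor a pair
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical, not taken from the study).
y_true = ["astro", "astro", "oligo", "oligo", "gbm", "gbm", "gbm", "gbm"]
y_pred = ["astro", "oligo", "oligo", "oligo", "gbm", "gbm", "gbm", "astro"]
print(balanced_accuracy(y_true, y_pred))  # 0.75

times = [5, 10, 12, 20]        # follow-up time
events = [1, 1, 0, 1]          # 1 = death observed, 0 = censored
risks = [0.9, 0.4, 0.6, 0.1]   # higher = predicted worse outcome
print(harrell_c_index(times, events, risks))  # 0.8
```

In both cases 0.5 corresponds to chance-level performance (for balanced accuracy, only in a two-class setting) and 1.0 to perfect performance, which gives a scale for reading the reported 0.816 and 0.809.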
Explainable Machine Learning Models for Glioma Subtype Classification and Survival Prediction