Background: Lung cancer is a leading cause of cancer-related mortality, with disparities in incidence and outcomes observed across racial and sex groups. Identifying both patient-specific and cohort-specific disparity biomarkers is critical for developing targeted treatments. The lung cancer dataset is highly imbalanced across races, so classification based on race yields biased disparity information. Method: This study developed TILDA-X, an explainable artificial intelligence (AI)-based framework that builds classification models on disease conditions rather than races to mitigate racial imbalance in the dataset, and applies explainable AI to delineate patient-specific disparity information. A lung cancer transcriptome dataset with three disease conditions (lung adenocarcinoma, lung squamous cell carcinoma, and healthy samples) was used to develop the classification models. Using a bottom-up approach that starts from patient-specific disparity information, cohort-specific disparity information was derived for four racial and sex groups: African American males, European American males, African American females, and European American females. Results: Classification based on disease conditions achieved accuracies between 88% and 100% for minority groups (African American males and females), whereas race-based classification achieved only 0% to 16%, underscoring the significance of the proposed approach. Functional analysis of sub-cohort-specific biomarker genes revealed unique pathways associated with lung cancers in different races and sexes. Among the significant pathways identified, approximately 63% overlapped with those reported in previous lung cancer-related studies, supporting the biological validity of our findings.
Conclusion: By combining disease-condition-based classification with explainable AI, this study provides a robust, interpretable framework for characterizing race- and sex-specific disparities in lung cancer, offering a foundation for precision oncology and equitable therapeutic development based on transcriptome profiles alone.
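The subgroup evaluation described in the abstract (a classifier that predicts disease condition, with accuracy then reported separately for each race and sex group) can be illustrated with a minimal sketch. This is not the TILDA-X implementation: the function name `subgroup_accuracy`, the record layout, the abbreviations (`AA`, `EA`, `LUAD`, `LUSC`), and the toy predictions are all assumptions introduced here for illustration, and none of the values come from the study.

```python
from collections import defaultdict


def subgroup_accuracy(records):
    """Accuracy of disease-condition predictions per (race, sex) subgroup.

    Each record is a hypothetical tuple:
    (race, sex, true_condition, predicted_condition).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for race, sex, truth, pred in records:
        key = (race, sex)
        total[key] += 1
        correct[key] += int(truth == pred)
    return {key: correct[key] / total[key] for key in total}


# Toy predictions invented for illustration only (not data from the study).
# The model predicts disease condition; race and sex are used only to
# group the results afterwards, not as classification targets.
records = [
    ("AA", "M", "LUAD", "LUAD"),
    ("AA", "M", "LUSC", "LUSC"),
    ("AA", "F", "Healthy", "Healthy"),
    ("AA", "F", "LUAD", "LUSC"),
    ("EA", "M", "LUSC", "LUSC"),
    ("EA", "F", "LUAD", "LUAD"),
]

acc = subgroup_accuracy(records)
for key in sorted(acc):
    print(key, acc[key])
```

Because the labels being predicted are disease conditions, a small subgroup still contributes to every class the model learns, and its accuracy can be read off afterwards by grouping on the demographic fields.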
TILDA-X: Transcriptome-Informed Lung Cancer Disparities via Explainable AI