Background/Objectives: Recent growth in the number and applications of high-throughput “omics” technologies has created a need for better methods to integrate multiomics data. Much progress has been made in developing unsupervised methods, but supervised methods have lagged behind. Methods: Here we present PLASMA, the first algorithm that can learn to predict time-to-event outcomes from multiomics data sets, even when some samples have been assayed on only a subset of the omics data sets. PLASMA uses two layers of existing partial least squares algorithms, first to select components that covary with the outcome and then to construct a joint Cox proportional hazards model. Results: We apply PLASMA to the stomach adenocarcinoma (STAD) data from The Cancer Genome Atlas. We validate the model both by splitting the STAD data into training and test sets and by applying it to the subset of the esophageal cancer (ESCA) data containing adenocarcinomas. We use the other half of the ESCA data, which contains squamous cell carcinomas dissimilar to STAD, as a negative comparison. Our model successfully separates both the STAD test set (p = 2.73 × 10⁻⁸) and the independent ESCA adenocarcinoma data (p = 0.025) into high-risk and low-risk patients. It does not separate the negative comparison data set (ESCA squamous cell carcinomas, p = 0.57). The performance of the unified multiomics model is superior to that of individually trained models and is also superior to that of an unsupervised method (Multi-Omics Factor Analysis; MOFA), which finds latent factors to be used as putative predictors in a post hoc survival analysis. Conclusions: Many of the factors that contribute strongly to the PLASMA model can be justified from the biological literature.
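The validation step reported above, in which predicted risk separates patients into high-risk and low-risk groups with an associated p-value, can be illustrated with a minimal sketch. The abstract does not state how the groups are formed or which test produces the p-values, so this sketch assumes a median split on a risk score and a two-group log-rank test. The function names `split_by_risk` and `logrank_p` are hypothetical, and the toy data are invented for illustration and are not taken from TCGA.

```python
import math
from statistics import median


def split_by_risk(scores):
    """Label each sample True (high risk) if its score exceeds the median.

    Assumption: a median cutoff; the abstract does not specify the rule.
    """
    cutoff = median(scores)
    return [s > cutoff for s in scores]


def logrank_p(times, events, high_risk):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times     : follow-up time per sample
    events    : 1 if the event was observed, 0 if censored
    high_risk : True for samples in the high-risk group
    """
    observed_minus_expected = 0.0
    variance = 0.0
    # Visit each distinct time at which at least one event occurred.
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        # Samples still at risk at time t, overall and in the high-risk group.
        n = sum(1 for ti in times if ti >= t)
        n_high = sum(1 for ti, g in zip(times, high_risk) if ti >= t and g)
        # Events at exactly time t, overall and in the high-risk group.
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d_high = sum(
            1 for ti, e, g in zip(times, events, high_risk)
            if ti == t and e and g
        )
        observed_minus_expected += d_high - d * n_high / n
        if n > 1:
            frac = n_high / n
            variance += d * frac * (1 - frac) * (n - d) / (n - 1)
    if variance == 0:
        # No information to separate the groups (e.g. one group is empty).
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Toy data (invented): the four highest scores have the shortest survival.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
times = [20, 22, 25, 30, 2, 3, 4, 5]
events = [0, 1, 0, 1, 1, 1, 1, 1]

groups = split_by_risk(scores)
chi2, p = logrank_p(times, events, groups)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

In a PLASMA-style workflow the scores would come from the fitted Cox model's linear predictor on held-out samples. Here they are supplied directly so that the sketch stays self-contained.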