Topic modeling is a widely used technique in machine learning and natural language processing in which a corpus of text documents is decomposed into themes, or topics, based on word frequencies. The approach has proven successful in several biological data analysis applications, such as predicting cancer subtypes with high accuracy and simultaneously identifying genes, enhancers, and stable cell types from sparse single-cell epigenomics data. The advantage of a topic model is that it not only serves as a clustering algorithm but also explains the clustering results, because each topic is a probability distribution over words. Our study proposes a novel topic modeling approach for clustering single cells and detecting topics (gene signatures) in single-cell datasets that measure multiple omics simultaneously. We applied this approach to examine the transcriptional heterogeneity of luminal and triple-negative breast cancer cells, using patient-derived xenograft models with acquired resistance to chemotherapy and targeted therapy. With this approach, we identified protein-coding genes and long non-coding RNAs (lncRNAs) that group thousands of cells into biologically similar clusters, accurately distinguishing drug-sensitive from drug-resistant breast cancer types. Compared with standard state-of-the-art clustering analyses, our approach provides an optimal partitioning of genes into topics and of cells into clusters simultaneously, producing easily interpretable clustering outcomes. Additionally, we demonstrate that an integrative clustering approach, which combines the information from mRNAs and lncRNAs treated as disjoint omics layers, enhances the accuracy of cell classification.
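To illustrate how a topic model can both cluster cells and explain the clusters, the following is a minimal sketch of generic latent Dirichlet allocation fitted by collapsed Gibbs sampling. It is not the multi-omics model proposed in this study. It treats each cell as a "document" and each gene as a "word". The count matrix, the gene names, the function name `fit_lda`, and the hyperparameters `alpha` and `beta` are all hypothetical choices made for illustration.

```python
import random

def fit_lda(counts, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a cells x genes count matrix.

    Returns (theta, phi): per-cell topic proportions and
    per-topic probability distributions over genes.
    """
    rng = random.Random(seed)
    n_cells, n_genes = len(counts), len(counts[0])
    # Expand counts into tokens: each counted transcript is one "word".
    tokens = [[g for g, c in enumerate(row) for _ in range(c)] for row in counts]

    cell_topic = [[0] * n_topics for _ in range(n_cells)]
    topic_gene = [[0] * n_genes for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, toks in enumerate(tokens):
        z_d = []
        for g in toks:
            k = rng.randrange(n_topics)
            z_d.append(k)
            cell_topic[d][k] += 1
            topic_gene[k][g] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(n_iter):
        for d, toks in enumerate(tokens):
            for i, g in enumerate(toks):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                cell_topic[d][k] -= 1
                topic_gene[k][g] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (cell_topic[d][t] + alpha)
                    * (topic_gene[t][g] + beta)
                    / (topic_total[t] + n_genes * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                cell_topic[d][k] += 1
                topic_gene[k][g] += 1
                topic_total[k] += 1

    theta = [
        [(cell_topic[d][t] + alpha) / (len(tokens[d]) + n_topics * alpha)
         for t in range(n_topics)]
        for d in range(n_cells)
    ]
    phi = [
        [(topic_gene[t][g] + beta) / (topic_total[t] + n_genes * beta)
         for g in range(n_genes)]
        for t in range(n_topics)
    ]
    return theta, phi

# Hypothetical toy data: 6 cells x 4 genes (made-up names and counts).
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
counts = [
    [9, 7, 0, 1],
    [8, 9, 1, 0],
    [7, 8, 0, 0],
    [0, 1, 9, 8],
    [1, 0, 7, 9],
    [0, 0, 8, 7],
]

n_topics = 2
theta, phi = fit_lda(counts, n_topics)

# Clustering: assign each cell to its dominant topic.
clusters = [max(range(n_topics), key=lambda t: row[t]) for row in theta]

# Interpretation: rank genes within each topic by probability.
for t, dist in enumerate(phi):
    ranked = sorted(zip(genes, dist), key=lambda pair: -pair[1])
    print(f"topic {t}:", [(name, round(p, 2)) for name, p in ranked])
print("cell clusters:", clusters)
```

In this sketch, `theta` supplies the cluster assignment (the dominant topic of each cell), while `phi` supplies the explanation (the genes that carry the most probability mass in each topic). An integrative analysis over several omics layers, as described above, would need a model that goes beyond this single-matrix example.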