The properties of high-dimensional data spaces: implications for exploring gene and protein expression data

Original publication date: 2008-01-01

DOI: 10.1038/nrc2294

Type: Review Article

Open access: No

Key points:

  1. The application of several high-throughput genomic and proteomic technologies to address questions in cancer diagnosis, prognosis and prediction generates high-dimensional data sets.
  2. The multimodality of high-dimensional cancer data, for example, as a consequence of the heterogeneous and dynamic nature of cancer tissues, the concurrent expression of multiple biological processes and the diverse and often tissue-specific activities of single genes, can confound both simple mechanistic interpretations of cancer biology and the generation of complete or accurate gene signal transduction pathways or networks.
  3. The mathematical and statistical properties of high-dimensional data spaces are often poorly understood or inadequately considered. This can be particularly challenging in the common scenario where the number of data points obtained for each specimen greatly exceeds the number of specimens.
  4. Data are rarely randomly distributed in high dimensions and are highly correlated, often with spurious correlations.
  5. The distances between a data point and its nearest and farthest neighbours can become equidistant in high dimensions, potentially compromising the accuracy of some distance-based analysis tools.
  6. Owing to the 'curse of dimensionality' phenomenon and its negative impact on generalization performance, for example, estimation instability, model overfitting and local convergence, the large estimation error from complex statistical models can easily compromise the prediction advantage provided by their greater representation power. Conversely, simpler statistical models may produce more reproducible predictions but their predictions may not always be adequate.
  7. Some machine learning methods address the 'curse of dimensionality' in high-dimensional data analysis through feature selection and dimensionality reduction, leading to better data visualization and improved classification.
  8. It is important to ensure the generalization capability of classifiers derived by supervised learning methods from high-dimensional data before using them for cancer diagnosis, prognosis or prediction. Although this can be assessed initially through cross-validation methods, a more rigorous approach is needed: validating classifier performance using one or more blind validation data sets that were not used during supervised learning.
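
Key point 5 can be illustrated numerically. The following is a minimal sketch, not taken from the Review: it assumes points drawn uniformly at random from a unit hypercube (real expression data are not uniform, as key point 4 notes) and measures the relative contrast (d_max - d_min) / d_min between a query point's farthest and nearest neighbours. The function names `relative_contrast` and `contrast_by_dimension` are illustrative.

```python
import math
import random


def relative_contrast(n_points: int, dim: int, rng: random.Random) -> float:
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    query point to n_points points, all uniform in the unit hypercube."""
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


def contrast_by_dimension(dims, n_points=200, seed=0):
    """Map each dimension in dims to its relative contrast (assumed setup)."""
    rng = random.Random(seed)
    return {dim: relative_contrast(n_points, dim, rng) for dim in dims}


if __name__ == "__main__":
    for dim, contrast in contrast_by_dimension([2, 10, 100, 1000]).items():
        print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```

Under these assumptions the contrast should shrink as the dimension grows, which is the effect that can undermine nearest-neighbour searches and other distance-based tools.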

 

Abstract:

High-throughput genomic and proteomic technologies are widely used in cancer research to build better predictive models of diagnosis, prognosis and therapy, to identify and characterize key signalling networks and to find new targets for drug development. These technologies present investigators with the task of extracting meaningful statistical and biological information from high-dimensional data spaces, wherein each sample is defined by hundreds or thousands of measurements, usually concurrently obtained. The properties of high dimensionality are often poorly understood or overlooked in data modelling and analysis. From the perspective of translational science, this Review discusses the properties of high-dimensional data spaces that arise in genomic and proteomic studies and the challenges they can pose for data analysis and interpretation.
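
One reason that samples defined by hundreds or thousands of measurements are hard to analyse is that some measurements will correlate with any outcome purely by chance. The sketch below is an illustration under stated assumptions, not a method from the Review: every feature and the outcome are independent Gaussian noise, and the helper names `pearson` and `max_abs_noise_correlation` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_noise_correlation(n_samples, n_features, seed=0):
    """Largest |r| between a noise outcome and n_features noise features.

    No feature carries any real signal, so a large value here is a
    spurious correlation by construction.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    for p in (5, 50, 500, 5000):
        r = max_abs_noise_correlation(n_samples=20, n_features=p)
        print(f"features={p:5d}  max |r| with a noise outcome={r:.2f}")
```

With 20 samples, the strongest chance correlation tends to rise as more noise features are screened, which is why a feature selected from thousands of candidates needs checking on data that played no part in the selection.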

Original article link:

The properties of high-dimensional data spaces: implications for exploring gene and protein expression data
