Pattern Recognition of the Lung Squamous Cell Carcinoma Tumor Progression Classification Model and Signature Genes Identification

Affiliation:

School of Chemical Engineering and Technology, Tianjin University, Tianjin, China; School of Chemical Engineering and Technology, Tianjin University, Tianjin, China; Department of Oncology, The First Affiliated Hospital of Dalian Medical University, Dalian, China; School of Chemical Engineering and Technology, Tianjin University, Tianjin, China; University of Texas Southwestern Medical Center, Dallas, Texas, USA

Fund Project:

This work was supported by a grant from the National Natural Science Foundation of China (31271351).

    Abstract:

    This study aims to identify signature genes associated with the onset and progression of lung squamous cell carcinoma (LUSC), providing a deeper theoretical basis for explaining its underlying mechanism and for developing new targeted drugs and treatments. Pattern recognition methods were used to analyze, for the first time at the genome-wide level, three types of data together: mRNA gene expression (GE), methylation (ME), and copy number variation (CNV). Genome-wide data are ultrahigh-dimensional, noisy, small in sample size, and highly correlated among genes, and the signal from a few dozen signature genes is easily swamped by the rest of the genome (information saturation), all of which degrades the performance of machine learning algorithms. To overcome these problems, a new iterative multiple-variable selection strategy that combines four selection methods was used to identify signature genes step by step. The importance of each gene was evaluated comprehensively by its differential significance using SAM (significance analysis of microarrays), statistical analysis using PLS (partial least squares), its known biological function, and its contribution to the classification model. From the LUSC stage Ⅰ~Ⅲ patient samples in the TCGA (The Cancer Genome Atlas project) database, 67 GE signature genes, 70 ME signature genes, and 31 CNV signature genes were identified. The corresponding mean accuracies for classifying the three stages, from 5-fold cross-validation, were 86.29%, 90.92%, and 69.16%, respectively. Pathway analysis and gene regulatory network analysis using KEGG (Kyoto Encyclopedia of Genes and Genomes) and IPA (Ingenuity Pathway Analysis) indicated that the three sets of signature genes are closely related at the regulatory level. They also indicated a direct relationship between the signature genes and the progression of LUSC, which is important for understanding the tumor mechanism and for developing new targeted therapies.
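    The iterative selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: SAM, PLS, and the biological-function filter are replaced here by a single F-like between/within variance score, the classifier is a simple nearest-centroid rule, and the data are synthetic. All function names (`make_data`, `f_score`, `cv_accuracy`, `iterative_selection`) and parameters are hypothetical. Because the stand-in score is univariate, the ranking does not change between rounds; with a multivariate model such as PLS, re-fitting on the reduced set is what makes each round informative.

```python
import random

def make_data(n=90, p=200, informative=(0, 1, 2, 3, 4), seed=0):
    """Synthetic stand-in for a samples x genes matrix with 3 stage labels (0, 1, 2)."""
    rng = random.Random(seed)
    y = [i % 3 for i in range(n)]
    X = [[rng.gauss(0, 1) + (2.0 * y[i] if j in informative else 0.0)
          for j in range(p)] for i in range(n)]
    return X, y

def f_score(column, labels):
    """Between-class over within-class sum of squares for one feature."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(label, []).append(value)
    overall = sum(column) / len(column)
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    between = sum(len(v) * (means[g] - overall) ** 2 for g, v in groups.items())
    within = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
    return between / (within + 1e-12)

def cv_accuracy(X, y, features, k=5):
    """k-fold cross-validated accuracy of a nearest-centroid classifier."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        centroids = {}
        for label in set(y):
            rows = [X[i] for i in train if y[i] == label]
            centroids[label] = {j: sum(r[j] for r in rows) / len(rows)
                                for j in features}
        for i in test:
            pred = min(centroids, key=lambda c: sum(
                (X[i][j] - centroids[c][j]) ** 2 for j in features))
            correct += pred == y[i]
    return correct / len(X)

def iterative_selection(X, y, target=10, keep_fraction=0.5):
    """Repeatedly keep the top-scoring fraction of features until `target` remain."""
    features = list(range(len(X[0])))
    history = []  # (number of features kept, cross-validated accuracy) per round
    while len(features) > target:
        scores = {j: f_score([row[j] for row in X], y) for j in features}
        n_keep = max(target, int(len(features) * keep_fraction))
        features = sorted(features, key=scores.get, reverse=True)[:n_keep]
        history.append((len(features), cv_accuracy(X, y, features)))
    return sorted(features), history

if __name__ == "__main__":
    X, y = make_data()
    selected, history = iterative_selection(X, y, target=5)
    print("selected features:", selected)
    for n_kept, acc in history:
        print(f"{n_kept:4d} features -> CV accuracy {acc:.3f}")
```

    Note that this sketch scores features on all samples before cross-validating, which gives optimistic accuracy estimates; a stricter design would repeat the selection inside each training fold.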

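Cite this article

Zhang Fei, Wang Shixiang, Wang Ling, Song Kai. Pattern Recognition of the Lung Squamous Cell Carcinoma Tumor Progression Classification Model and Signature Genes Identification [J]. Progress in Biochemistry and Biophysics, 2016, 43(1): 63-74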

History
  • Received: 2015-11-06
  • Last revised: 2015-11-29
  • Accepted: 2015-12-03
  • Published online: 2016-01-19
  • Publication date: 2016-01-20