High-accuracy Splice Site Prediction Based on Statistical Difference Table and Weighted Voting

Affiliation:

1. Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha 410128, China; 2. Orient Science & Technology College, Hunan Agricultural University, Changsha 410128, China

Fund Project:

This work was supported by grants from the National Natural Science Foundation of China (61701177), the Hunan Provincial Natural Science Foundation of China (2018JJ3225), and the Scientific Research Project of Hunan Province Education Office (17A096).

    Abstract:

    High-accuracy splice site recognition based on machine learning is key to eukaryotic genome annotation. In this paper, we used the chi-square test to determine the sequence window length, constructed a chi-square statistical difference table to extract positional features, and combined these features with dinucleotide frequencies to characterize sequences. Because the positive and negative samples of splice sites are highly imbalanced, we built 10 support vector machine (SVM) classifiers, each trained on a balanced set of positive and negative samples, and combined them by weighted voting, which effectively addressed the imbalanced pattern classification problem. Independent tests on the HS3D dataset showed prediction accuracies of 93.39% for donor sites and 90.46% for acceptor sites, clearly higher than those of the compared methods. Positional features based on the chi-square statistical difference table characterize DNA sequences effectively and have application prospects in signal site recognition of molecular sequences.
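The abstract does not give the exact definition of the statistical difference table or of the voting weights, so the following is only a minimal sketch under stated assumptions. It assumes one 2x2 chi-square statistic per (position, base) cell, signed by the class in which the base is enriched, and voting weights that come from something like each classifier's validation accuracy. All function names (`chi_square_table`, `encode`, `balanced_subsets`, `weighted_vote`) are hypothetical, and SVM training is left out; any classifier, for example scikit-learn's `SVC`, could be trained on each balanced subset.

```python
import random
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]  # AA, AC, ..., TT


def position_counts(seqs):
    """Count each base at each position of equal-length sequences."""
    return [Counter(s[i] for s in seqs) for i in range(len(seqs[0]))]


def chi_square_table(pos_seqs, neg_seqs):
    """Signed 2x2 chi-square statistic for every (position, base) cell.

    Assumption: rows are the classes (positive / negative) and columns are
    base present / base absent at that position. The sign is positive when
    the base is at least as frequent in the positive class.
    """
    n_pos, n_neg = len(pos_seqs), len(neg_seqs)
    n = n_pos + n_neg
    pc, nc = position_counts(pos_seqs), position_counts(neg_seqs)
    table = []
    for i in range(len(pc)):
        row = {}
        for base in BASES:
            a, c = pc[i][base], nc[i][base]   # base present: pos / neg
            b, d = n_pos - a, n_neg - c       # base absent:  pos / neg
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
            row[base] = chi2 if a / n_pos >= c / n_neg else -chi2
        table.append(row)
    return table


def encode(seq, table):
    """Positional table look-ups followed by 16 dinucleotide counts."""
    positional = [table[i][base] for i, base in enumerate(seq)]
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return positional + [pairs[d] for d in DINUCS]


def balanced_subsets(pos, neg, k=10, seed=0):
    """Pair all positives with k random, equally sized negative subsets.

    Requires at least as many negatives as positives.
    """
    rng = random.Random(seed)
    return [(pos, rng.sample(neg, len(pos))) for _ in range(k)]


def weighted_vote(predictions, weights):
    """Combine +1/-1 predictions; weights are assumed per-classifier scores."""
    score = sum(w * p for p, w in zip(predictions, weights))
    return 1 if score >= 0 else -1
```

In use, each of the 10 subsets returned by `balanced_subsets` would train one classifier on `encode`d sequences, and `weighted_vote` would merge their labels for a test sequence. Undersampling the negatives keeps every classifier balanced while the ensemble as a whole still sees many different negative samples.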

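Cite this article

Zeng Ying, Chen Yuan, Yuan Zheming. High-accuracy Splice Site Prediction Based on Statistical Difference Table and Weighted Voting[J]. Progress in Biochemistry and Biophysics, 2019, 46(5): 496-503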

History
  • Received: 2018-10-15
  • Last revised: 2019-03-21
  • Accepted: 2019-03-25
  • Published online: 2019-05-22
  • Publication date: 2019-05-20