1. Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha, 410128, China; 2. Orient Science & Technology College, Hunan Agricultural University, Changsha, 410128, China
This work was supported by grants from the National Natural Science Foundation of China (61701177), the Hunan Provincial Natural Science Foundation of China (2018JJ3225) and the Scientific Research Project of Hunan Province Education Office (17A096).
High-accuracy splice site recognition based on machine learning is key to eukaryotic genome annotation. In this paper, we used the chi-square test to determine the sequence window size, constructed a chi-square statistical difference table to extract positional features, and combined these features with dinucleotide frequencies to characterize sequences. Because positive and negative splice site samples are extremely imbalanced, we built 10 SVM classifiers, each trained on equal proportions of positive and negative samples, and combined them by weighted voting, which effectively addressed the imbalanced classification problem. Independent tests on the HS3D dataset showed prediction accuracies of 93.39% for donor sites and 90.46% for acceptor sites, higher than those of the compared methods. The positional features based on the chi-square statistical difference table can effectively characterize DNA sequences and have application prospects in signal site recognition for molecular sequences.
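The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-position chi-square statistic over a 2 x 4 (class x base) contingency table is one plausible reading of the positional scoring, random undersampling of negatives is assumed for the balanced subsets, and all function names (`chi_square_per_position`, `dinucleotide_freqs`, `balanced_subsets`, `weighted_vote`) are hypothetical. The SVM training step itself is omitted; any classifier that outputs +1/-1 labels could supply the votes.

```python
import random
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]  # 16 dinucleotides


def position_counts(seqs):
    """Per-position base counts for a list of equal-length sequences."""
    length = len(seqs[0])
    return [Counter(s[i] for s in seqs) for i in range(length)]


def chi_square_per_position(pos_seqs, neg_seqs):
    """Chi-square statistic of the class-by-base table at each position.

    Assumption: positions with large values differ most between true and
    false sites, so they could guide the choice of window size.
    """
    pos_c, neg_c = position_counts(pos_seqs), position_counts(neg_seqs)
    n_pos, n_neg = len(pos_seqs), len(neg_seqs)
    total = n_pos + n_neg
    stats = []
    for pc, nc in zip(pos_c, neg_c):
        chi2 = 0.0
        for b in BASES:
            col = pc[b] + nc[b]
            if col == 0:
                continue  # base absent at this position in both classes
            for observed, n_row in ((pc[b], n_pos), (nc[b], n_neg)):
                expected = n_row * col / total
                chi2 += (observed - expected) ** 2 / expected
        stats.append(chi2)
    return stats


def dinucleotide_freqs(seq):
    """Relative frequency of each of the 16 dinucleotides in seq."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n = max(len(seq) - 1, 1)
    return [counts[d] / n for d in DINUCS]


def balanced_subsets(pos, neg, n_subsets=10, seed=0):
    """Pair all positives with an equal-sized random draw of negatives.

    Assumption: negatives are undersampled at random; the abstract does
    not state how the 10 balanced training sets were formed.
    """
    rng = random.Random(seed)
    return [(pos, rng.sample(neg, len(pos))) for _ in range(n_subsets)]


def weighted_vote(votes, weights):
    """Combine +1/-1 votes; weights could be, e.g., validation accuracies."""
    score = sum(w * v for v, w in zip(votes, weights))
    return 1 if score >= 0 else -1
```

In use, each of the 10 balanced subsets would train one classifier on the concatenated positional and dinucleotide features, and `weighted_vote` would merge their predictions for a candidate site.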
ZENG Ying, CHEN Yuan, YUAN Zhe-Ming. High-accuracy Splice Site Prediction Based on Statistical Difference Table and Weighted Voting[J]. Progress in Biochemistry and Biophysics, 2019, 46(5): 496-503.
