1) School of Computer Science, Shaanxi Normal University, Xi'an 710119, China; 2) Physical Evidence Evaluation Center of the Ministry of Public Security, Beijing 100038, China; 3) Key Laboratory of Phylogeny and Comparative Genomics of Jiangsu Province, Jiangsu Normal University, Xuzhou 221116, China; 4) School of Forensic Medicine, Shanxi Medical University, Taiyuan 030001, China
This work was supported by grants from the Key Research and Development Program of Shaanxi Province (2018SF-251), the National Natural Science Foundation of China (81772027), Open Projects of the National Engineering Laboratory (2018NELKFKT15), Open Projects of the Key Laboratory of Forensic Genetics of the Ministry of Public Security (2020FGKFKT01), and Major Projects of Universities in Jiangsu Province (17KJA180003).
Single nucleotide polymorphisms (SNPs) are commonly used genetic markers for individual identification and ancestry inference in forensic genetics. This study collected ancestry informative SNPs (AISNPs) from the literature and public databases and applied three algorithms, softmax regression, support vector machine and random forest, to evaluate ancestry inference for the three major populations of northern East Asia: Northern Han Chinese, Japanese and Korean. We analyzed genotypes at 428 AISNPs in 103 Northern Han Chinese samples and 104 Japanese samples from the 1000 Genomes Project and 100 Korean samples from the Asian Diversity Project. Multiple linear regression collinearity diagnostics selected a panel of 67 highly informative AISNPs, which was used to build the softmax regression and support vector machine models; random forest mean decrease in accuracy selected a panel of 42 highly informative AISNPs, which was used to build the random forest model. In five repetitions of 10-fold cross-validation (training:testing = 9:1), the mean accuracies of the softmax regression, support vector machine and random forest models were 95.19%, 95.77% and 94.53%, respectively. All three models can be used for ancestry inference among the three major populations of northern East Asia. Because the 42-AISNP panel contains fewer loci, it is better suited to building a forensic assay and has high practical value.
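The evaluation protocol described above (five repetitions of 10-fold cross-validation with a 9:1 training-to-testing split) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the abstract does not say whether the folds were stratified by population, so stratification is an assumption here, and the names `repeated_stratified_kfold`, `mean_accuracy`, `predict_fold` and the population labels are hypothetical. Only the cohort sizes (103, 104 and 100) come from the abstract.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) index lists for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0
        for y in sorted(by_class):
            idx = by_class[y][:]
            rng.shuffle(idx)  # fresh shuffle in every repetition
            # Deal samples round-robin, continuing where the previous
            # class stopped, so fold sizes differ by at most one.
            for j, i in enumerate(idx):
                folds[(offset + j) % n_splits].append(i)
            offset += len(idx)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for f in range(n_splits) if f != k for i in folds[f])
            yield train, test


def mean_accuracy(labels, predict_fold, **cv_kwargs):
    """Average test accuracy over all folds.

    predict_fold(train_idx, test_idx) is a caller-supplied (hypothetical)
    function that fits a model on train_idx and returns one predicted
    label per entry of test_idx.
    """
    scores = []
    for train, test in repeated_stratified_kfold(labels, **cv_kwargs):
        predicted = predict_fold(train, test)
        correct = sum(p == labels[i] for p, i in zip(predicted, test))
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Cohort sizes taken from the abstract; the label strings are placeholders.
labels = ["NorthernHan"] * 103 + ["Japanese"] * 104 + ["Korean"] * 100
splits = list(repeated_stratified_kfold(labels))
print(len(splits))                                # 5 repetitions x 10 folds = 50
print(sorted({len(test) for _, test in splits}))  # [30, 31]
```

With 307 samples, each test fold holds 30 or 31 individuals, which corresponds to the stated 9:1 split. Any of the three classifiers could be plugged in through `predict_fold`; libraries such as scikit-learn offer an equivalent splitter (`RepeatedStratifiedKFold`) if a dependency is acceptable.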
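Wen Hao, Wei Yiliang, Guo Xiaoyuan, Sun Changchun, Xue Siyao, Liu Jing, Fan Hong, Jiang Li. Construction and performance evaluation of a high-resolution SNP inference model for three East Asian populations [J]. Progress in Biochemistry and Biophysics, 2021, 48(8): 973-981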