1) School of Computer Science, Shaanxi Normal University, Xi’an 710119, China; 2) Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Ministry of Public Security, Beijing 100038, China
This work was supported by grants from the National Key R&D Program of China (2022YFC3341004), the National Natural Science Foundation of China (82171870), the Key Project of the Natural Science Foundation of Shaanxi Province (2022ZJ-39), the Open Project of the Key Laboratory of Forensic Genetics (2023FGKFKT01), and the Fundamental Research Funds for the Institute of Forensic Science (2022JB020).
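Objective Inferring an individual’s biogeographical ancestry (BGA) from DNA has attracted wide attention in anthropology, forensic science, and related fields. The common approach uses a few dozen ancestry-informative single nucleotide polymorphism (SNP) loci and infers ancestry through principal component analysis (PCA), likelihood ratios (LR), and similar methods. With the development of high-throughput sequencing, high-density SNP datasets for population samples have become easy to obtain in bulk, and the introduction of machine learning from computer science has opened new directions in BGA research. This study aims to build a BGA inference model that suits high-density SNP data and has high accuracy and good generalization ability. Methods First, using data on 307 866 SNPs and the supervised learning model XGBoost, a PCA-XGBoost inference model based on multidimensional principal components (PCs) was constructed. Second, the inference results were evaluated and the model was optimized on the basis of the LR, which determined the optimal number of PCs and the number of training rounds. Finally, the model’s performance was further validated on test sets from other public data. Results With the LR-based evaluation method, the model reached a population prediction accuracy above 95% in the reference set and above 90% in the test set. Conclusion The PCA-XGBoost model predicts intercontinental populations with high accuracy, and the LR-based evaluation method helps further assess the reliability of the predictions. The model generalizes well, and replacing the population data in the reference set is expected to enable finer-grained population analysis.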
Objective The inference of biogeographical ancestry (BGA) from DNA is a significant focus in anthropology and forensic science. Current methods often use dozens of ancestry-informative single nucleotide polymorphisms (SNPs) and apply principal component analysis (PCA) and likelihood ratios (LR) to infer an individual’s ancestry. However, the selection of these SNPs tends to be population-specific and limits population differentiation. With the development of high-throughput sequencing technologies, high-density SNP datasets have become easier to acquire, which challenges traditional statistical models that often rely on prior assumptions and struggle with high-density genetic data. Machine learning, which prioritizes learning from data and algorithmic iteration over prior knowledge, has driven new developments in BGA research. This study aims to construct a BGA inference model suitable for high-density SNP data, with broad population applicability, higher accuracy, and strong generalization capability. Methods First, autosomal sites shared between the phase III data of the 1000 Genomes Project and commonly used commercial chips were selected, and a reference dataset was built after site quality control and filtering. This dataset was analyzed with PCA and ADMIXTURE to study population clustering, ancestral component mixing, and genetic substructure. Using spaces formed by different combinations of principal components (PCs), this study visually assessed how well the PCs differentiate populations between and within continents. The study then used the supervised learning classification model XGBoost to establish a multidimensional PC-based PCA-XGBoost model, with hyperparameters set through ten-fold cross-validation and a greedy strategy.
Subsequently, the model was optimized and evaluated based on the LR, with accuracy and runtime considered to determine the optimal number of PCs and training rounds, which yielded the study’s optimal BGA inference model. Finally, the optimized model’s generalization capability was validated at national and regional levels using test sets from other public data. Results The reference dataset contains 307 866 SNP sites. The top PCs differ in their ability to differentiate populations, and some PCs are population-specific. In the ADMIXTURE results, smaller K values reveal the ancestral components that distinguish continents, while larger K values reveal ancestral components specific to certain populations within continents. The number of PCs and the number of training rounds significantly affect the classification accuracy and efficiency of the XGBoost supervised model. With the LR-based evaluation method, the optimized PCA-XGBoost model achieved a continental prediction accuracy of over 98% in the reference set. At the subcontinental population level, the model achieved an accuracy of over 95% in the reference set and over 90% in the test set. Conclusion The reference dataset effectively represents the genetic substructure of populations at the selected sites. Information from the PC dimensions substantially aids population differentiation and inference, and incorporating more PC dimensions as features in supervised learning models can increase the accuracy of BGA inference. The model is suitable for high-density SNP data and is not confined to specific regional populations, which gives it broader population applicability. Compared with previous ancestry inference models, the optimized PCA-XGBoost model shows high predictive accuracy for intercontinental populations. The LR-based evaluation method further helps assess the reliability of predictions.
Additionally, the model’s strong generalization capabilities suggest that updating the reference population data could enable more detailed population analysis and inference.
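The workflow summarized in the abstract (project genotypes onto PCs fitted on a reference set, train a boosted-tree classifier on those PCs, then screen predictions with a ratio-based confidence score) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the genotypes are simulated, the function names, the number of PCs, the number of training rounds, and the threshold are hypothetical placeholders, and the hyperparameters are not tuned by the ten-fold cross-validation described above. The abstract does not state how the LR is computed, so the ratio of the two highest class probabilities is used only as a rough proxy. If `xgboost` is not installed, scikit-learn's `GradientBoostingClassifier` is swapped in as a stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

try:
    # Model family named in the abstract.
    from xgboost import XGBClassifier as BoostedTrees
except ImportError:
    # Stand-in with the same fit/predict_proba interface (not XGBoost itself).
    from sklearn.ensemble import GradientBoostingClassifier as BoostedTrees


def simulate_genotypes(n_per_pop=60, n_snps=200, n_pops=3, seed=0):
    """Simulated 0/1/2 genotypes; each population gets perturbed allele frequencies."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.1, 0.9, size=n_snps)
    genotypes, labels = [], []
    for pop in range(n_pops):
        freq = np.clip(base + rng.normal(0.0, 0.15, size=n_snps), 0.01, 0.99)
        genotypes.append(rng.binomial(2, freq, size=(n_per_pop, n_snps)))
        labels.append(np.full(n_per_pop, pop))
    return np.vstack(genotypes).astype(float), np.concatenate(labels)


def top_two_ratio(proba):
    """Ratio of the largest to the second-largest class probability per sample.

    Assumed proxy for an LR-style confidence score; not the paper's definition.
    """
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] / np.maximum(ordered[:, -2], 1e-12)


def run_demo(n_pcs=5, n_rounds=50, ratio_threshold=10.0, seed=0):
    X, y = simulate_genotypes(seed=seed)
    X_ref, X_test, y_ref, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # Fit PCA on the reference set only, then project the test samples onto it.
    pca = PCA(n_components=n_pcs, random_state=seed).fit(X_ref)
    model = BoostedTrees(
        n_estimators=n_rounds, max_depth=3, learning_rate=0.1, random_state=seed
    )
    model.fit(pca.transform(X_ref), y_ref)
    proba = model.predict_proba(pca.transform(X_test))
    # Labels are 0..n_pops-1, so the argmax column index is the predicted label.
    predicted = proba.argmax(axis=1)
    ratio = top_two_ratio(proba)
    return {
        "accuracy": float((predicted == y_test).mean()),
        "n_confident": int((ratio >= ratio_threshold).sum()),
        "n_test": int(len(y_test)),
        "ratio": ratio,
    }


if __name__ == "__main__":
    result = run_demo()
    print("test accuracy:", result["accuracy"])
    print("predictions passing the ratio threshold:",
          result["n_confident"], "of", result["n_test"])
```

Fitting the PCA on the reference samples alone and only projecting the test samples mirrors the reference-set and test-set split described in the abstract, and it keeps test data from influencing the features the classifier is trained on.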
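Yao Haotian, Jiang Li, Wang Chunnian, Fan Hong, Li Caixia. A study of an intercontinental population biogeographical ancestry inference model based on the PCA-XGBoost method [J]. Progress in Biochemistry and Biophysics, 2024, 51(12): 3292-3309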