1) School of Computer Science, Shaanxi Normal University, Xi'an 710119, China; 2) Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China
This work was supported by grants from the National Key R&D Program of China (2022YFC3341004), the Natural Science Foundation of China (82171870), the Key Project of the Natural Science Foundation of Shaanxi Province (2022ZJ-39), the Key Laboratory of Forensic Genetics Open Project (2023FGKFKT01), and the Fundamental Research Funds for the Institute of Forensic Science (2022JB020).
Objective The inference of biogeographical ancestry (BGA) from DNA is a significant focus of anthropology and forensic science. Current methods often use dozens of ancestry-informative SNPs and apply principal component analysis (PCA) and likelihood ratios (LR) to determine an individual's ancestry. However, the selection of these SNPs tends to be population-specific, and such panels show limitations in population differentiation. With the development of high-throughput sequencing technologies, high-density SNP datasets have become easier to acquire, which challenges traditional statistical models that often rely on prior assumptions and struggle with high-density genetic data. Machine learning, which prioritizes learning from data and algorithmic iteration over prior knowledge, has driven new developments in BGA research. This study aims to construct a BGA inference model that is suitable for high-density SNP data and offers broad population applicability, higher accuracy, and strong generalization capability.
Methods First, autosomal sites shared between the phase III data of the 1000 Genomes Project and commonly used commercial chips were selected, and a reference dataset was built after thorough site quality control and filtering. This dataset was analyzed with PCA and ADMIXTURE to study population clustering, the mixing of ancestral components, and genetic substructure. Using spaces defined by different combinations of principal components (PCs), the study visually assessed the PCs' ability to differentiate between continental and intercontinental populations. The study then applied the supervised classification model XGBoost to build a PCA-XGBoost model on multidimensional PC features, with hyperparameters set through ten-fold cross-validation and a greedy strategy. The model was then optimized and evaluated on the basis of the LR, taking accuracy and runtime into account to determine the optimal number of PCs and training rounds, which yielded the study's optimal BGA inference model. Finally, the optimized model's performance was validated at national and regional levels using test sets from other public data to assess its generalization capability.
Results The reference dataset contains 307 866 SNP sites. The top PCs differ in their ability to differentiate populations, and some PCs are population-specific. In the ADMIXTURE results, smaller K values reveal the ancestral components that distinguish continents, while larger K values reveal ancestral components specific to certain populations within continents. The number of PCs and the number of training rounds significantly affect the classification accuracy and efficiency of the supervised XGBoost model. With LR-based evaluation, the optimized PCA-XGBoost model achieved a continental prediction accuracy of over 98% in the reference set. At the level of subcontinental populations within continents, the model achieved an accuracy of over 95% in the reference set and over 90% in the test set.
Conclusion The reference dataset effectively represents the genetic substructure of the populations at the selected sites. Information carried by the PC dimensions significantly aids population differentiation and inference, and incorporating more PC dimensions as features in supervised learning models can increase the accuracy of BGA inference. The model in this study is suitable for high-density SNP data and is not confined to specific regional populations, which broadens its applicability across populations. Compared with previous ancestry inference models, the optimized PCA-XGBoost model demonstrates high predictive accuracy for intercontinental populations. LR-based evaluation further enhances the reliability of predictions. Additionally, the model's strong generalization capability suggests that updating the reference population data could enable more detailed population analysis and inference.
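The abstract does not state how the LR is computed from the classifier output. As an illustration only, the sketch below assumes one simple possibility: the ratio of the highest predicted class probability to the second highest, with a prediction accepted only when that ratio reaches a threshold. The function names, the threshold value of 10, the population labels, and the example probabilities are all hypothetical and are not taken from the study; in practice the probabilities could come from a classifier such as XGBoost.

```python
from typing import Dict, Tuple


def top_two_likelihood_ratio(probs: Dict[str, float],
                             eps: float = 1e-12) -> Tuple[str, float]:
    """Return the top-ranked population and the ratio of its
    probability to that of the runner-up (an assumed LR definition)."""
    if len(probs) < 2:
        raise ValueError("need at least two candidate populations")
    # Rank candidate populations from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    # eps guards against division by zero when the runner-up has p = 0.
    return best_label, best_p / max(second_p, eps)


def call_ancestry(probs: Dict[str, float],
                  threshold: float = 10.0) -> Tuple[str, float]:
    """Report the top population only if the ratio reaches the
    (hypothetical) threshold; otherwise report 'inconclusive'."""
    label, lr = top_two_likelihood_ratio(probs)
    return (label if lr >= threshold else "inconclusive"), lr


# Hypothetical class probabilities for a single sample.
example = {"AFR": 0.01, "EUR": 0.04, "EAS": 0.90, "SAS": 0.04, "AMR": 0.01}
print(call_ancestry(example))
```

Under this assumed definition, a sample whose two leading probabilities are close yields a ratio near 1 and is reported as inconclusive rather than assigned to a population, which is one way an LR-based check can make individual predictions more reliable.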
YAO Hao-Tian, JIANG Li, WANG Chun-Nian, FAN Hong, LI Cai-Xia. Research on The Intercontinental Population Biogeographic Ancestral Inference Model Based on PCA-XGBoost Method[J]. Progress in Biochemistry and Biophysics,,():