Recent studies have demonstrated that principal component analysis (PCA) can detect population mixture and admixture in a sample and can therefore be used to correct for population stratification in genome-wide association studies (GWAS). We propose an approach complementary to PCA that compensates for its potential weaknesses, so that population structure analyses can be performed with limited numbers of subjects and single-nucleotide polymorphisms (SNPs). Our method first requires a PCA of the largest reference sample from a population to standardize the system. Once the system is established, it can perform PCA for each individual drawn from the same population using a much smaller number of SNPs. This is made possible by the introduction of probabilistic PCA, under which the prediction of the principal components (PCs) is performed within a rigorous probabilistic framework. A subsequent linear discriminant analysis then indicates the ancestries or subpopulations from which a given individual is most likely to derive, expressed as posterior probabilities given the predicted PCs. A real-world prototype of the system for the Japanese population, developed from 19 260 subjects, illustrates the potential usefulness of the system as an aid in detecting population structure in validation samples, or in correcting for population stratification in GWAS.
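The three steps described above (a reference PCA, probabilistic prediction of PCs from a SNP subset, and linear discriminant analysis on the predicted PCs) can be illustrated with a minimal sketch. It assumes the standard closed-form maximum-likelihood probabilistic PCA, which may differ from the formulation used in the paper, and it runs on simulated genotypes with deliberately exaggerated allele-frequency differences. All function names, sample sizes and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_ppca(X, q):
    """Closed-form maximum-likelihood probabilistic PCA on a reference matrix X (n x d)."""
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    eigvals = s**2 / X.shape[0]              # eigenvalues of the sample covariance
    d = X.shape[1]
    sigma2 = eigvals[q:].sum() / (d - q)     # noise variance: mean of discarded eigenvalues
    W = Vt[:q].T * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return mu, W, sigma2

def predict_pcs(x_obs, idx, mu, W, sigma2):
    """Posterior mean of the latent PCs given only the SNPs listed in idx."""
    Wo = W[idx]
    M = Wo.T @ Wo + sigma2 * np.eye(W.shape[1])
    return np.linalg.solve(M, Wo.T @ (x_obs - mu[idx]))

def fit_lda(Z, y):
    """Class means, pooled within-class covariance and priors for LDA."""
    classes = np.unique(y)
    means = np.array([Z[y == c].mean(axis=0) for c in classes])
    resid = Z - means[np.searchsorted(classes, y)]
    cov = resid.T @ resid / (len(y) - len(classes))
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, cov, priors

def lda_posterior(z, means, cov, priors):
    """Posterior class probabilities for one vector of predicted PCs."""
    P = np.linalg.inv(cov)
    scores = means @ P @ z - 0.5 * np.einsum("kq,qr,kr->k", means, P, means) + np.log(priors)
    p = np.exp(scores - scores.max())        # subtract the max for numerical stability
    return p / p.sum()

# Simulated reference sample: two subpopulations, 100 subjects each, 500 SNPs.
# Independent allele frequencies exaggerate the structure (an assumption for the demo).
rng = np.random.default_rng(0)
d = 500
p_a = rng.uniform(0.05, 0.95, d)
p_b = rng.uniform(0.05, 0.95, d)
X_ref = np.vstack([rng.binomial(2, p_a, size=(100, d)),
                   rng.binomial(2, p_b, size=(100, d))]).astype(float)
y_ref = np.repeat([0, 1], 100)

# Step 1: standardize the system with the reference sample.
mu, W, sigma2 = fit_ppca(X_ref, q=2)
all_idx = np.arange(d)
Z_ref = np.array([predict_pcs(x, all_idx, mu, W, sigma2) for x in X_ref])
classes, means, cov, priors = fit_lda(Z_ref, y_ref)

# Steps 2 and 3: a new individual from subpopulation 1, genotyped at only 100 SNPs.
x_new = rng.binomial(2, p_b).astype(float)
idx = rng.choice(d, size=100, replace=False)
z_new = predict_pcs(x_new[idx], idx, mu, W, sigma2)
post = lda_posterior(z_new, means, cov, priors)
print(dict(zip(classes.tolist(), post.round(4).tolist())))
```

In this sketch the reference sample fixes the mean, the loadings and the noise variance once. Each new individual then needs only the rows of the loading matrix that correspond to the SNPs actually genotyped, which is why a much smaller SNP panel suffices for prediction.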