
## Statistics in human genetics and molecular biology / Cavan Reilly

Volume 67, Issue 4.


David B.

In contrast, the first two dimensions of the MDS using V and all the individuals explained These results support that the V transformation reduces the amount of non-shared i.

The average amount of minimum sampling site differentiation based on these 26 clusters was 0. In contrast, Mclust using the first 10 MDS dimensions from the V matrix proposed 37 different genetic clusters, with all the populations sharing at least one of the proposed clusters (see Figure 3B). GemTools analysis proposed 56 different genetic clusters (see Figure 3D).

Only in the case of Belgrade was there an improvement compared to all the other methods (see Figure 4). Hence, GAGA was able to increase the geographic resolution compared to other methods. We further analysed the effect of different sample sizes on the outcome of the different methods. We repeated all the analyses with a subset of 19 populations, after excluding Lisbon-Portugal, Dublin-Ireland, Budapest-Hungary and Bucharest-Romania, with an equal sample size of 40 individuals.

Nevertheless, the association between the proposed clusters and the subpopulation samples increases in all the methods. This is particularly pronounced in the case of GemTools (see Table 1). The values of minimum informativeness differentiation of GemTools increased to 0. We have described a new distance matrix transformation that tends to minimize the within-population variance without knowing the subpopulations a priori, and have shown, by means of computer simulations and application to real European genetic data, that this new approach improves the differentiation among subpopulations compared to the original distance matrix.

A practical result of our analyses is that this matrix transformation improves the output of MDS, both in explained variance and in resolution, as well as the AMOVA estimates. One could also consider estimating K based on parameterized Gaussian mixture models [56], such as those implemented in Mclust.

Nevertheless, the choice of K is rather arbitrary depending on the required resolution and subject of further study.


Most importantly, our finding of previously undetected fine-scale human population substructure, down to the level of sampling sites or subpopulations within Europe, has important implications for various basic and applied fields of life science. With relevance for genetic epidemiology, our results suggest that the genetic homogeneity desired in case-control studies should preferably be established by analyzing the relationships of pairs of individuals in the context of all other individuals tested, rather than by analyzing how genetically similar individuals are, as is usually done.

The GAGA approach we introduce here is now available for application to all types of genetic data. (A) Model of three parental populations and one admixed population, each one with 10 individuals (black dots; only two individuals per population are shown in the graph). Eight different possible situations were considered. For each possible combination, simulations were conducted. The edge distance of an individual to the adjacent population vertex was randomly modelled using a uniform distribution U 0.


The assumed error in the estimation was drawn from a Normal distribution N 0, 0. The distance between adjacent populations was simulated from a uniform distribution U(m, 1) if the distance was larger than the minimum distance m of an individual to its population, or U(0, m) if the distance between two adjacent populations was smaller. Analysis pipeline applied to the genome-wide data from 2, individuals from 23 European subpopulations [21]. The distance matrix is then used to perform an MDS analysis (procedure 2), resulting in a set of MDS coordinates in a reduced Euclidean space.
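As an illustration of the first step of a classical MDS, the sketch below double-centers a squared distance matrix to obtain the matrix whose leading eigenvectors give the MDS coordinates. It is a minimal sketch, not the authors' implementation: the function name `double_center` and the toy points are assumptions made for the example, and the eigen-decomposition itself is left out.

```python
def double_center(dist):
    """Return B = -1/2 * J * D^2 * J for a symmetric distance matrix.

    `dist` is a list of lists of pairwise distances. The eigenvectors of B,
    scaled by the square roots of its positive eigenvalues, are the classical
    MDS coordinates (that step is not shown here).
    """
    n = len(dist)
    squared = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # The matrix is symmetric, so column means equal row means.
    return [
        [-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: three individuals placed on a line.
points = [0.0, 1.0, 3.0]
dist = [[abs(a - b) for b in points] for a in points]
B = double_center(dist)
```

For Euclidean distances, `B` equals the matrix of inner products of the mean-centered coordinates, which is why its eigen-decomposition recovers the original configuration up to rotation.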

Applying clustering algorithms, such as Mclust, to the MDS coordinates while supplying an arbitrary number of clusters, k, will assign all individuals to k clusters (procedure 3). This clustering configuration can be evaluated for concordance with the true population sampling origin labels using cross-tabulations (procedure 4) by means of, for example, the minimum Informativeness of Ancestry, which gives a single numeric value for each population between 0 and log 2, with a larger value indicating higher population differentiation (procedure 5).
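The cross-tabulation step can be sketched as follows, assuming each individual carries a sampling-population label and a cluster label. The function name `cluster_frequencies` and the toy labels are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter, defaultdict


def cluster_frequencies(populations, clusters):
    """Cross-tabulate population labels against cluster labels.

    Returns {population: {cluster: relative frequency}}.
    """
    counts = defaultdict(Counter)
    for pop, cluster in zip(populations, clusters):
        counts[pop][cluster] += 1
    freqs = {}
    for pop, counter in counts.items():
        total = sum(counter.values())
        freqs[pop] = {cluster: k / total for cluster, k in counter.items()}
    return freqs


# Hypothetical toy labels: eight individuals, two sampling sites, two clusters.
populations = ["A", "A", "A", "A", "B", "B", "B", "B"]
clusters = [1, 1, 1, 2, 2, 2, 2, 2]
freqs = cluster_frequencies(populations, clusters)
```

The per-population cluster frequencies produced this way are the inputs that a differentiation statistic between sampling sites would work from.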


In the case of SPA, Mclust is applied to identify clusters of individuals (procedure 4). This step highlights the genetic differentiation among the a priori unknown subpopulations. A genetic algorithm is then applied to search for the optimal clustering configuration (procedure 8). Demographic scenarios used to test the performance of V and D matrices. Each simulation consists of 10, randomly ascertained SNPs (see Figure 2 for simulations with , SNPs) simulated with ms software in 25 populations (10 diploid individuals in each population).

Each population exchanges a fraction m of migrants with the neighbor populations each generation. Percentage of variation explained by each eigenvalue from a classical Multidimensional Scaling analysis when using the D matrix based on the T1 statistic or the transformed V distance matrix on 2, European individuals sampled at , linkage disequilibrium (LD) pruned SNPs. Underlined populations were excluded from the analyses considering equal sample size.

Table showing the individuals with the best overall genetic match BOM in the same population of sampling or in a different population when using the D statistic based on T1 similarity as measure of genetic dissimilarity. Table showing the individuals with the BOM in the same population of sampling or in a different population when using the V statistic as measure of genetic dissimilarity. Supplementary information describing the Pseudocode for the Computation of the V matrix, implementation of the Genetic algorithm for exploring the space of solutions and demographic simulations.


We are grateful to the numerous colleagues who contributed with either samples or data to the establishment of the previously published genome-wide European dataset used here: Miroslava Balascakova, Jaume Bertranpetit, Laurence A. Nelson, and Michael Krawczak.

Conceived and designed the experiments: OL FL.

## Abstract

Attempts to detect genetic population substructure in humans are troubled by the fact that the vast majority of the total amount of observed genetic variation is present within populations rather than between populations.

## Author Summary

Understanding genetic population substructure is important in evolutionary biology, behavioral ecology, medical genetics and forensic genetics, among others.

## Introduction

To what degree genetically homogeneous groups of human individuals exist is a long-standing and still unresolved debate in the scientific community [1].

## Materials and Methods

### Quantifying the amount of genetic differentiation between populations

Our algorithm starts with a genetic distance matrix D computed for each possible pair among N individuals, which in this study is derived from the T1 statistic [31]. T1 is defined for a given pair of individuals i and j as in Equation 1, where n_xx,yy denotes the number of SNPs of a particular genotype pattern i.

The rationale for proposing the V matrix transformation is as follows. Following the AMOVA framework, individual relationships are modelled using a list colouring of a graph [38], so each vertex can be assigned to an individual, a non-admixed population, an admixed population, or a group of populations (see Figure 1A). Therefore, for a pair of individuals i, j, the distance d_i,j can be decomposed into within- and between-population distances (see Figure 1B), as given in Equation 5.

Figure 1.

### Genetic algorithm for exploring the solution space

The AMOVA framework has previously been applied to identify the best genetically homogeneous sets of geographically related populations [39] by trying to maximize the amount of genetic differentiation among groups of populations (conversely, minimizing the variance within groups of populations).
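To make the objective concrete, the sketch below scores a candidate grouping by the fraction of the total sum of squared distances that lies among groups, using the standard AMOVA-style partition of sums of squares computed directly from a distance matrix. It is an assumption-laden illustration, not the authors' genetic algorithm: the function name `among_group_fraction` and the toy data are hypothetical, and the search over groupings is not shown.

```python
def among_group_fraction(dist, labels):
    """Fraction of the total sum of squared distances lying among groups.

    Uses SS = (1/m) * sum over pairs i<j of d_ij^2 for a set of m individuals.
    Assumes at least one non-zero distance (otherwise the total is zero).
    """
    def sum_of_squares(indices):
        pair_total = sum(
            dist[i][j] ** 2
            for a, i in enumerate(indices)
            for j in indices[a + 1:]
        )
        return pair_total / len(indices)

    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)

    ss_total = sum_of_squares(list(range(len(labels))))
    ss_within = sum(sum_of_squares(members) for members in groups.values())
    return (ss_total - ss_within) / ss_total


# Hypothetical toy data: two tight pairs of individuals far from each other.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]
good = among_group_fraction(dist, [0, 0, 1, 1])  # groups match the structure
bad = among_group_fraction(dist, [0, 1, 0, 1])   # groups cut across it
```

A search procedure such as a genetic algorithm could use a score of this kind as its fitness function, preferring groupings like `good` over groupings like `bad`.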

### Computer simulations

In order to test the V matrix transformation in a known graph model, we performed simulations on four populations of 10 individuals each, modelling a situation of three parental populations and one admixed population (see Figure S1).

Figure 2.

### Estimation of the sampling site differentiation based on proposed genetic clusters

The Cramér's V value [55] was used to summarize the goodness of fit between the proposed clusters and the labelled population origin of the individuals.
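For reference, Cramér's V can be computed from a clusters-by-populations contingency table with the usual chi-square based definition. The sketch below is a minimal version that assumes every row and column has at least one observation; the function name `cramers_v` and the two toy tables are illustrative only.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Assumes no empty rows or columns and at least two rows and two columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical tables: rows are populations, columns are proposed clusters.
perfect = cramers_v([[10, 0], [0, 10]])     # clusters match populations
unrelated = cramers_v([[5, 5], [5, 5]])     # clusters carry no information
```

Values close to 1 indicate that the proposed clusters reproduce the population labels, and values close to 0 indicate no association.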

Also, in order to quantify how well the genetic clusters proposed by each method differentiate each sampling location or subpopulation from all the others, we computed the Informativeness of Ancestry (I_n) statistic [37] between each pair of sampling locations, using the frequency of the clusters proposed by each method (Equation 13), where K is the number of proposed clusters, p_sc is the frequency of cluster c in sampling location s, and p_tc is the frequency of cluster c in sampling location t.
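The sketch below assumes the commonly used two-population form of the informativeness for assignment, I_n = Σ_c [ −p_c log p_c + ½ (p_sc log p_sc + p_tc log p_tc) ], with p_c = (p_sc + p_tc) / 2 and natural logarithms. Whether this matches the exact expression used here cannot be confirmed from the text, so treat it as an illustration; the function name `informativeness` and the toy frequencies are hypothetical.

```python
import math


def informativeness(p_s, p_t):
    """Two-population informativeness I_n from cluster frequencies.

    `p_s` and `p_t` map cluster label -> frequency in sampling locations
    s and t. Missing clusters are treated as frequency 0, and 0 * log(0)
    is taken to be 0.
    """
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0

    total = 0.0
    for cluster in set(p_s) | set(p_t):
        ps = p_s.get(cluster, 0.0)
        pt = p_t.get(cluster, 0.0)
        mean = (ps + pt) / 2
        total += -xlogx(mean) + 0.5 * (xlogx(ps) + xlogx(pt))
    return total


# Hypothetical frequencies of two proposed clusters in two sampling sites.
same = informativeness({1: 0.5, 2: 0.5}, {1: 0.5, 2: 0.5})
disjoint = informativeness({1: 1.0}, {2: 1.0})
```

Under this form the statistic is 0 when two locations have identical cluster frequencies and reaches log 2 when they share no cluster, in line with the 0 to log 2 range stated for the pipeline.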

The complete methodological pipeline is depicted in Figure S2.

### Testing the new approach by means of computer simulations

We started by comparing the V and D matrices in explaining the between-population variation in a simple case modelling four populations under different scenarios of distances between individuals and populations.

### Application of the V matrix on human genome-wide data from Europe

Given the promising results obtained in the computer simulations, we applied our newly developed approach to a previously collected dataset comprising 2, individuals from 23 European subpopulations using , LD pruned genome-wide SNPs [21].