A few days ago I got an e-mail regarding the paper my colleagues and I submitted to the peer-reviewed journal Conservation Genetics. I was really pleased to know that it was very well received by the reviewers and the editors. At the same time, I wasn’t completely surprised that they wanted us to make some changes prior to publication.
One of the main point that the reviewers pointed out concerns the approach we used to estimates population structure based on genetic data.
Put it simply: there is a certain amount of genetic variance in our samples and we wanted to know what is the number of clusters that can best explain it. To answer this question we used one of the most popular software used in such kind of studies: STRUCTURE. The program assigns individuals to the number of hypothetical cluster that best approximates Hardy-Weinberg and Linkage Equilibrium. Everything happens in a Bayesian computational framework and so, depending on the size of your input file, it can get pretty computational intense. Moreover, results may not be very accurate when there is a marked departure from the software’s assumptions (see Pritchard et al. 2000, Hubisz et al. 2009 and Puechmaille 2016 for further details).
My reviewers suggested to complement the genetic structure analysis with another kind of clustering algorithm that could confirm (or not!) what I observed with STRUCTURE. They suggested a couple of alternatives and I decided to try with the DAPC – Differential Analysis of Principal Components (Jombart et al 2010). The main advantage of a multivariate analysis clustering using genetic data is that it is not based on any underlying assumption. Hence the results should be immune from potential biases that may affect the STRUCTURE analysis. Besides, I never did such an analysis on my data so I figured it would be a nice complement and I may end up learning something new.
The specific advantage of DAPC is that it combines the power and of principal component analyses in partitioning the variance among individual observation, while maximizing the variance between groups via a discriminant analysis (for more details rear to Jombart et al 2010 and see this).
The package adegenet (Jombart 2008) for the R statistical software allows to run this analysis in a relatively simple way. Following is a quick tutorial to produce a nice graphical output of a dummy dataset that I produced using the software EASYPOP (Balloux 2001). I am going to assume that you already have adegenet and its dependencies installed on your machine. If that is the case you can type the code below:
> library(adegenet) > setwd("/path/to/your/data/file") > df <- read.genepop("yourFileInGenepopFormat.gen", ncode = 2) > df_dapc2 <- dapc(df, n.da=100, n.pca=100) > temp <- optim.a.score(df_dapc2) > df_dapc2 <- dapc(df, n.pca = 40, n.da = 100) > scatter(df_dapc2, xax = 1, yax = 2, col = colorRampPalette(c("blue", "red","orange","green"))( 20), bg = "#DEDEDE", cstar = 0, solid = .7, pch = 20, cex = 2, cell = 0, scree.pca = T, posi.pca = "topleft", scree.da = T)
This code should produce two nice graphical output that in the case of my dataset are the following.
The first graph helps us selecting the appropriate number of PCs (principal components) to retain without overfitting the data (in this case if we retain 40 PCs we should be ok). The second graph is the actual representation of our data according to the differential analysis. Looking at this graph it doesn’t seem there to be a very strong differentiation in multiple groups in these 20 “populations”, although we can recognize at least two distinct “clouds” of data.
Notice that I haven’t actually asked the program to really identify what is the number of clusters that can help us explaining the variance in our data. The code to do that can be found here (and in any case is worth looking at the page cause it has way more information than what I showed in this post).
If you want to know more details on the code that I used above or you simply want to share your thoughts feel free to write a comment below.
Bolloux F (2001) EASYPOP (version 1.7): a computer program for population genetics simulations. Journal of Heredity 92(3):301-2.
Hubisz MJ et al (2009) Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources 9: 1322–1332. doi: 10.1111/j.1755-0998.2009.02591.x
Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24(11):1403-5. doi: 10.1093/bioinformatics/btn129
Jombart T et al (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. Genetics 11: 94 http://www.biomedcentral.com/1471-2156/11/94
Pritchard JK et al (2000) Inference of Population Structure Using Multilocus Genotype Data. Genetics 155: 945–959
Puechmaille SJ (2016) The program STRUCTURE does not reliably recover the correct structure when sampling is uneven: subsampling and new estimators alleviate the problem. Molecular Ecology Resources 16: 608–627. doi: 10.1111/1755-0998.12512