Differential Analysis of Principal Components using microsatellite data in adegenet

A few days ago I got an e-mail regarding the paper my colleagues and I submitted to the peer-reviewed journal Conservation Genetics. I was really pleased to know that it was very well received by the reviewers and the editors. At the same time, I wasn’t completely surprised that they wanted us to make some changes prior to publication.

One of the main point that the reviewers pointed out concerns the approach we used to estimates population structure based on genetic data.

Put it simply: there is a certain amount of genetic variance in our samples and we wanted to know what is the number of clusters that can best explain it. To  answer this question we used one of the most popular software used in such kind of studies: STRUCTURE. The program assigns individuals to the number of hypothetical cluster that best approximates Hardy-Weinberg and Linkage Equilibrium. Everything happens in a Bayesian computational framework and so, depending on the size of your input file, it can get pretty computational intense. Moreover, results may not be very accurate when there is a marked departure from the software’s assumptions (see Pritchard et al. 2000, Hubisz et al. 2009 and Puechmaille 2016 for further details).

My reviewers suggested to complement the genetic structure analysis with another kind of clustering algorithm that could confirm (or not!) what I observed with STRUCTURE. They suggested a couple of alternatives and I decided to try with the DAPCDifferential Analysis of Principal Components (Jombart et al 2010). The main advantage of a multivariate analysis clustering using genetic data is that it is not based on any underlying assumption. Hence the results should be immune from potential biases that may affect the STRUCTURE analysis. Besides, I never did such an analysis on my data so I figured it would be a nice complement and I may end up learning something new.

The specific advantage of DAPC is that it combines the power and of principal component analyses in partitioning the variance among individual observation, while maximizing the variance between groups via a discriminant analysis (for more details rear to Jombart et al 2010 and see this).

The package adegenet (Jombart 2008) for the R statistical software allows to run this analysis in a relatively simple way. Following is a quick tutorial to produce a nice graphical output of a dummy dataset that I produced using the software EASYPOP (Balloux 2001). I am going to assume that you already have adegenet and its dependencies installed on your machine. If that is the case you can type the code below:

 > library(adegenet)
> setwd("/path/to/your/data/file")
> df <- read.genepop("yourFileInGenepopFormat.gen", ncode = 2)
> df_dapc2 <- dapc(df, n.da=100, n.pca=100)
> temp <- optim.a.score(df_dapc2)
> df_dapc2 <- dapc(df, n.pca = 40, n.da = 100)
> scatter(df_dapc2, xax = 1, yax = 2, col = colorRampPalette(c("blue", "red","orange","green"))( 20), bg = "#DEDEDE", 
 cstar = 0, solid = .7, pch = 20, cex = 2, cell = 0, 
 scree.pca = T, posi.pca = "topleft", scree.da = T)
 

This code should produce two nice graphical output that in the case of my dataset are the following.

DAPC - optimization of a-score
DAPC – optimization of a-score
Differential Analysis of Principal Component
Differential Analysis of Principal Components

The first graph helps us selecting the appropriate number of PCs (principal components) to retain without overfitting the data (in this case if we retain 40 PCs we should be ok). The second graph is the actual representation of our data according to the differential analysis. Looking at this graph it doesn’t seem there to be a very strong differentiation in multiple groups in these 20 “populations”, although we can recognize at least two distinct “clouds” of data.

Notice that I haven’t actually asked the program to really identify what is the number of clusters that can help us explaining the variance in our data. The code to do that can be found here (and in any case is worth looking at the page cause it has way more information than what I showed in this post).

If you want to know more details on the code that I used above or you simply want to share your thoughts feel free to write a comment below.


References:

Bolloux F (2001) EASYPOP (version 1.7): a computer program for population genetics simulations. Journal of Heredity 92(3):301-2.

Hubisz MJ et al (2009) Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources 9: 1322–1332. doi: 10.1111/j.1755-0998.2009.02591.x

Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24(11):1403-5. doi: 10.1093/bioinformatics/btn129

Jombart T et al (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. Genetics 11: 94 http://www.biomedcentral.com/1471-2156/11/94

Pritchard JK et al (2000) Inference of Population Structure Using Multilocus Genotype Data. Genetics 155: 945–959

Puechmaille SJ (2016) The program STRUCTURE does not reliably recover the correct structure when sampling is uneven: subsampling and new estimators alleviate the problem. Molecular Ecology Resources 16: 608–627. doi: 10.1111/1755-0998.12512

 

Updating R on OS X

Once again I found myself dealing with one of those tasks that should almost come automatic by now, but every time I have to browse the internet for at least 15 minutes to find something that seems appropriate and, more importantly, that is well explained (of course StackOverflow was there to help). So here it goes, a short post on how to update R version on your Mac OS X (I am running El Capitan 10.11.4). At least next time I will just browse posts on my blog and find exactly what I am looking for.

R is the famous free software that allows to do all sort of statistical data analyses (more on this here). Apparently there is no easy way to update R and its packages, not even from the RStudio GUI (or at least I couldn’t find it). Honestly is not even that hard. But if you don’t know how to do it can be problematic. In my specific case I wanted to upgrade from the 3.2.0 (Full of Ingredients) to the latest 3.3.0 (Supposedly Educational) version. The greatest concern as usual is what is going to happen to all the packages that I already have in my library.

So here it goes…

  1. Download and install the latest version (or the one you want to install) from here
  2. Start RStudio, or simply start R and it should automatically run the version that you just installed. Close it.
  3. On a Mac OS X usually the R packages are stored in…
     /Library/Frameworks/R.framework/Versions/yourVersion/Resources/library 

    …where “yourVersion” is the version of R storing all the libraries.

  4. What you can do is simply copy all your old packages from the old library folder and past them in the new library folder. Note that R usually comes with some default packages, and those are pretty much standardized across versions. So when the computer asks to replace or not the files with same name you can either say yes or no. Even if you replace the default packages with the one from the previous version it should’t be too much of a problem because in the next step you will be performing an upgrade of all the packages that require an upgrade.
  5. Open RStudio and type the following command…
     > update.packages(checkBuilt=TRUE) 

    …and R should start looking of what version of the package you have installed and, if there is a newer version, it will ask if you want to update update.  Type “y” at the prompt.

  6. Once all of these is done you can type…
     > version
    
    > packageStatus() 

    …to have an idea of what you have installed on your computer.

 

Hope this is useful. I am sure it will be for me in the future.