Tag Archives: R

Differential Analysis of Principal Components using microsatellite data in adegenet

A few days ago I got an e-mail regarding the paper my colleagues and I submitted to the peer-reviewed journal Conservation Genetics. I was really pleased to know that it was very well received by the reviewers and the editors. At the same time, I wasn’t completely surprised that they wanted us to make some changes prior to publication.

One of the main point that the reviewers pointed out concerns the approach we used to estimates population structure based on genetic data.

Put it simply: there is a certain amount of genetic variance in our samples and we wanted to know what is the number of clusters that can best explain it. To  answer this question we used one of the most popular software used in such kind of studies: STRUCTURE. The program assigns individuals to the number of hypothetical cluster that best approximates Hardy-Weinberg and Linkage Equilibrium. Everything happens in a Bayesian computational framework and so, depending on the size of your input file, it can get pretty computational intense. Moreover, results may not be very accurate when there is a marked departure from the software’s assumptions (see Pritchard et al. 2000, Hubisz et al. 2009 and Puechmaille 2016 for further details).

My reviewers suggested to complement the genetic structure analysis with another kind of clustering algorithm that could confirm (or not!) what I observed with STRUCTURE. They suggested a couple of alternatives and I decided to try with the DAPCDifferential Analysis of Principal Components (Jombart et al 2010). The main advantage of a multivariate analysis clustering using genetic data is that it is not based on any underlying assumption. Hence the results should be immune from potential biases that may affect the STRUCTURE analysis. Besides, I never did such an analysis on my data so I figured it would be a nice complement and I may end up learning something new.

The specific advantage of DAPC is that it combines the power and of principal component analyses in partitioning the variance among individual observation, while maximizing the variance between groups via a discriminant analysis (for more details rear to Jombart et al 2010 and see this).

The package adegenet (Jombart 2008) for the R statistical software allows to run this analysis in a relatively simple way. Following is a quick tutorial to produce a nice graphical output of a dummy dataset that I produced using the software EASYPOP (Balloux 2001). I am going to assume that you already have adegenet and its dependencies installed on your machine. If that is the case you can type the code below:

 > library(adegenet)
> setwd("/path/to/your/data/file")
> df <- read.genepop("yourFileInGenepopFormat.gen", ncode = 2)
> df_dapc2 <- dapc(df, n.da=100, n.pca=100)
> temp <- optim.a.score(df_dapc2)
> df_dapc2 <- dapc(df, n.pca = 40, n.da = 100)
> scatter(df_dapc2, xax = 1, yax = 2, col = colorRampPalette(c("blue", "red","orange","green"))( 20), bg = "#DEDEDE", 
 cstar = 0, solid = .7, pch = 20, cex = 2, cell = 0, 
 scree.pca = T, posi.pca = "topleft", scree.da = T)

This code should produce two nice graphical output that in the case of my dataset are the following.

DAPC - optimization of a-score
DAPC – optimization of a-score
Differential Analysis of Principal Component
Differential Analysis of Principal Components

The first graph helps us selecting the appropriate number of PCs (principal components) to retain without overfitting the data (in this case if we retain 40 PCs we should be ok). The second graph is the actual representation of our data according to the differential analysis. Looking at this graph it doesn’t seem there to be a very strong differentiation in multiple groups in these 20 “populations”, although we can recognize at least two distinct “clouds” of data.

Notice that I haven’t actually asked the program to really identify what is the number of clusters that can help us explaining the variance in our data. The code to do that can be found here (and in any case is worth looking at the page cause it has way more information than what I showed in this post).

If you want to know more details on the code that I used above or you simply want to share your thoughts feel free to write a comment below.


Bolloux F (2001) EASYPOP (version 1.7): a computer program for population genetics simulations. Journal of Heredity 92(3):301-2.

Hubisz MJ et al (2009) Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources 9: 1322–1332. doi: 10.1111/j.1755-0998.2009.02591.x

Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24(11):1403-5. doi: 10.1093/bioinformatics/btn129

Jombart T et al (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. Genetics 11: 94 http://www.biomedcentral.com/1471-2156/11/94

Pritchard JK et al (2000) Inference of Population Structure Using Multilocus Genotype Data. Genetics 155: 945–959

Puechmaille SJ (2016) The program STRUCTURE does not reliably recover the correct structure when sampling is uneven: subsampling and new estimators alleviate the problem. Molecular Ecology Resources 16: 608–627. doi: 10.1111/1755-0998.12512


Updating R on OS X

Once again I found myself dealing with one of those tasks that should almost come automatic by now, but every time I have to browse the internet for at least 15 minutes to find something that seems appropriate and, more importantly, that is well explained (of course StackOverflow was there to help). So here it goes, a short post on how to update R version on your Mac OS X (I am running El Capitan 10.11.4). At least next time I will just browse posts on my blog and find exactly what I am looking for.

R is the famous free software that allows to do all sort of statistical data analyses (more on this here). Apparently there is no easy way to update R and its packages, not even from the RStudio GUI (or at least I couldn’t find it). Honestly is not even that hard. But if you don’t know how to do it can be problematic. In my specific case I wanted to upgrade from the 3.2.0 (Full of Ingredients) to the latest 3.3.0 (Supposedly Educational) version. The greatest concern as usual is what is going to happen to all the packages that I already have in my library.

So here it goes…

  1. Download and install the latest version (or the one you want to install) from here
  2. Start RStudio, or simply start R and it should automatically run the version that you just installed. Close it.
  3. On a Mac OS X usually the R packages are stored in…

    …where “yourVersion” is the version of R storing all the libraries.

  4. What you can do is simply copy all your old packages from the old library folder and past them in the new library folder. Note that R usually comes with some default packages, and those are pretty much standardized across versions. So when the computer asks to replace or not the files with same name you can either say yes or no. Even if you replace the default packages with the one from the previous version it should’t be too much of a problem because in the next step you will be performing an upgrade of all the packages that require an upgrade.
  5. Open RStudio and type the following command…
     > update.packages(checkBuilt=TRUE) 

    …and R should start looking of what version of the package you have installed and, if there is a newer version, it will ask if you want to update update.  Type “y” at the prompt.

  6. Once all of these is done you can type…
     > version
    > packageStatus() 

    …to have an idea of what you have installed on your computer.


Hope this is useful. I am sure it will be for me in the future.

Understanding Genetic Drift with the help of R.

A long standing debate among evolutionary biologists concerns the contribution of genetic drift to evolving populations. Fisher and Wright were the first scientists with different opinions on this topic. The former was in favor of selection as major engine of population evolution, while the latter argued that genetic drift could have a paramount effect, especially in small populations.

What is genetic drift? Genetic drift is the random loss of genetic variability within populations, generation after generation. By “random” I mean that it is impossible (or almost impossible) to predict the directionality of this process (i.e. whether an allele will increase or reduce in frequency). This, though, doesn’t prevent us to try and quantify the effect of genetic drift on allele frequency. As we shall see, the size-effect of genetic drift is strictly dependent on population size.

Lets assume that we have a very large population. Individuals have been genotyped for a di-allelic locus (A and a). Turns out that the frequency of A is p = 0.5 and the frequency of a is, hence,  1 – p. One generation goes by. How big of allele frequency change we expect to see in this new generation? Lets try to work this out, and always remind that we are only looking at genetic drift and not considering any other evolutionary mechanism like selection or the insurgence of new mutations.

A different way to ask the above question is: what is the probability of having the exact same allele frequency in the new generation if we sample 2N individuals? (two indicates that we are working with a diploid species.)

For example, if we extract 50000 gametes, we need exactly 25000 of them to be A in order to maintain p = 0.5. This is a classical binomial example in that we only have two possible outcomes (allele A and a), the probability of each outcome stays the same every time we extract a gamete (we allow replacement), and every gamete extraction is independent from the next one. So, what is the probability that extracting 50000 gametes we get exactly 25000 successes, i.e. 25000 times allele A?  In R is actually pretty easy. The following code will calculate the binomial probability of having exactly 25000 successes over 50000 trials and giving a probability of success of 0.5.

> dbinom(25000, 50000, 0.5)
[1] 0.00356823

Well. That probability is rather small. Would this change if we sample less individuals? We can try with ten, in which case we would need 5 A alleles to maintain p = 0.5

> dbinom(5, 10, 0.5)
[1] 0.2460938

According to the above data it seems that the smaller the number of individuals in the second generation, the higher the probability that we will have exactly the same allele frequency. This is only half the truth. That is, although the probability of maintaining p = 0.5  increases with a reduce number of individual, what changes is the variance around this frequency. Lets graph these results to have a better understanding of what I mean. Following is a little routine I wrote that will calculate a series of binomial probabilities and plots them against allele frequencies.

> k <- sort(c(2, 5, 10, 20, 50, 100, 500, 1000, 5000, 10000, 20000, 50000), decreasing = T)
> par(mfrow = c(3, 4))
> for (i in k) {
   y <- dbinom(0:i,i,0.5)
   x <- (seq(1,length(0:i)))/length(0:i)
   plot(x, y, xlim = c(0,1), col = "blue", ylab = "Probability", xlab = "Allele Frequency",
   main = paste("Probability Mass Function\n 0 < k <", i, ", p = 0.5"))
   abline(v = 0.5, lty = 2, lwd = 3,col = "red")

The output is this nice collection of bell curves. From top left to bottom right we are varying sample sizes.

K denotes number of successes

These graphs should tell us two very important things. First, no matter how big the sample, there will always be a randomly associated change in allele frequency. Second, the magnitude of this change (the width of the bell curve) grows with smaller sample sizes.

So, this is why genetic drift is supposedly considered a much stronger evolutionary agent in small populations, where the fluctuations of allele frequency is going to be rather large and could bring alleles to fixation or lost in a small number of generations.



Allele frequency and genotype manipulation in R

As I am trying to wrap my dissertation up I am going over all my data and analyses again and  again. No need to say that every time I look at numbers and how I generated them something new wrong or unexpected always pops up. The other day, for example, I was working with some of the genotype files that I am using in my analyses. I was trying to review allele frequencies and calculate again multi-locus-heterozygosity. Since I was doing all of this in RStudio I finally forced myself to write an R function that would the hardest part of the job for me, and save me time and headaches.

I borrowed some code from here and adapted it for my purposes. The function takes a data frame containing microsatellite genotype information as input and does 2 major things: first it calculates allele frequencies and produces a graphical output; second, it formats the data so that they can be processed by another package, Rhh, for multi-locus-heterozygosity. The output of this function are two files that will be added to your global environment. The first file, named OUT contains the allele frequencies. The second file, named MyDataMLHready contains the data formatted for Rhh (review the Rhh help file to run the MLH analysis).

Take a look at the actual function here. Copy the text and run it on your R or RStudio to load the function in your working environment. You can download some microsatellite genotyped data from here and test the code yourself. You will have to import the .txt file in R.

If the function doesn’t work or you have suggestions on how to make it better or add new functionalities don’t hesitate to write a comment.