Tag Archives: coding

Nested for and if/else loops in R to create a new variable. Is there a better way to do this?

I am trying to get some input on how to make my code better/faster.

I have been working with sf objects lately, and I have to do a bunch of operations to figure out whether certain points I collected fall within certain shapefiles, and then run some correlations.

My input points are organized in a list of sf objects. For every sf object there is a data frame with a number of variables describing characteristics of the various points.

In particular, there is one variable describing whether the points were collected within or outside a certain geographic area (“IN” or “OUT”).

Unfortunately, when a point was collected in the field I had no way to test whether it really fell within a specific polygon, so I marked it “IN” or “OUT” based on my personal experience and knowledge of the field site.

So, what I am doing now is empirically testing whether the points that I collected in the field are actually within the shapefile/polygon of interest. I have already figured out all the code to do so, and I am posting here a mock example to pick your brain and get insight on how I could make it faster/better.

My real dataset has over 3,000 points, organized in some 25 sf objects within a list. My mock example has only 90 data points organized in 2 sf objects within a list.

First I create 2 fake data frames and transform them into 2 sf objects with a specific CRS.

    df1 <- data.frame(id = c(paste("a_", 1:15, sep = ""), 
                            paste("b_", 1:15, sep = ""), 
                            paste("c_", 1:15, sep = "")),
                     species = rep("blabla", 45),
                     fondo = rep("pol_1", 45),
                     in_out = sample(c("in", "out"), 45, replace = T), #randomly assigning in or out variable
                     long = sample(1994764:1995265, 45, replace = T),
                     lat = sample(215659:216563, 45, replace = T))

    df_sf1 <- st_as_sf(df1,
             coords = c("long", "lat"),
             na.fail = FALSE,
             crs = "+proj=utm +zone=19 +datum=WGS84"
    )
    df2 <- data.frame(id = c(paste("a_", 1:15, sep = ""), 
                             paste("b_", 1:15, sep = ""), 
                             paste("c_", 1:15, sep = "")),
                      species = rep("blabla", 45),
                      in_out = sample(c("in", "out"), 45, replace = T),
                      long = sample(1994764:1995265, 45, replace = T),
                      lat = sample(215659:216563, 45, replace = T))

    df_sf2 <- st_as_sf(df2,
                       coords = c("long", "lat"),
                       na.fail = FALSE,
                       crs = "+proj=utm +zone=19 +datum=WGS84"
    )

Then I sample some points from the 2 data frames to create a couple of fake polygons that will stand in for the shapefiles I am using in my real analysis. They are transformed into sf objects as well.

    df_sf_polygon <- st_multipoint(st_coordinates(df_sf1[c(29, 21, 27, 3, 38), ]))
    df_sf_polygon <- st_sf(geom = st_sfc(df_sf_polygon))
    df_sf_polygon <- st_set_crs(df_sf_polygon, "+proj=utm +zone=19 +datum=WGS84")
    df_sf_polygon <- st_cast(df_sf_polygon, "POLYGON")

    df_sf_polygon2 <- st_multipoint(st_coordinates(df_sf2[c(3, 21, 29), ]))
    df_sf_polygon2 <- st_sf(geom = st_sfc(df_sf_polygon2))
    df_sf_polygon2 <- st_set_crs(df_sf_polygon2, "+proj=utm +zone=19 +datum=WGS84")
    df_sf_polygon2 <- st_cast(df_sf_polygon2, "POLYGON")

I can plot the data so far to get a feel for what is going on.

    tm_shape(df_sf1) +
      tm_dots(col = "red",
              size = 0.8) +
      tm_shape(df_sf2) +
      tm_dots(col = "green",
              size = 0.8) +
      tm_shape(df_sf_polygon) +
      tm_borders() +
      tm_shape(df_sf_polygon2) +
      tm_borders()

As you can probably see, there are numerous points that fall outside of the two polygons that I have (remember, this is a fake example, so it doesn’t matter if the polygons aren’t perfect or if they intersect with each other). So now I want to test whether the information in the column in_out jibes with what really happened in the field. To do so I have written a nested for loop with if and else if statements (in this mock example I am only going to use a for loop and an if/else statement).

The loop checks whether every single point falls within a certain polygon/shapefile. If it does, the character string “IN” is added to a new variable called in_out_nest_2; if it doesn’t, “OUT” is added.

    list_data_sf <- list(df_sf1 = df_sf1, df_sf2 = df_sf2) # organize sf objects in a list

And here is the loop!

    for (x in 1:length(names(list_data_sf))) {
      if (grepl("df_sf1", names(list_data_sf[x]), fixed = T) == T) {
        for (y in 1:length(list_data_sf[[x]]$id)) {
          if (list_data_sf[[x]]$id[y] %in% list_data_sf[[x]][df_sf_polygon, ]$id){
            list_data_sf[[x]]$in_out_nest_2[y] <- "IN"
          } else {
            list_data_sf[[x]]$in_out_nest_2[y] <- "OUT"
          }
        }
      } else {
        for (y in 1:length(list_data_sf[[x]]$id)) {
          if (list_data_sf[[x]]$id[y] %in% list_data_sf[[x]][df_sf_polygon2, ]$id){
            list_data_sf[[x]]$in_out_nest_2[y] <- "IN"
          } else {
            list_data_sf[[x]]$in_out_nest_2[y] <- "OUT"
          }
        }
      }
    }

The code seems to be working just fine, even though, with the data that I will be collecting in the future, it could become really slow. I am open to and interested in suggestions to make this code faster/better.
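For reference, here is one vectorized alternative (a sketch, not guaranteed faster on every dataset): sf's st_intersects() tests all points against a polygon at once, so the inner loop and the if/else disappear. The function name flag_in_out is mine, not from any package.

```r
library(sf)

# Vectorized point-in-polygon test: st_intersects() returns, for each
# point, the (possibly empty) set of polygon features it falls within.
flag_in_out <- function(points_sf, polygon_sf) {
  hits <- lengths(st_intersects(points_sf, polygon_sf)) > 0
  points_sf$in_out_nest_2 <- ifelse(hits, "IN", "OUT")
  points_sf
}
```

With the mock objects above, the whole nested loop then collapses to pairing each sf object with its polygon, e.g. `list_data_sf <- Map(flag_in_out, list_data_sf, list(df_sf_polygon, df_sf_polygon2))`.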

Thanks for all the help!

P.S. Below is my sessionInfo()

    sessionInfo()
    #> R version 3.5.2 (2018-12-20)
    #> Platform: x86_64-apple-darwin15.6.0 (64-bit)
    #> Running under: macOS  10.15.5
    #> 
    #> Matrix products: default
    #> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
    #> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
    #> 
    #> locale:
    #> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    #> 
    #> attached base packages:
    #> [1] stats     graphics  grDevices utils     datasets  methods   base     
    #> 
    #>
    #> other attached packages:
    #>  [1] reprex_0.3.0        viridis_0.5.1       viridisLite_0.3.0  
    #>  [4] vegan_2.5-6         permute_0.9-5       units_0.6-3        
    #>  [7] forcats_0.5.0       stringr_1.4.0       purrr_0.3.4        
    #> [10] readr_1.3.1         tidyverse_1.3.0     tidyr_1.1.0        
    #> [13] tibble_3.0.1        tmap_3.0            spatstat_1.64-1    
    #> [16] rpart_4.1-15        nlme_3.1-148        spatstat.data_1.4-3
    #> [19] sf_0.9-4            scales_1.1.1        RMySQL_0.10.20     
    #> [22] DBI_1.1.0           Rmisc_1.5           rgdal_1.5-12       
    #> [25] readxl_1.3.1        plyr_1.8.6          moveVis_0.10.5     
    #> [28] lubridate_1.7.9     leaflet_2.0.3       lattice_0.20-41    
    #> [31] hms_0.5.3           hydroTSM_0.6-0      xts_0.12-0         
    #> [34] zoo_1.8-8           GISTools_0.7-4      rgeos_0.5-3        
    #> [37] MASS_7.3-51.6       RColorBrewer_1.1-2  maptools_1.0-1     
    #> [40] ggspatial_1.1.3     ggridges_0.5.2      ggrepel_0.8.2      
    #> [43] ggmap_3.0.0         ggplot2_3.3.2       dplyr_1.0.0        
    #> [46] dismo_1.1-4         raster_3.1-5        sp_1.4-2           
    #> [49] data.table_1.12.8   corrplot_0.84       colorspace_1.4-1   
    #> [52] chron_2.3-55        anytime_0.3.7
    #>      
    #> loaded via a namespace (and not attached):
    #>  [1] compiler_3.5.2  magrittr_1.5    tools_3.5.2     htmltools_0.5.0
    #>  [5] yaml_2.2.1      stringi_1.4.6   rmarkdown_2.3   highr_0.8      
    #>  [9] knitr_1.29      stringr_1.4.0   xfun_0.15       digest_0.6.25  
    #> [13] rlang_0.4.6     evaluate_0.14

Does anybody know how to deal with UIPickerView objects?

Hello everyone.
I am writing a small iOS app that I hope would facilitate my field work. I am on OS X El Capitan 10.11.4, using Xcode 7.3 and writing the app for iOS 9.3.

Here is the whole idea: instead of writing the data on a piece of paper, I want to enter everything into a basic form-like app to be used on an iPad, and then have the app transfer the data to a .csv file.

I have pretty much everything figured out (the structure of the app, the protocol to transfer and retrieve data, etc.), but there is one thing that is driving me crazy. Before I show you the code that I am using, here is the problem. When I click the Save button and later go to inspect the saved data, I cannot get the values selected by the UIPickerView objects to be transferred to the .csv file. Everything else works fine. I have two UIPickerView objects: one holds sex information, coded as M, F, and M/F; the other holds age information, coded as Adu, Juv, and Hat. What I would like is for the app to save whichever value is selected, but I couldn’t quite figure that out.

Any help is much appreciated. Thanks.

Here is the code that I am using for my ViewController.h

#import <UIKit/UIKit.h>

@interface ViewController : UIViewController<UIPickerViewDataSource, UIPickerViewDelegate>
@property (strong, nonatomic) IBOutlet UIDatePicker *day;
@property (strong, nonatomic) IBOutlet UITextField *species;
@property (strong, nonatomic) IBOutlet UITextField *island;
@property (strong, nonatomic) IBOutlet UITextField *cay;
@property (strong, nonatomic) IBOutlet UITextField *pit;
@property (strong, nonatomic) IBOutlet UITextField *coordinatesN;
@property (strong, nonatomic) IBOutlet UITextField *coordinatesW;
@property (strong, nonatomic) IBOutlet UIPickerView *sex;
@property (strong, nonatomic) IBOutlet UIPickerView *age;

- (IBAction)saveInfo:(id)sender;
- (IBAction)retrieveInfo:(id)sender;
- (IBAction)retractKeyboard:(id)sender;

@property (strong, nonatomic) IBOutlet UITextView *resultView;

@end

And this is the code that I am using for my ViewController.m


#import "ViewController.h"

@interface ViewController ()
{
NSArray *pickerDataSex;
NSArray *pickerDataAge;
}
@end

@implementation ViewController

@synthesize resultView;
@synthesize day;
@synthesize species;
@synthesize island;
@synthesize cay;
@synthesize pit;
@synthesize coordinatesW;
@synthesize coordinatesN;
@synthesize sex;
@synthesize age;

- (IBAction)retractKeyboard:(id)sender {
[self resignFirstResponder];
}

- (IBAction)saveInfo:(id)sender {
NSString *resultLine=[NSString stringWithFormat:@"%@,%@,%@,%@,%@,%@,%@,%@,%@\n",
self.day.date,
self.species.text,
self.island.text,
self.cay.text,
self.pit.text,
self.coordinatesN.text,
self.coordinatesW.text,
self.sex.dataSource,
self.age.dataSource];
NSString *docPath =[NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES )objectAtIndex:0];

//resultView.text = docPath;}

NSString *surveys=[docPath stringByAppendingPathComponent:@"results.csv"];

if (![[NSFileManager defaultManager] fileExistsAtPath:surveys]) {
[[NSFileManager defaultManager] createFileAtPath:surveys contents:nil attributes:nil];
}
NSFileHandle *fileHandle = [NSFileHandle fileHandleForUpdatingAtPath:surveys];
[fileHandle seekToEndOfFile];
[fileHandle writeData:[resultLine dataUsingEncoding:NSUTF8StringEncoding]];
[fileHandle closeFile];
self.species.text=@"";
self.island.text=@"";
self.cay.text=@"";
self.pit.text=@"";
self.coordinatesN.text=@"";
self.coordinatesW.text=@"";
NSLog(@"info saved");
}

- (IBAction)retrieveInfo:(id)sender {
NSString *docPath =[NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES )objectAtIndex:0];
//resultView.text = docPath;
NSString *surveys=[docPath stringByAppendingPathComponent:@"results.csv"];

if ([[NSFileManager defaultManager] fileExistsAtPath: surveys]) {
NSFileHandle *fileHandle = [NSFileHandle fileHandleForReadingAtPath:surveys];
NSString *surveyResults = [[NSString alloc]initWithData:[fileHandle availableData] encoding:NSUTF8StringEncoding];
[fileHandle closeFile];
self.resultView.text = surveyResults;
}

}

- (void)viewDidLoad {
[super viewDidLoad];
// Do any additional setup after loading the view, typically from a nib.

//Initialize picker data
pickerDataSex = @[@"M",@"F",@"M/F"];
pickerDataAge = @[@"Adu",@"Juv",@"Hat"];

//Connect data to picker
self.sex.dataSource = self;
self.sex.delegate = self;

self.age.dataSource = self;
self.age.delegate = self;

}

- (void)didReceiveMemoryWarning {
[super didReceiveMemoryWarning];
// Dispose of any resources that can be recreated.
}

//Number of columns of the data
- (NSInteger)numberOfComponentsInPickerView: (UIPickerView *)pickerView
{
if (pickerView==self.sex){
return 1;
} else if (pickerView==self.age){
return 1;
}

return 0;
}

//Number of rows of data
- (NSInteger)pickerView:(UIPickerView *)pickerView numberOfRowsInComponent:(NSInteger)component
{
if (pickerView==self.sex){
return pickerDataSex.count;
} else if (pickerView==self.age){
return pickerDataAge.count;
}
return 0;
}

//The data to return for the row and component (column) that's being passed
-(NSString*)pickerView:(UIPickerView *)pickerView titleForRow:(NSInteger)row forComponent:(NSInteger)component
{
if (pickerView==self.sex){
return pickerDataSex[row];
} else if (pickerView==self.age){
return pickerDataAge[row];
}
return nil;
}

@end
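For completeness, the usual pattern for reading a picker's current value (a sketch; selectedRowInComponent: is the standard UIPickerView method, the rest mirrors the property and array names above) would replace the self.sex.dataSource / self.age.dataSource arguments in saveInfo: with something like:

```objc
// Read the currently selected row of each picker and look up its title.
// pickerDataSex / pickerDataAge are the arrays defined in viewDidLoad.
NSString *selectedSex = pickerDataSex[[self.sex selectedRowInComponent:0]];
NSString *selectedAge = pickerDataAge[[self.age selectedRowInComponent:0]];
// ...then pass selectedSex / selectedAge to stringWithFormat: instead of
// the dataSource objects, which print as object descriptions, not values.
```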

Understanding Genetic Drift with the help of R.

A long-standing debate among evolutionary biologists concerns the contribution of genetic drift to evolving populations. Fisher and Wright were the first scientists to clash on this topic: the former favored selection as the major engine of population evolution, while the latter argued that genetic drift could have a paramount effect, especially in small populations.

What is genetic drift? Genetic drift is the random loss of genetic variability within populations, generation after generation. By “random” I mean that it is impossible (or almost impossible) to predict the direction of this process (i.e. whether an allele will increase or decrease in frequency). This, though, doesn’t prevent us from trying to quantify the effect of genetic drift on allele frequency. As we shall see, the size of the effect of genetic drift depends strongly on population size.

Let’s assume that we have a very large population. Individuals have been genotyped at a di-allelic locus (A and a). It turns out that the frequency of A is p = 0.5 and the frequency of a is, hence, 1 – p. One generation goes by. How big a change in allele frequency should we expect to see in this new generation? Let’s try to work this out, keeping in mind that we are only looking at genetic drift and not considering any other evolutionary mechanism such as selection or the emergence of new mutations.

A different way to ask the above question is: what is the probability of obtaining the exact same allele frequency in the new generation if we sample 2N gametes? (The factor of two reflects that we are working with a diploid species.)

For example, if we extract 50000 gametes, we need exactly 25000 of them to be A in order to maintain p = 0.5. This is a classical binomial setting in that we only have two possible outcomes (allele A and a), the probability of each outcome stays the same every time we extract a gamete (we allow replacement), and every gamete extraction is independent of the next one. So, what is the probability that, extracting 50000 gametes, we get exactly 25000 successes, i.e. 25000 copies of allele A? In R this is actually pretty easy. The following code calculates the binomial probability of having exactly 25000 successes over 50000 trials, given a success probability of 0.5.


> dbinom(25000, 50000, 0.5)
[1] 0.00356823

Well, that probability is rather small. Would this change if we sampled fewer individuals? We can try with ten, in which case we would need 5 A alleles to maintain p = 0.5.


> dbinom(5, 10, 0.5)
[1] 0.2460938

According to the above results it seems that the smaller the number of individuals in the second generation, the higher the probability that we will have exactly the same allele frequency. This is only half the truth. That is, although the probability of maintaining p = 0.5 increases with a reduced number of individuals, what changes is the variance around this frequency. Let’s graph these results to get a better understanding of what I mean. Following is a little routine I wrote that calculates a series of binomial probabilities and plots them against allele frequencies.


> k <- sort(c(2, 5, 10, 20, 50, 100, 500, 1000, 5000, 10000, 20000, 50000), decreasing = T)
> par(mfrow = c(3, 4))
> for (i in k) {
   y <- dbinom(0:i, i, 0.5)
   x <- (0:i) / i  # allele frequency corresponding to 0..i copies of A
   plot(x, y, xlim = c(0, 1), col = "blue", ylab = "Probability", xlab = "Allele Frequency",
        main = paste("Probability Mass Function\n 0 < k <", i, ", p = 0.5"))
   abline(v = 0.5, lty = 2, lwd = 3, col = "red")
 }

The output is this nice collection of bell curves. From top left to bottom right the sample size decreases.

[Figure: GeneticDrift_BinomialSampleSize; k denotes the number of successes]

These graphs should tell us two very important things. First, no matter how big the sample, there will always be some random change in allele frequency. Second, the magnitude of this change (the width of the bell curve) grows as the sample size shrinks.
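The width of those curves can also be put into numbers: for a binomial proportion, the standard deviation of the new allele frequency is sqrt(p(1 - p) / 2N). A quick sketch in base R (the function name drift_sd is mine):

```r
# Standard deviation of the allele frequency after sampling 2N gametes:
# it shrinks as the number of gametes grows.
drift_sd <- function(N, p = 0.5) sqrt(p * (1 - p) / (2 * N))

for (N in c(5, 50, 5000)) {
  cat("2N =", 2 * N, " SD of new p =", drift_sd(N), "\n")
}
```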

So, this is why genetic drift is considered a much stronger evolutionary agent in small populations, where the fluctuations in allele frequency are going to be rather large and could bring alleles to fixation or loss in a small number of generations.
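To see fixation and loss actually happen, one can iterate the binomial sampling over generations (a minimal sketch in base R; the function name simulate_drift is mine):

```r
set.seed(42)  # reproducible example

# Track allele frequency p over generations: each generation, the new
# frequency is the realized proportion of A among 2N sampled gametes.
simulate_drift <- function(N, p0 = 0.5, generations = 100) {
  p <- numeric(generations + 1)
  p[1] <- p0
  for (g in 1:generations) {
    p[g + 1] <- rbinom(1, 2 * N, p[g]) / (2 * N)
  }
  p
}

small <- simulate_drift(N = 10)    # wide fluctuations, often hits 0 or 1
large <- simulate_drift(N = 5000)  # stays close to p = 0.5
```

Plotting `small` and `large` against generation number reproduces, over time, the same message as the bell curves above.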

Allele frequency and genotype manipulation in R

As I am trying to wrap my dissertation up I am going over all my data and analyses again and again. Needless to say, every time I look at the numbers and how I generated them, something wrong or unexpected pops up. The other day, for example, I was working with some of the genotype files that I am using in my analyses. I was trying to review allele frequencies and recalculate multi-locus heterozygosity. Since I was doing all of this in RStudio, I finally forced myself to write an R function that would do the hardest part of the job for me and save me time and headaches.

I borrowed some code from here and adapted it for my purposes. The function takes a data frame containing microsatellite genotype information as input and does 2 major things: first, it calculates allele frequencies and produces a graphical output; second, it formats the data so that they can be processed by another package, Rhh, for multi-locus heterozygosity. The output of this function is two objects that will be added to your global environment. The first, named OUT, contains the allele frequencies. The second, named MyDataMLHready, contains the data formatted for Rhh (review the Rhh help file to run the MLH analysis).

Take a look at the actual function here. Copy the text and run it in R or RStudio to load the function into your working environment. You can download some microsatellite genotype data from here and test the code yourself. You will have to import the .txt file into R.

If the function doesn’t work or you have suggestions on how to make it better or add new functionalities don’t hesitate to write a comment.
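As a rough illustration of the allele-frequency step (a sketch with made-up column names, not the actual function linked above): with one locus stored as two columns of allele sizes, base R's table() does most of the work.

```r
# Hypothetical two-column genotype layout for one microsatellite locus
geno <- data.frame(loc1_a = c(120, 120, 124, 126),
                   loc1_b = c(124, 120, 126, 126))

# Pool both allele copies, then convert counts to relative frequencies
alleles <- c(geno$loc1_a, geno$loc1_b)
freqs <- table(alleles) / length(alleles)
```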

Bash script for genomic NGS data analysis

This script was designed with TextWrangler and tested on Mac OS X Yosemite 10.10.4. The script ties together the I/O of 5 different pieces of software for MiSeq paired-end read analysis: FastQC, Trimmomatic-0.33, seqtk, Abyss-1.9.0 and Quast-3.1. In order for the script to work, these programs have to be installed locally, and their commands should be runnable from any directory on your computer. There is no guarantee that the script is going to produce the desired results, so use it at your own risk. Download the script and place it in a folder of your choice, open the Terminal application, and type at the prompt: ./ngs_miseq_analysis.sh. Follow the instructions as you go along. Note that unless the files to be analyzed are in the same folder as this script, you will have to specify the full path.

#!/bin/bash

##########################################################################################
##########################################################################################
# This script has been designed with TextWrangler and tested on a MacOSX Yosemite 10.10.4
#
# The script ties together the I/O of 5 different pieces of software for
# MiSeq paired end reads analysis: FastQC, Trimmomatic-0.33, seqtk, Abyss-1.9.0 and
# Quast-3.1.
#
# In order for the script to work the above softwares have to be installed
# locally, and the relative commands should be able to run
# from any directory on your computer.
#
# There is no guarantee that the script is going to produce the desired results so you
# should use it at your own risk.
#
# Download the script and place it in a folder of your choice. Open Terminal application
# and type at the prompt: ./ngs_miseq_analysis.sh. Follow the instructions as you go
# along.
#
# Note that unless the files to be analyzed are in the same folder of this script you
# will have to specify the full path.
##########################################################################################
##########################################################################################

echo "Type the name of your fwd .fastq.gz followed by [ENTER]:"

read filename1

echo
echo "Checking..."
echo
if [ -f "$filename1" ]
then
echo "The fwd file exists."
else
echo "Sorry I could not find this file! I quit!"
exit 1
fi
echo
echo "Type the name of your rev .fastq.gz followed by [ENTER]:"

read filename2
echo
echo "Checking..."
echo
if [ -f "$filename2" ]
then
echo "The rev file exists."
else
echo "Sorry I could not find this file! I quit!"
exit 1
fi
echo
echo "Now I start my analysis with FastQC."
echo

fastqc "$filename1" "$filename2"

open *.html
echo

echo "Check your FastQC .html report please."
read -p "Can I go ahead with Trimmomatic? ([y]es or [n]o)" -n 1 -r
echo
echo
if [[ $REPLY =~ ^[Yy]$ ]]
then
java -jar /Applications/Trimmomatic-0.33/trimmomatic-0.33.jar PE -phred33 "$filename1" "$filename2" out1p.fastq.gz out1u.fastq.gz out2p.fastq.gz out2u.fastq.gz ILLUMINACLIP:/Applications/Trimmomatic-0.33/adapters/adapt.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:10:30 MINLEN:40
rm out*u.fastq.gz
else
echo "Sorry about that, I quit!"
exit 1
fi

echo "The trimming and cleaning steps are done."
echo "I will run again FastQC on the trimmed files"
echo
fastqc out1p.fastq.gz out2p.fastq.gz
echo
echo "Check your FastQC .html report please."

read -p "Do you think the sequences are good enough now? ([y]es or [n]o)" -n 1 -r
echo
echo
if [[ $REPLY =~ ^[Yy]$ ]]
then
echo "Now I will randomly subsample 100000 reads from your files."
seqtk sample -s100 out1p.fastq.gz 100000 > out1p_samp100K.fq
seqtk sample -s100 out2p.fastq.gz 100000 > out2p_samp100K.fq
else
echo "Sorry about that, I quit!"
exit 1
fi

echo
echo "The subsampling is done. Let me assemble the mtDNA genome for you!"
echo
export k
for k in {45..96}; do
mkdir k$k
abyss-pe -C k$k name=mtDNA_species in='../out1p_samp100K.fq ../out2p_samp100K.fq'
done
abyss-fac k*/mtDNA_species-contigs.fa
echo
echo "I have built a bunch of genomes using different k-mers. Now let me evaluate with Quast which one is the best."

python /Applications/quast-3.1/quast.py -o quast_results k*/mtDNA_species-contigs.fa

open ./quast_results/report.html