
Nested for and else if loops in R to create a new variable. Is there a better way to do this?

I am trying to get some input on how to make my code better/faster.

I have been working with sf objects lately, and I have to perform a bunch of operations to figure out whether certain points I collected fall within certain shapefiles, and then run some correlations.

My input points are organized in a list of sf objects. For every sf object there is a data frame with a number of variables describing characteristics of the various points.

In particular, there is one variable describing whether the points were collected within or outside a certain geographic area (“IN” or “OUT”).

Unfortunately, when a point was collected in the field I had no way to test whether I really was within a specific polygon or not, so I marked “IN” or “OUT” based on my personal experience and knowledge of the field site.

So, what I am doing now is empirically testing whether the points that I collected in the field are actually within the shapefile/polygon of interest. I have already figured out all the code to do so, and I am posting here a mock example to pick your brain and get insight on how I could make it faster/better.

My real data set has over 3,000 points, organized in some 25 sf objects within a list. My mock example has only 90 data points organized in 2 sf objects within a list.

First I create 2 fake data frames and transform them into 2 sf objects with a specific CRS.

    library(sf)   # st_as_sf(), st_multipoint(), st_cast(), etc.

    set.seed(42)  # make the random sampling below reproducible

    df1 <- data.frame(id = c(paste("a_", 1:15, sep = ""),
                             paste("b_", 1:15, sep = ""),
                             paste("c_", 1:15, sep = "")),
                      species = rep("blabla", 45),
                      fondo = rep("pol_1", 45),
                      in_out = sample(c("in", "out"), 45, replace = TRUE), # randomly assign the in/out variable
                      long = sample(1994764:1995265, 45, replace = TRUE),
                      lat = sample(215659:216563, 45, replace = TRUE))

    df_sf1 <- st_as_sf(df1,
                       coords = c("long", "lat"),
                       na.fail = FALSE,
                       crs = "+proj=utm +zone=19 +datum=WGS84")

    df2 <- data.frame(id = c(paste("a_", 1:15, sep = ""),
                             paste("b_", 1:15, sep = ""),
                             paste("c_", 1:15, sep = "")),
                      species = rep("blabla", 45),
                      in_out = sample(c("in", "out"), 45, replace = TRUE),
                      long = sample(1994764:1995265, 45, replace = TRUE),
                      lat = sample(215659:216563, 45, replace = TRUE))

    df_sf2 <- st_as_sf(df2,
                       coords = c("long", "lat"),
                       na.fail = FALSE,
                       crs = "+proj=utm +zone=19 +datum=WGS84")

Then I sample some points from the 2 data frames to create a couple of fake polygons that will substitute for the shapefiles I am using in my real analysis. They will be transformed into sf objects as well.

    # polygon 1: a MULTIPOINT built from five points of df_sf1, cast to POLYGON
    df_sf_polygon <- st_multipoint(st_coordinates(df_sf1[c(29, 21, 27, 3, 38), ]))
    df_sf_polygon <- st_sf(geom = st_sfc(df_sf_polygon))
    df_sf_polygon <- st_set_crs(df_sf_polygon, "+proj=utm +zone=19 +datum=WGS84")
    df_sf_polygon <- st_cast(df_sf_polygon, "POLYGON")

    # polygon 2: same idea, from three points of df_sf2
    df_sf_polygon2 <- st_multipoint(st_coordinates(df_sf2[c(3, 21, 29), ]))
    df_sf_polygon2 <- st_sf(geom = st_sfc(df_sf_polygon2))
    df_sf_polygon2 <- st_set_crs(df_sf_polygon2, "+proj=utm +zone=19 +datum=WGS84")
    df_sf_polygon2 <- st_cast(df_sf_polygon2, "POLYGON")

I can plot the data so far to get a feel for what is going on.

    library(tmap)

    tm_shape(df_sf1) +
      tm_dots(col = "red", size = 0.8) +
      tm_shape(df_sf2) +
      tm_dots(col = "green", size = 0.8) +
      tm_shape(df_sf_polygon) +
      tm_borders() +
      tm_shape(df_sf_polygon2) +
      tm_borders()

As you can probably see, there are numerous points that fall outside the two polygons (remember, this is a fake example, so it doesn’t matter that the polygons aren’t perfect or that they intersect each other). So now I want to test whether the information in the in_out column jibes with what really happened in the field. To do so I have written a nested for loop with if and else if statements (in this mock example I am only going to use for and if/else statements).

The loop checks whether every single point falls within a certain polygon/shapefile. If it does, the character string “IN” is added to a new variable called in_out_nest_2; if it doesn’t, “OUT” is added.

    list_data_sf <- list(df_sf1 = df_sf1, df_sf2 = df_sf2) # organize sf objects in a list

And here is the loop!

    for (x in seq_along(list_data_sf)) {
      # the first branch handles df_sf1, which is tested against df_sf_polygon
      if (grepl("df_sf1", names(list_data_sf)[x], fixed = TRUE)) {
        for (y in seq_along(list_data_sf[[x]]$id)) {
          # spatial subsetting: list_data_sf[[x]][df_sf_polygon, ] keeps only
          # the points that intersect the polygon
          if (list_data_sf[[x]]$id[y] %in% list_data_sf[[x]][df_sf_polygon, ]$id) {
            list_data_sf[[x]]$in_out_nest_2[y] <- "IN"
          } else {
            list_data_sf[[x]]$in_out_nest_2[y] <- "OUT"
          }
        }
      } else {
        # everything else is tested against df_sf_polygon2
        for (y in seq_along(list_data_sf[[x]]$id)) {
          if (list_data_sf[[x]]$id[y] %in% list_data_sf[[x]][df_sf_polygon2, ]$id) {
            list_data_sf[[x]]$in_out_nest_2[y] <- "IN"
          } else {
            list_data_sf[[x]]$in_out_nest_2[y] <- "OUT"
          }
        }
      }
    }

The code seems to work just fine, even though, with the amount of data I will be collecting in the future, it could become really slow. I am open to and interested in suggestions for making this code faster/better.
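One direction I have been toying with, in case it helps the discussion, is dropping the loops in favor of sf’s vectorized st_intersects(). Below is only a sketch of the idea, not tested code: it assumes the polygons can be collected in a second list, polygons, paired with list_data_sf by name.

    # Sketch of a vectorized alternative (an idea, not my current code).
    # st_intersects() returns, for each point, the indices of the polygons it
    # falls in, so lengths(...) > 0 flags the points that are inside.
    polygons <- list(df_sf1 = df_sf_polygon, df_sf2 = df_sf_polygon2)

    list_data_sf <- mapply(function(pts, poly) {
      inside <- lengths(st_intersects(pts, poly)) > 0
      pts$in_out_nest_2 <- ifelse(inside, "IN", "OUT")
      pts
    }, list_data_sf, polygons[names(list_data_sf)], SIMPLIFY = FALSE)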

Thanks for all the help!

P.S. Below is my sessionInfo()

    sessionInfo()
    #> R version 3.5.2 (2018-12-20)
    #> Platform: x86_64-apple-darwin15.6.0 (64-bit)
    #> Running under: macOS  10.15.5
    #> 
    #> Matrix products: default
    #> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
    #> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
    #> 
    #> locale:
    #> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    #> 
    #> attached base packages:
    #> [1] stats     graphics  grDevices utils     datasets  methods   base     
    #> 
    #> other attached packages:
    #>  [1] reprex_0.3.0        viridis_0.5.1       viridisLite_0.3.0  
    #>  [4] vegan_2.5-6         permute_0.9-5       units_0.6-3        
    #>  [7] forcats_0.5.0       stringr_1.4.0       purrr_0.3.4        
    #> [10] readr_1.3.1         tidyverse_1.3.0     tidyr_1.1.0        
    #> [13] tibble_3.0.1        tmap_3.0            spatstat_1.64-1    
    #> [16] rpart_4.1-15        nlme_3.1-148        spatstat.data_1.4-3
    #> [19] sf_0.9-4            scales_1.1.1        RMySQL_0.10.20     
    #> [22] DBI_1.1.0           Rmisc_1.5           rgdal_1.5-12       
    #> [25] readxl_1.3.1        plyr_1.8.6          moveVis_0.10.5     
    #> [28] lubridate_1.7.9     leaflet_2.0.3       lattice_0.20-41    
    #> [31] hms_0.5.3           hydroTSM_0.6-0      xts_0.12-0         
    #> [34] zoo_1.8-8           GISTools_0.7-4      rgeos_0.5-3        
    #> [37] MASS_7.3-51.6       RColorBrewer_1.1-2  maptools_1.0-1     
    #> [40] ggspatial_1.1.3     ggridges_0.5.2      ggrepel_0.8.2      
    #> [43] ggmap_3.0.0         ggplot2_3.3.2       dplyr_1.0.0        
    #> [46] dismo_1.1-4         raster_3.1-5        sp_1.4-2           
    #> [49] data.table_1.12.8   corrplot_0.84       colorspace_1.4-1   
    #> [52] chron_2.3-55        anytime_0.3.7
    #>      
    #> loaded via a namespace (and not attached):
    #>  [1] compiler_3.5.2  magrittr_1.5    tools_3.5.2     htmltools_0.5.0
    #>  [5] yaml_2.2.1      stringi_1.4.6   rmarkdown_2.3   highr_0.8      
    #>  [9] knitr_1.29      stringr_1.4.0   xfun_0.15       digest_0.6.25  
    #> [13] rlang_0.4.6     evaluate_0.14

Allele frequency and genotype manipulation in R

As I am trying to wrap up my dissertation I am going over all my data and analyses again and again. Needless to say, every time I look at the numbers and how I generated them, something wrong or unexpected pops up. The other day, for example, I was working with some of the genotype files that I am using in my analyses. I was trying to review allele frequencies and recalculate multi-locus heterozygosity. Since I was doing all of this in RStudio, I finally forced myself to write an R function that would do the hardest part of the job for me and save me time and headaches.

I borrowed some code from here and adapted it for my purposes. The function takes a data frame containing microsatellite genotype information as input and does 2 major things: first, it calculates allele frequencies and produces a graphical output; second, it formats the data so that they can be processed by another package, Rhh, for multi-locus heterozygosity. The function adds two objects to your global environment. The first, named OUT, contains the allele frequencies. The second, named MyDataMLHready, contains the data formatted for Rhh (review the Rhh help file to run the MLH analysis).
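To give a flavor of the allele-frequency step, here is a minimal sketch of how it can be computed. This is not the actual function: the data frame geno_df and the two-columns-per-locus layout (locus1_a/locus1_b, with NA for missing alleles) are assumptions made up for the example.

    # Minimal sketch: allele frequencies per locus, assuming two columns per
    # locus (e.g. locus1_a / locus1_b) and NA for missing alleles.
    allele_freqs <- function(geno_df, loci) {
      freqs <- lapply(loci, function(locus) {
        alleles <- c(geno_df[[paste0(locus, "_a")]],
                     geno_df[[paste0(locus, "_b")]])
        alleles <- alleles[!is.na(alleles)]
        table(alleles) / length(alleles)  # relative frequency of each allele
      })
      names(freqs) <- loci
      freqs
    }

    # Made-up example: frequencies at one locus, shown as a bar chart
    geno_df <- data.frame(locus1_a = c(120, 122, 120, 124),
                          locus1_b = c(122, 122, 124, 120))
    barplot(allele_freqs(geno_df, "locus1")$locus1, main = "locus1")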

Take a look at the actual function here. Copy the text and run it in R or RStudio to load the function into your working environment. You can download some microsatellite genotype data from here and test the code yourself. You will have to import the .txt file into R.
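For the import, something along these lines should work (the file name and the tab separator are assumptions; adjust them to the actual download):

    # Assumed file name and separator; adjust to the downloaded file
    MyData <- read.table("genotypes.txt", header = TRUE, sep = "\t",
                         stringsAsFactors = FALSE)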

If the function doesn’t work, or you have suggestions on how to make it better or add new functionality, don’t hesitate to write a comment.


It could be worse: your external hard drive could fail you…

Yes, this is exactly what it sounds like! A disaster!! Actually not entirely, but it is still pretty m***********g annoying. I have always been extra careful with my portable external hard drive. I have always made sure to eject it before unplugging it. I have always kept it dry and safe. I never dropped it, except 3 months ago, on my bed (meaning that it landed on a very comfortable layer of clothes that were sitting there waiting to be folded). And still, last night I went to connect it and do my routine daily backup, and that little f****r wouldn’t show up on my desktop. I tried CPR, mouth-to-mouth, Disk Utility, disk repair…nothing. It wouldn’t come back to life. And it is pretty new (bought in December 2014), and from that famous brand that starts with “S” and ends in “amsung”. There were absolutely no signs that this tragedy was coming. And yet, in the very moment when your bank account is screaming because you just paid your university tuition and the rent…in that part of your career as a graduate student when you need all the data you have been accumulating over the years because you are writing your dissertation…yes, when you really don’t need a shit like that, it will happen.

The only good thing is that I am not entirely new to this kind of situation. That is why I try to organize all my data so that the vital stuff is always backed up in the cloud (god bless Dropbox). Now I just have to hope that nothing happens to my computer in the next couple of days, until a new, hopefully more reliable, hard drive shows up.