I am trying to get some input on how to make my code better/faster.
I have been working with sf objects lately, and I have to do a bunch of operations to figure out whether certain points I collected are within certain shape files and then run some correlations.
My input points are organized in a list of sf objects. For every sf object there is a data frame with a number of variables describing characteristics of the various points.
In particular, there is one variable describing whether the points were collected within or outside a certain geographic area (“IN” or “OUT”).
Unfortunately, when a point was collected in the field I had no way to test whether I really was within a specific polygon or not and I marked “IN” or “OUT” based on my personal experience and knowledge of the field site.
So, what I am doing now is empirically testing whether the points that I collected in the field are actually within the shapefile/polygon of interest. I have already figured out all the code to do so, and I am posting here a mock example to pick your brain and get insight on how I could make it faster/better.
My real data points are over 3000, organized in some 25 sf objects within a list. My mock example will have only 90 data points organized in 2 sf objects within a list
First I create 2 fake data frames and I transform them in 2 sf objects with specific CRS.
df1 <- data.frame(id = c(paste("a_", 1:15, sep = ""), paste("b_", 1:15, sep = ""), paste("c_", 1:15, sep = "")), species = rep("blabla", 45), fondo = rep("pol_1", 45), in_out = sample(c("in", "out"), 45, replace = T), #randomly assigning in or out variable long = sample(1994764:1995265, 45, replace = T), lat = sample(215659:216563, 45, replace = T)) df_sf1 <- st_as_sf(df1, coords = c("long", "lat"), na.fail = FALSE, crs = "+proj=utm +zone=19 +datum=WGS84" ) df2 <- data.frame(id = c(paste("a_", 1:15, sep = ""), paste("b_", 1:15, sep = ""), paste("c_", 1:15, sep = "")), species = rep("blabla", 45), in_out = sample(c("in", "out"), 45, replace = T), long = sample(1994764:1995265, 45, replace = T), lat = sample(215659:216563, 45, replace = T)) df_sf2 <- st_as_sf(df2, coords = c("long", "lat"), na.fail = FALSE, crs = "+proj=utm +zone=19 +datum=WGS84" )
Then I sample some points from the 2 dfs to create a couple of fake polygons that will substitute the shape files I am using in my real analysis. They will be transformed to sf objects as well.
df_sf_polygon <- st_multipoint(st_coordinates(df_sf[c(29, 21, 27, 3, 38), ])) df_sf_polygon <- st_sf(geom = st_sfc(df_sf_polygon)) df_sf_polygon <- st_set_crs(df_sf_polygon, "+proj=utm +zone=19 +datum=WGS84") df_sf_polygon <- st_cast(df_sf_polygon, "POLYGON") df_sf_polygon2 <- st_multipoint(st_coordinates(df_sf2[c(3, 21, 29), ])) df_sf_polygon2 <- st_sf(geom = st_sfc(df_sf_polygon2)) df_sf_polygon2 <- st_set_crs(df_sf_polygon2, "+proj=utm +zone=19 +datum=WGS84") df_sf_polygon2 <- st_cast(df_sf_polygon2, "POLYGON")
I can plot the data so far to have a feel for what is going on.
tm_shape(df_sf1) + tm_dots(col = "red", size = 0.8) + tm_shape(df_sf2) + tm_dots(col = "green", size = 0.8) +) + tm_shape(sp_df_sf_polygon) + tm_borders() + tm_shape(sp_df_sf_polygon2) + tm_borders()

As you can probably see, there are numerous pints that fall outside of the two polygons that I have (remember, this is a fake example, so it doesn’t matter if the polygons aren’t perfect or if they intersect with each other). So now I want to test if the information in the column in_out jibe with what really happened in the field. To do so I have written a nested for{} loop with if{} and else_if{} statements (in this mock example I am only going to use a for{} and an if{} statements).
The loop checks whether every single point falls within a certain polygon/shapefile. If it does the character string “IN” is added to a new variable called in_out_nest_2, if it doesn’t “OUT” is added.
list_data_sf <- list(df_sf1 = df_sf1, df_sf2 = df_sf2) # organize sf objects in a list And here is the loop! for (x in 1:length(names(list_data_sf))) { if (grepl("df_sf1", names(list_data_sf[x]), fixed = T) == T) { for (y in 1:length(list_data_sf[[x]]$id)) { if (list_data_sf[[x]]$id[y] %in% list_data_sf[[x]][df_sf_polygon, ]$id){ list_data_sf[[x]]$in_out_nest_2[y] <- "IN" } else { list_data_sf[[x]]$in_out_nest_2[y] <- "OUT" } } } else { for (y in 1:length(list_data_sf[[x]]$id)) { if (list_data_sf[[x]]$id[y] %in% list_data_sf[[x]][df_sf_polygon2, ]$id){ list_data_sf[[x]]$in_out_nest_2[y] <- "IN" } else { list_data_sf[[x]]$in_out_nest_2[y] <- "OUT" } } } }
The code seem to be working just fine, even though, with the data that I will be collecting in the future, it could become really slow. I am open/interested in suggestions to make this code faster/better.
Thanks for all the help!
P.S. Below is my sessionInfo()
sessionInfo() #> R version 3.5.2 (2018-12-20) #> Platform: x86_64-apple-darwin15.6.0 (64-bit) #> Running under: macOS 10.15.5 #> #> Matrix products: default #> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib #> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib #> #> locale: #> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 #> #> attached base packages: #> [1] stats graphics grDevices utils datasets methods base #> #> #> other attached packages: #> [1] reprex_0.3.0 viridis_0.5.1 viridisLite_0.3.0 #> [4] vegan_2.5-6 permute_0.9-5 units_0.6-3 #> [7] forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 #> [10] readr_1.3.1 tidyverse_1.3.0 tidyr_1.1.0 #> [13] tibble_3.0.1 tmap_3.0 spatstat_1.64-1 #> [16] rpart_4.1-15 nlme_3.1-148 spatstat.data_1.4-3 #> [19] sf_0.9-4 scales_1.1.1 RMySQL_0.10.20 #> [22] DBI_1.1.0 Rmisc_1.5 rgdal_1.5-12 #> [25] readxl_1.3.1 plyr_1.8.6 moveVis_0.10.5 #> [28] lubridate_1.7.9 leaflet_2.0.3 lattice_0.20-41 #> [31] hms_0.5.3 hydroTSM_0.6-0 xts_0.12-0 #> [34] zoo_1.8-8 GISTools_0.7-4 rgeos_0.5-3 #> [37] MASS_7.3-51.6 RColorBrewer_1.1-2 maptools_1.0-1 #> [40] ggspatial_1.1.3 ggridges_0.5.2 ggrepel_0.8.2 #> [43] ggmap_3.0.0 ggplot2_3.3.2 dplyr_1.0.0 #> [46] dismo_1.1-4 raster_3.1-5 sp_1.4-2 #> [49] data.table_1.12.8 corrplot_0.84 colorspace_1.4-1 #> [52] chron_2.3-55 anytime_0.3.7 #> #> loaded via a namespace (and not attached): #> [1] compiler_3.5.2 magrittr_1.5 tools_3.5.2 htmltools_0.5.0 #> [5] yaml_2.2.1 stringi_1.4.6 rmarkdown_2.3 highr_0.8 #> [9] knitr_1.29 stringr_1.4.0 xfun_0.15 digest_0.6.25 #> [13] rlang_0.4.6 evaluate_0.14