JD Long

Reputation: 60756

Larger-than-memory operations: Spatial joins with R

I have 300 million points I want to intersect with 60 million polygons. The combination of the two is larger than I can easily fit into memory on my machine. I have spiked out a solution where I load each dataset into PostGIS, build a spatial index on each, then perform the spatial join.

In PostGIS that looks like:

SELECT pts.*, grid.gridID
INTO test_join
FROM pts, grid
WHERE ST_Contains(grid.geometry, pts.geometry);

The spatial index on pts (300 million points) takes about 90 minutes. Then the join above takes ~190 minutes.
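
For reference, the indexing step is just a GiST index on each geometry column. A minimal sketch of driving that step from R via DBI could look like this (the connection details are placeholders; the table and column names are taken from the query above):

library(DBI)

# placeholder connection details -- adjust to your own database
con <- dbConnect(RPostgres::Postgres(), dbname = "gisdb", host = "localhost")

# GiST spatial index on each geometry column, then refresh planner statistics
dbExecute(con, "CREATE INDEX IF NOT EXISTS pts_geom_gist  ON pts  USING GIST (geometry);")
dbExecute(con, "CREATE INDEX IF NOT EXISTS grid_geom_gist ON grid USING GIST (geometry);")
dbExecute(con, "ANALYZE pts;")
dbExecute(con, "ANALYZE grid;")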

I have never dealt with larger-than-RAM spatial data in R before. My questions:

  1. Are there ways of dealing with this scale of data using the sf package in R?
  2. What strategies exist for speeding up this operation?
  3. Should I be considering other tools or approaches?

My preference is to stay with open-source tools (R, PostGIS, Python, etc.), but I am not committed to any particular tool chain.

Additional Data: It seems that my not illustrating a specific solution has caused some confusion. The reason I did not initially provide any syntax or examples is that I am not wedded to a specific platform; I'm open to ideas using any open source stack. As the title says, and as I reiterate in the text, the issue here is scale, not the syntax for solving a trivial example.

Here is a very specific solution using the sf package in R. The example below uses a US grid of 500 km squares and 1,000 random points. I'd like to scale this up to sub-1 km grids and 300,000,000 points. I don't care about plotting at all, but I plot a few things below for illustration only.

library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
library(tidyverse)
library(spData)
#> To access larger datasets in this package, install the spDataLarge
#> package with: `install.packages('spDataLarge',
#> repos='https://nowosad.github.io/drat/', type='source')`

# size of squares in projection units (in this case meters)
grid_size <- 500000
num_pts <- 1000   # number of points to join

data(us_states) # loads the us_states shape

all_states <-
  us_states %>%
  # st_sf() %>%
  st_transform(102003) %>% # project to a meters-based projection
  st_combine() %>%         # flatten the shapes to one big outline (no states)
  st_buffer(10000)         # add a 10 km buffer

# a nice outer buffer of the US lower 48
ggplot() +
  geom_sf(data = all_states)

## let's put a grid over the whole US
state_box <- st_bbox(all_states)
xrange <- state_box$xmax - state_box$xmin
yrange <- state_box$ymax - state_box$ymin

cell_dim <-
  c(ceiling(xrange / grid_size),
    ceiling(yrange / grid_size)) # dimension of polygons necessary

full_us_grid <-
  st_make_grid(all_states, square = TRUE, n = cell_dim) %>%
  st_intersection(all_states) %>% # only the inside part
  st_sf() %>%
  mutate(grid_id = 1:n())

ggplot() +
  geom_sf(data = full_us_grid)

## now let's create some random points
random_pts <- data.frame(
  point_id = 1:num_pts,
  lat = runif(num_pts, 30, 50),
  lon = runif(num_pts, -117, -78)
) %>%
  # these are in degrees, so the CRS needs to match
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  st_transform(102003)  # transform into our metric crs

ggplot() +
  geom_sf(data = full_us_grid) +
  geom_sf(data = random_pts)

## here is the spatial join!!
joined_data <-
  full_us_grid %>%
  st_join(random_pts)

## this is the mapping from grid_id to point_id
joined_data %>%
  st_set_geometry(NULL) %>%
  na.omit()  %>%
  head
#>     grid_id point_id
#> 7         7       26
#> 7.1       7      322
#> 7.2       7      516
#> 7.3       7      561
#> 7.4       7      641
#> 7.5       7      680

Created on 2018-12-24 by the reprex package (v0.2.1)

Upvotes: 2

Views: 1400

Answers (2)

lbusett

Reputation: 5932

In this particular case (finding which points lie within rectangular cells) you can get both a speed boost and a reduction in memory requirements by building a QuadTree with function createTree in package SearchTrees and then looking for points-in-cell with its rectLookup function. This way you spare memory (no need to build a polygon grid) and increase speed, because once the QuadTree is built rectLookup is very fast: it considerably reduces the number of coordinate comparisons to be done. For example:

library(sf)
library(spData)
library(SearchTrees)
library(data.table)
library(ggplot2)

data(us_states) # loads the us_states shape
all_states <-
  us_states %>%
  # st_sf() %>%
  st_transform(102003) %>% # project to a meters based projection
  st_combine()   %>% #flattens the shape file to one big outline (no states)
  st_buffer(10000) # add a 10k buffer

# define the grid - no need to create a polygon grid, which is memory intensive
# for small grids. Just get the bbox, compute number of cells and assign a unique
# index to each
#
grid_size <- 500000
state_box <- st_bbox(all_states)
xrange <- state_box$xmax - state_box$xmin
yrange <- state_box$ymax - state_box$ymin
cell_dim <-
  c(ceiling(xrange / grid_size),
    ceiling(yrange / grid_size))
n_cells <- cell_dim[1] * cell_dim[2]
ind_rows <- ceiling(1:n_cells / cell_dim[1])
ind_cols <- (1:n_cells) - (ind_rows - 1) * cell_dim[1]
cell_indexes <- data.frame(grid_id = 1:n_cells,
                           ind_row = ind_rows,
                           ind_col = ind_cols,
                           stringsAsFactors = FALSE)

## now let's create some random points - here I build the points directly in
## the 102003 projection for speed, because st_transform() does not scale
## very well with the number of points. If your points are in 4326 you may
## consider transforming them beforehand and storing the result in an RData,
## gpkg or shapefile. I also avoid creating an `sf` object to save memory: a
## plain x-y-id data.table suffices
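## For example, a possible one-off transform-and-cache step could look like
## this (file names here are only placeholders):
# real_pts <- sf::st_read("points_4326.gpkg")     # your real points, in 4326
# real_pts <- sf::st_transform(real_pts, 102003)  # transform once...
# sf::st_write(real_pts, "points_102003.gpkg")    # ...and cache for later runs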
set.seed(1234)
t1 <- Sys.time()
num_pts <- 3000
random_pts <- data.table::data.table(
  point_id = 1:num_pts,
  lon = runif(num_pts, state_box$xmin, state_box$xmax),
  lat = runif(num_pts, state_box$ymin, state_box$ymax)
)

# Build a Quadtree over the points.
qtree <- SearchTrees::createTree(random_pts, columns = c(2,3))

# Define a function which uses `SearchTrees::rectLookup` to find points within 
# a given grid cell. Also deal with "corner cases": cells outside all_states and
# cells only partially within all_states.

find_points <- function(cell, qtree, random_pts, state_box, all_states, grid_size, cell_indexes) {

  cur_xmin <- state_box[["xmin"]] + grid_size * (cell_indexes$ind_col[cell] - 1)
  cur_xmax <- state_box[["xmin"]] + grid_size * (cell_indexes$ind_col[cell])
  cur_ymin <- state_box[["ymin"]] + grid_size * (cell_indexes$ind_row[cell] - 1)
  cur_ymax <- state_box[["ymin"]] + grid_size * (cell_indexes$ind_row[cell])

  cur_bbox <- sf::st_bbox(c(xmin = cur_xmin, xmax = cur_xmax,
                            ymin = cur_ymin, ymax = cur_ymax),
                          crs = sf::st_crs(all_states)) %>%
    sf::st_as_sfc()

  # look for contained points only if the cell intersects with the all_states poly
  if (lengths(sf::st_intersects(cur_bbox, all_states)) != 0) {

    if (lengths(sf::st_contains(all_states, cur_bbox)) != 0) {
      # If cell completely contained, use `rectLookup` to find contained points
      pts <-  SearchTrees::rectLookup(
        qtree,
        xlims = c(cur_xmin, cur_xmax),
        ylims = c(cur_ymin, cur_ymax))


    } else {
      # If cell intersects, but is not completely contained (i.e., on borders),
      # limit the rectLookup to the bbox of intersection to speed-up, then find
      # points properly contained
      cur_bbox <- sf::st_bbox(sf::st_intersection(all_states, cur_bbox))
      pts <-  SearchTrees::rectLookup(
        qtree,
        xlims = c(cur_bbox[["xmin"]], cur_bbox[["xmax"]]),
        ylims = c(cur_bbox[["ymin"]], cur_bbox[["ymax"]]))
      # now we should have "few" points - we can use sf operators - here st_contains
      # is much faster than an intersect. This should be fast even over large
      # number of points if the cells are small
      contained_pts <- sf::st_contains(
        all_states,
        sf::st_as_sf(random_pts[pts,],
                     coords = c("lon", "lat"),
                     crs = sf::st_crs(all_states)))[[1]]
      pts  <- random_pts[pts[contained_pts],][["point_id"]]
    }
    if (length(pts) == 0 ) {
      pts <- as.numeric(NA)
    } else {
      pts  <- random_pts$point_id[pts]
    }
  } else {
    pts <- as.numeric(NA)
  }
  out <- data.table::data.table(
    grid_id  = cell_indexes$grid_id[cell],
    point_id = pts)
  return(out)
}

Let's see if it works:

# Run the function through a `lapply` over grid cells

out <- lapply(1:n_cells, FUN = function(x)
  find_points(x, qtree, random_pts, state_box, all_states, grid_size,cell_indexes))
out <- data.table::rbindlist(out)
out
#>       grid_id point_id
#>    1:       1       NA
#>    2:       2       NA
#>    3:       3       NA
#>    4:       4      325
#>    5:       4     1715
#>   ---                 
#> 1841:      59     1058
#> 1842:      60      899
#> 1843:      60     2044
#> 1844:      60      556
#> 1845:      60     2420

grd <- sf::st_make_grid(all_states, cellsize = 500000) %>% 
  sf::st_sf() %>% 
  dplyr::mutate(grid_id = 1:60)

id_sub = c(5, 23)
sub_pts <- out[grid_id %in% id_sub]
sub_pts <- dplyr::left_join(sub_pts, random_pts) %>%
  sf::st_as_sf(coords = c("lon", "lat"), crs = st_crs(all_states))
#> Joining, by = "point_id"

ggplot2::ggplot(data = grd) +
  geom_sf(data = grd, fill = "transparent") +
  geom_sf_text(aes(label = grid_id)) +
  geom_sf(data = all_states, fill = "transparent") +
  geom_sf(data = sub_pts)

In my (limited) experience, this should scale pretty well with the number of points / cells, and it has a reasonably low memory footprint. In addition, it is easily parallelizable (provided you have enough memory).
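
For instance, a minimal sketch of a parallel run using the base parallel package (mclapply assumes a Unix-like OS; on Windows you would need parLapply with a cluster, and mc.cores is only a guess to be tuned to your machine, since each worker holds a copy of the points and the QuadTree):

library(parallel)

# same per-cell lookup as above, spread over 4 cores
out_par <- mclapply(1:n_cells, function(x)
  find_points(x, qtree, random_pts, state_box, all_states, grid_size, cell_indexes),
  mc.cores = 4)
out_par <- data.table::rbindlist(out_par)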

If you still do not manage to run it on the full dataset (I could not test it on my laptop), you could also "split" the execution by analyzing the points in "chunks": for example, by saving them to a shp/gpkg and then reading only a part of the points using the query argument, or by saving a table ordered by lon and reading the first XX rows. If you filter on longitude/latitude, this could give you a further speed-up, because you could then also automatically reduce the number of cells to be analyzed and save a lot of time.
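
A rough sketch of that chunking idea, assuming the points have already been written to a (hypothetical) GeoPackage points_102003.gpkg with a layer pts that stores point_id, lon and lat as plain attribute columns in the 102003 projection:

# split the x-range into 10 longitude bands and process one band at a time;
# file, layer and column names are only placeholders
lon_breaks <- seq(state_box$xmin, state_box$xmax, length.out = 11)

out_chunked <- data.table::rbindlist(lapply(seq_len(length(lon_breaks) - 1), function(i) {
  # read only the points falling in this band (the OGR SQL query is handled by
  # GDAL, so the full dataset is never loaded into memory)
  chunk <- sf::st_read(
    "points_102003.gpkg",
    query = sprintf("SELECT point_id, lon, lat FROM pts WHERE lon >= %f AND lon < %f",
                    lon_breaks[i], lon_breaks[i + 1]),
    quiet = TRUE)
  chunk_pts   <- data.table::as.data.table(sf::st_drop_geometry(chunk))
  chunk_qtree <- SearchTrees::createTree(chunk_pts, columns = c(2, 3))
  # a refined version would scan only the cells whose x-range overlaps this band
  data.table::rbindlist(lapply(1:n_cells, function(x)
    find_points(x, chunk_qtree, chunk_pts, state_box, all_states,
                grid_size, cell_indexes)))
}))

The NA rows and the duplicated grid_id entries coming from cells that straddle two bands can then be cleaned up in the combined result.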

Upvotes: 2

Ankur Goel

Reputation: 311

Try using the cloud solution as described in the link below:

https://blog.sicara.com/speedup-r-rstudio-parallel-cloud-performance-aws-96d25c1b13e2

Upvotes: -1
