Reputation: 29
I'm doing an SQL-style join of two large, fixed-size (lat, long) coordinate datasets by nearest-neighbor search. Currently I'm using dplyr and data.table to do this. How can I optimize and parallelize my code for absolute runtime?
Previous attempts include native Python, pandas, and multiprocessing, all of which ended up being very slow. My current solution, which uses data.table to construct a table of nearest neighbors and dplyr to join based on that table, is the quickest but is still too slow.
library(dplyr)
library(data.table)
library(geosphere)
source <- data.table(lat = runif(1e3), long = runif(1e3)) %>% mutate(nrow = row_number())
dest <- data.table(lat = runif(5e4), long = runif(5e4)) %>% mutate(ind = row_number())
dest_mat <- as.matrix(dest[, c('long', 'lat')])
setDT(source)
# function that returns the index of the closest matching point in dest
mindist_ind <- function(source_lat, source_long) { return(which.min(distHaversine(c(source_long, source_lat), dest_mat))) }
nn_inds <- source[, j = list(ind = mindist_ind(lat, long)), by = 1:nrow(source)] # slowest line, gets index of nearest match in dest
nn_matches <- inner_join(nn_inds, dest, by = 'ind') # join final back to dest in order to get all nearest matches
sourcedest_matches <- inner_join(source, nn_matches, by = 'nrow') # join nearest matches to source by index
The full source file is ~89 million rows, and dest is roughly ~50k rows.
While this is the quickest approach I've managed, timings at smaller source sizes suggest the full 89-million-row source file would take an estimated 17-18 days to run, which is far too long. I'm running this on an r4.16xlarge AWS EC2 instance with 488 GB RAM, 32 cores, and 64 vCPUs. How can I optimize/parallelize this code to run faster?
Upvotes: 1
Views: 228
Reputation: 5898
I'm assuming the code you've provided in your question isn't actually what you want. Your code calculates the distance between pairwise rows of source and dest, recycling source to match the length of dest. What you probably want, and what this answer provides, is to find the nearest point in dest for every point in source (see my comments on your question).
Calculating distance matrices is computationally intensive. Assuming R packages are approximately equally efficient at calculating distance matrices, really the only way to speed this up is to parallelize over the distance matrix calculation. It's unfortunate that the matrix with more rows is the reference points, because parallelization can only occur over subsets of the source points (i.e., you need to consider all dest points to find the nearest dest point to any given source point).
library(parallel)
library(sp)
# non-parallel version
x2 <- copy(source)
# full distance matrix: one row per source point, one column per dest point
temp <- spDists(as.matrix(x2[, .(long, lat)]), dest_mat, longlat = TRUE)
# for each source row, attach the coordinates of its nearest dest point
system.time(
  final2 <- x2[, c("long_dest", "lat_dest") := as.data.table(dest_mat[apply(temp, 1, which.min), , drop = FALSE])]
)
# parallel version: split source into chunks and process each chunk on a separate worker
source_l <- split(source, rep(1:10, each = 100))  # 10 chunks of 100 rows for this toy example
cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, "dest_mat")  # this is slow but I don't think there's a way to get around it
system.time(
  v <- parLapply(cl = cl, X = source_l, fun = function(x) {
    library(sp)
    library(data.table)
    temp <- spDists(as.matrix(x[, .(long, lat)]), dest_mat, longlat = TRUE)
    x[, c("long_dest", "lat_dest") := as.data.table(dest_mat[apply(temp, 1, which.min), , drop = FALSE])]
    x
  })
)
stopCluster(cl)
final <- rbindlist(v)  # combine the chunk results back into one table
identical(final[order(nrow)], final2)  # sanity check against the non-parallel result
You'll need to play around with whether using more than 32 processes actually speeds things up; hyperthreading can be a mixed bag, and it's not always easy to predict whether it will help. Unfortunately, there's no guarantee you'll have enough RAM to run the optimal number of processes: this approach is not only slow but also memory-intensive, since each worker holds its own copy of dest_mat plus the distance matrix for its chunk. If you get errors indicating that you've run out of memory, you'll need to decrease the number of processes or rent an EC2 machine with more memory.
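As a rough sketch of how you might size the split for your real source table (n_workers here is just a placeholder for whatever process count you settle on):
n_workers <- 32  # tune this; benchmark 32 vs 64 and back off if memory becomes an issue
chunk_id  <- cut(seq_len(nrow(source)), breaks = n_workers, labels = FALSE)  # contiguous, roughly equal chunks
source_l  <- split(source, chunk_id)  # one chunk per worker instead of the hard-coded 10 x 100 split
cl <- makeCluster(n_workers)
# then clusterExport(cl, "dest_mat") and parLapply over source_l as above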
Finally, I'll note that which.min returns the index of the first minimum if there are ties, so in the case of tied distances the result will depend on the order of rows in dest_mat.
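For example:
which.min(c(5, 1, 1, 3))  # returns 2, the index of the first of the tied minima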
Upvotes: 1