user3077008

Reputation: 847

Fastest Way to Get Geo Distance for Large Dataset in R

I have two datasets that contain longitude and latitude. The large dataset has about 20M observations and the small dataset has 36K observations. I am trying to find the number of observations from the small dataset that fall within 200 meters of each point in the large dataset. So the process is:

  1. Take one geocode from the large dataset
  2. Calculate the distance to every point in the small dataset
  3. Count how many points fall within 200 meters

The problem is that it takes a very long time to complete this task in R. I use the distm function from the geosphere package, but it is still pretty slow. It easily exceeds 24 hours.

Here is the code that I used:

library(geosphere)

#dataset1 (large)
df1 <- data.frame(longitude = c(-77.14239, -77.10750, -77.14239, -77.01797, -77.17203, -77.47230, -77.26490, -77.02824, -76.96993, -77.03185),
latitude = c(38.80575, 38.87987, 38.80575, 38.90425, 38.77076, 38.98140, 38.92800, 38.90436, 38.84569, 38.92080))

#dataset2 (small)
df2 <- data.frame(longitude = c(-75.34186, -123.59649, -108.20089, -115.16004, -87.62970),
                 latitude = c(40.11899, 44.38151, 36.71881, 36.22207, 41.71438))

# get the number of obs that fall within 200m

for (i in 1:nrow(df1)) {
  df1$n_within200m[i] <- sum(as.numeric(distm(cbind(df1$longitude[i], df1$latitude[i]),
                                              cbind(df2$longitude, df2$latitude)) < 200))
}

Would there be a faster way to complete this task?

Upvotes: 1

Views: 728

Answers (1)

Nullius in Verba

Reputation: 21

This code gives an approximate answer, and is intended mainly to explain the idea.

First, using latitudes and longitudes is not the best way to store the data, as the scale is non-uniform with latitude and things get messy close to the poles. A much better approach is to store the data as Cartesian xyz, based on an origin at the centre of the Earth. (This coordinate system is commonly called 'Earth-centred Earth-fixed' or ECEF.) We can then calculate the Cartesian distance between points much more simply. I have generated my test data directly in xyz coordinates to save time - hopefully you will have no difficulty converting your data.
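
For example, on a spherical-Earth approximation the conversion of the question's df1 and df2 might look like the sketch below (the toXYZ helper is just an illustrative name, and latitudes/longitudes are assumed to be in degrees):

Re <- 6371  # Earth's mean radius in km (spherical approximation)

# Convert longitude/latitude in degrees to Cartesian xyz in km
toXYZ <- function(df) {
  lat <- df$latitude  * pi / 180
  lng <- df$longitude * pi / 180
  data.frame(x = Re * cos(lat) * cos(lng),
             y = Re * cos(lat) * sin(lng),
             z = Re * sin(lat))
}

xyz1 <- toXYZ(df1)  # large dataset from the question
xyz2 <- toXYZ(df2)  # small dataset from the question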

A fast way to search for matches in large data sets is to use a hash. We divide the data up into blocks, 10 km on a side, and encode each block location as a unique integer. Points which are close together will usually have the same hash number, and for any given hash there will usually be only a small number of points in that block to compare, which is far more manageable than searching all the data.

To keep the explanation simple, I've just used one hash. This means that pairs which cross the boundary of a 10 km block will be missed. To fix this, we need to generate 4 hashes, each offset by a quarter of the block size in each direction: H1 = Hash(x,y,z), H2 = Hash(x+2.5,y+2.5,z+2.5), H3 = Hash(x+5,y+5,z+5), H4 = Hash(x+7.5,y+7.5,z+7.5). If two points match on any of the hashes, the pair needs to be checked. Generate the list of matches for each hash, combine the lists with 'rbind', and then use 'unique' to remove duplicates; a sketch of this appears after the main example below.

Re <- 6371 # Earth's mean radius in km

# Define function to make some test data
# Uniformly distributed over surface of the Earth
# Include row number
# Lat/Long is inconvenient because distance is non-uniform around the poles
# So use Cartesian xyz instead
# Easy to convert lat/long to xyz
# Use (Re*cos(lat)*cos(lng),Re*cos(lat)*sin(lng),Re*sin(lat))

makeData <- function(n) {
  pts <- data.frame(Row=seq(n),x=rnorm(n),y=rnorm(n),z=rnorm(n))
  r <- sqrt(pts$x^2+pts$y^2+pts$z^2)
  pts$x <- Re*pts$x/r
  pts$y <- Re*pts$y/r
  pts$z <- Re*pts$z/r

  return(pts)
}

# Generate test data, 20 million points and 40 thousand points respectively
ptsA <- makeData(20000000)
ptsB <- makeData(40000)

# Generate an integer hash of the xyz position based on 10 km boxes
ptsA$hash <- floor(ptsA$x/10)+638*floor(ptsA$y/10)+638^2*floor(ptsA$z/10)
ptsB$hash <- floor(ptsB$x/10)+638*floor(ptsB$y/10)+638^2*floor(ptsB$z/10)

# Identify pairs of rows with a matching hash
matches <- merge(ptsA,ptsB,by="hash")

# Calculate distance between points for each pair identified
matches$dist <- with(matches,sqrt((x.x-x.y)^2+(y.x-y.y)^2+(z.x-z.y)^2))

# Filter out point pairs within 200 m of one another
matches <- matches[matches$dist < 0.2,]

# For each observed value of PtsA Row, count the number of matches
Counts <- aggregate(matches$Row.y,by=list(PtsARow=matches$Row.x),length)

# Initialise a column in PtsA with zeroes,
# then fill in the non-zero values found
ptsA$Count <- 0
ptsA[Counts$PtsARow,"Count"] <- Counts$x

# 'matches' contains a list of the matching points found
matches[order(matches$Row.x),][1:20,] # show first 20 matches

# Frequency table of counts
table(ptsA$Count)
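
As mentioned above, the single-hash version misses pairs that straddle a 10 km block boundary. A rough sketch of the four-offset-hash variant might look like this (it reuses ptsA and ptsB from the example above; hashOffset is just an illustrative helper name, and this is a sketch rather than a tuned implementation):

# Hash of the xyz position on 10 km blocks, shifted by 'off' km in each axis.
# With offsets of 0, 2.5, 5 and 7.5 km, any pair within 200 m shares a
# block under at least one of the four hashes.
hashOffset <- function(pts, off) {
  floor((pts$x+off)/10) + 638*floor((pts$y+off)/10) + 638^2*floor((pts$z+off)/10)
}

matches <- NULL
for (off in c(0, 2.5, 5, 7.5)) {
  ptsA$hash <- hashOffset(ptsA, off)
  ptsB$hash <- hashOffset(ptsB, off)
  matches <- rbind(matches, merge(ptsA, ptsB, by="hash"))
}

# Drop pairs found under more than one hash, then verify true distances
pairs <- unique(matches[,c("Row.x","Row.y")])
pairs$dist <- sqrt((ptsA$x[pairs$Row.x]-ptsB$x[pairs$Row.y])^2 +
                   (ptsA$y[pairs$Row.x]-ptsB$y[pairs$Row.y])^2 +
                   (ptsA$z[pairs$Row.x]-ptsB$z[pairs$Row.y])^2)
pairs <- pairs[pairs$dist < 0.2,]

# Counting per ptsA row then proceeds exactly as in the main example,
# using pairs$Row.x in place of matches$Row.x.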

I make no claims that this is the fastest way to do it, but it will certainly run a lot faster than 24 hours.

Upvotes: 2
