Reputation: 125
I have a list of co-ordinates of certain bus stops in this format
Bus_Stop_ID lat long
A -34.04199 18.61747
B -33.92312 18.44649
I then have a list of certain shops
Shop_ID lat long
1 -34.039350 18.617964
2 -33.927820 18.410520
I would like to check whether the shops fall within a 500 metre radius from the bus stop. Ultimately, the final dataset would look something like this where the Bus_Stop column indicates T/F and Bus_Stop_ID shows the relevant BUS ID(s) for that shop if Bus_Stop == T -
Shop_ID lat long Bus_Stop Bus_ID
1 -34.039350 18.617964 TRUE A
2 -33.927820 18.410520 FALSE #NA
Does anyone have an idea about how I can go about this using R? I've seen the package geosphere
but have struggled to understand it given my relative inexperience in the spatial domain. Any ideas or packages you could recommend? Thank you
Upvotes: 2
Views: 1545
Reputation: 7435
The previous answer (still included below) is not suited for large data sets. The reason is that we need to compute the distance for each pair of shops
and bus
. Therefore both the memory and computation scale as O(N*M)
for N
shops and M
buses. A more scalable solution uses a data structure such as a KD-Tree to perform nearest neighbor search for each shop. The advantage here is that the computational complexity becomes O(M*logM)
for building the KD-Tree for the bus stops and O(N*logM)
for searching the nearest neighbor for each shop.
To do this, we can use nn2
from the RANN
package. The complication here is that nn2
deals only with Euclidean distances and does not know anything about lat/long. Therefore, we need to convert the lat/long coordinates to some map projection (i.e. UTM) in order to use it correctly (i.e., in order to compute the Euclidean distance between shops and bus stops correctly).
Note: The following borrows heavily from Josh O'Brien's solutions for determining the UTM zone from a longitude and for converting lat/long to UTM, so he should take a bow.
## First define a function from Josh OBrien's answer to convert
## a longitude to its UTM zone
long2UTM <- function(long) {
(floor((long + 180)/6) %% 60) + 1
}
## Assuming that all points are within a zone (within 6 degrees in longitude),
## we use the first shop's longitude to get the zone.
z <- long2UTM(shops[1,"long"])
library(sp)
library(rgdal)
## convert the bus lat/long coordinates to UTM for the computed zone
## using the other Josh O'Brien linked answer
bus2 <- bus
coordinates(bus2) <- c("long", "lat")
proj4string(bus2) <- CRS("+proj=longlat +datum=WGS84")
bus.xy <- spTransform(bus2, CRS(paste0("+proj=utm +zone=",z," ellps=WGS84")))
## convert the shops lat/long coordinates to UTM for the computed zone
shops2 <- shops
coordinates(shops2) <- c("long", "lat")
proj4string(shops2) <- CRS("+proj=longlat +datum=WGS84")
shops.xy <- spTransform(shops2, CRS(paste0("+proj=utm +zone=",z," ellps=WGS84")))
library(RANN)
## find the nearest neighbor in bus.xy@coords for each shops.xy@coords
res <- nn2(bus.xy@coords, shops.xy@coords, 1)
## res$nn.dist is a vector of the distance to the nearest bus.xy@coords for each shops.xy@coords
## res$nn.idx is a vector of indices to bus.xy of the nearest bus.xy@coords for each shops.xy@coords
shops$Bus_Stop <- res$nn.dists <= 500
shops$Bus_ID <- ifelse(res$nn.dists <= 500, bus[res$nn.idx,"Bus_Stop_ID"], NA)
Although more complicated, this approach is much better suited for realistic problems where you may have large numbers of shops and bus stops. Using the same supplied data:
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
You can do this using the package geosphere
. Here, I'm assuming that your first data frame is named bus
, and your second data frame is named shops
:
library(geosphere)
g <- expand.grid(1:nrow(shops), 1:nrow(bus))
d <- matrix(distGeo(shops[g[,1],c("long","lat")], bus[g[,2],c("long","lat")]),
nrow=nrow(shops))
shops$Bus_Stop <- apply(d, 1, function(x) any(x <= 500))
shops$Bus_ID <- bus[apply(d, 1, function(x) {
c <-which(x <= 500)
if(length(c)==0) NA else c[1]
}), "Bus_Stop_ID"]
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
Notes:
expand.grid
to enumerate all pair combinations of shops
and bus
stops. These are ordered by shops
first.d
using geosphere::distGeo
. Note here that the input expects (lon, lat) coordinates. distGeo
returns distances in meters. The resulting d
matrix is now(shops)
by now(bus)
so that each row gives the distance from a shop to each bus stop.any(x <= 500)
for each row x
in d
using apply
with MARGIN=1
.d
(corresponding to the row in bus
) for the first shop that is within 500 meters using which
instead of any
in our applied function. Then use this result to select the Bus_Stop_ID
from bus
.By the way, we don't have to apply
the condition x <= 500
twice. The following will also work:
shops$Bus_ID <- bus[apply(d, 1, function(x) {
c <-which(x <= 500)
if(length(c)==0) NA else c[1]
}), "Bus_Stop_ID"]
shops$Bus_Stop <- !is.na(shops$Bus_ID)
and is more efficient.
Data:
bus <- structure(list(Bus_Stop_ID = structure(1:2, .Label = c("A", "B"
), class = "factor"), lat = c(-34.04199, -33.92312), long = c(18.61747,
18.44649)), .Names = c("Bus_Stop_ID", "lat", "long"), class = "data.frame", row.names = c(NA,
-2L))
shops <- structure(list(Shop_ID = 1:2, lat = c(-34.03935, -33.92782),
long = c(18.617964, 18.41052), Bus_ID = structure(c(1L, NA
), .Label = c("A", "B"), class = "factor"), Bus_Stop = c(TRUE,
FALSE)), .Names = c("Shop_ID", "lat", "long", "Bus_ID", "Bus_Stop"
), row.names = c(NA, -2L), class = "data.frame")
Upvotes: 4
Reputation: 11
My first approach would be to just use Euclidean distance and check whether the resulting value is greater or equal 0.
You could then use an IF clause and check T/F conditions.
I hope this helps.
PS: In my imagination, a distance of 500m would be a rather flat representation of the Earth's surface, so I don't think it's needed to use some geoid packages.
Upvotes: 1