Reputation: 728
I am trying to find the most efficient way to solve the following puzzle in R without nested for loops (nested for loops would take forever):
Let's say we have two data frames, d_zone2 and stops. Among their columns are three of interest: lat, long and zone (named Lat, Long and X8 in the code), which describe the positions of certain points on a map divided into different polygons. The zone column in d_zone2 is initialized to NA throughout. I want to fill in each element of the zone column in d_zone2 by the following rule: for each lat-lon pair in d_zone2, assign the zone of the row in stops whose lat-lon pair has the minimum Euclidean distance to that pair.
The initial solution I thought of is to sort the data frame stops in increasing order by lat and then by long. Then, for each lat-lon pair in d_zone2, I can use nested for loops to go through all the successive lat-lon pairs in stops and determine between which two of them my d_zone2 point lies. The code is the following:
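# Assumes stops is already sorted by Lat, then Long
for (i in seq_len(nrow(d_zone2))) {
  # Stop one row early so that j + 1 stays within stops
  for (j in seq_len(nrow(stops) - 1)) {
    if (d_zone2$Lat[i]  >= stops$Lat[j]  && d_zone2$Lat[i]  <= stops$Lat[j + 1] &&
        d_zone2$Long[i] >= stops$Long[j] && d_zone2$Long[i] <= stops$Long[j + 1]) {
      d_zone2$X8[i] <- stops$X8[j]
    }
  }
}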
However, I realized that this is not quite right, because d_zone2$X8[i] might belong to stops$X8[j+1] (since its lat-lon might be closer to that of stops$X8[j+1] than to that of stops$X8[j]). Thus, I think the only valid approach is to find, for each lat-lon pair in d_zone2, the lat-lon pair in stops at the minimum Euclidean distance from it. But I don't know how to do this in R without using nested for() loops.
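For reference, the minimum-distance rule described above can be written without explicit loops in base R. The following is only a minimal sketch on made-up toy data (the coordinates and zone labels are invented for illustration; only the column names Lat, Long and X8 come from the question). It builds the full matrix of squared distances with outer() and picks the closest stop per row with max.col():

```r
# Toy stand-ins for the real data frames (values are hypothetical)
stops <- data.frame(
  Lat  = c(0, 0, 10),
  Long = c(0, 10, 10),
  X8   = c("Z1", "Z2", "Z3"),
  stringsAsFactors = FALSE
)
d_zone2 <- data.frame(
  Lat  = c(1, 9, 0.5),
  Long = c(1, 9, 8),
  X8   = NA_character_,
  stringsAsFactors = FALSE
)

# Squared Euclidean distance from every d_zone2 point (rows)
# to every stop (columns); the square root is not needed to find the minimum
d2 <- outer(d_zone2$Lat,  stops$Lat,  "-")^2 +
      outer(d_zone2$Long, stops$Long, "-")^2

# Column index of the smallest distance in each row
nearest <- max.col(-d2, ties.method = "first")

# Copy the zone of the nearest stop
d_zone2$X8 <- stops$X8[nearest]
```

The distance matrix has nrow(d_zone2) times nrow(stops) entries, so on large inputs this would probably need to be run on de-duplicated points or in chunks.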
2nd approach: Another approach is to take advantage of the list of polygons stored in the zone.csv file below. The solution would then be to pick out the polygon (bucket) that both a lat-lon pair in d_zone2 and a lat-lon pair in stops fall into, and then assign the zone number recorded in stops$X8 for that stop to the corresponding element in d_zone2.
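As a rough illustration of the polygon idea, here is a small base R sketch of a ray-casting point-in-polygon test. The layout of zone.csv is not shown in this thread, so the zones list, the helper names in_polygon and find_zone, and all coordinates below are hypothetical. Packages such as sp or sf offer point-in-polygon operations that would likely be preferable on real data.

```r
# Ray casting: TRUE if point (px, py) lies inside the polygon
# whose vertices are given in order by (vx, vy)
in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does the edge from vertex j to vertex i straddle the horizontal line y = py?
    crosses <- (vy[i] > py) != (vy[j] > py)
    if (crosses &&
        px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]) {
      inside <- !inside
    }
    j <- i
  }
  inside
}

# Hypothetical zones: two adjacent squares keyed by zone label
zones <- list(
  Z1 = list(x = c(0, 5, 5, 0),   y = c(0, 0, 5, 5)),
  Z2 = list(x = c(5, 10, 10, 5), y = c(0, 0, 5, 5))
)

# Return the label of the first zone containing the point, or NA
find_zone <- function(px, py) {
  for (z in names(zones)) {
    if (in_polygon(px, py, zones[[z]]$x, zones[[z]]$y)) return(z)
  }
  NA_character_
}

pts <- data.frame(Long = c(1, 7, 20), Lat = c(1, 2, 20))
pts$zone <- mapply(find_zone, pts$Long, pts$Lat)
```

Points that fall in no polygon come back as NA, so they could still be handled by the nearest-stop rule.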
Question: Could anyone please help me solve this puzzle using either the Euclidean approach or the 2nd approach described above? I want to use dplyr::select(dplyr::left_join(x = d_zone2, y = stops %>% select("Lat", "Long", X8), by = ...), but I am not sure what the right condition for by = ?? would be.
.RData file containing both data frames d_zone2 and stops. Warning: Quite large files!
Upvotes: 1
Views: 264
Reputation: 23919
This takes about 15 seconds because we filter out duplicate points first. This leaves us with 5457 unique points in d_zone2. For each of them we compute the distance to all stops and get the index of the stop with the minimum distance. Afterwards you can match the zones by stop id to all 19228939 points.
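library(sp)
library(data.table)
setDT(d_zone2)
# Stop coordinates as a matrix; columns 3 and 2 of stops are taken to be Long and Lat
stop_points <- as.matrix(stops[, 3:2])
# Keep one row per distinct (Long, Lat) pair
short <- unique(d_zone2, by = c("Long", "Lat"))
# For each unique point, take the zone (X8) of the stop at minimum distance
short[, ZONE := stops[which.min(spDists(x = stop_points, y = cbind(Long, Lat))), ]$X8, by = .(Long, Lat)]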
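The last step, carrying the zones back to every row of the full table, is not shown above. One way it might look is a data.table update join on the coordinates. This is a sketch on toy data with made-up values, where short stands in for the de-duplicated table that already has ZONE filled in:

```r
library(data.table)

# Toy stand-in for the full table (values are hypothetical)
d_zone2 <- data.table(
  Lat  = c(10, 10, 20, 20, 10),
  Long = c(50, 50, 60, 60, 50)
)

# De-duplicated table with a zone per unique coordinate pair;
# here the zones are assigned by hand instead of by distance
short <- unique(d_zone2, by = c("Long", "Lat"))
short[, ZONE := c("A", "B")]

# Update join: for every row of d_zone2 that matches a row of short
# on (Long, Lat), copy that row's ZONE into d_zone2 by reference
d_zone2[short, on = .(Long, Lat), ZONE := i.ZONE]
```

Because the join updates d_zone2 by reference, no second copy of the large table is made.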
Upvotes: 3