Reputation: 6323
I have the two following datasets:
houses <- data.table(house_number = c(1:3),
lat_decimal = seq(1.1, 1.3, by = 0.1),
lon_decimal = seq(1.4, 1.6, by = 0.1))
stations <- data.table(station_numer = c(1:11),
lat_decimal = seq(1, 2, by = 0.1),
lon_decimal = seq(2, 3, by = 0.1))
I want to merge houses
and stations
together such that the resulting station_number
is the station that's closest to the corresponding house_number
.
This question is very similar, but I'm not sure if they're working with latitude and longitude and also, I don't know how to calculate distances when dealing with longitude and latitude (which is why I prefer to simply use distm
from the geosphere
package).
I have never worked with the outer function. In case the answer from the aforementioned question would work, how can I adapt the answer to use the distm
function instead of the sqrt
function?
Upvotes: 2
Views: 700
Reputation: 16090
Use match_nrst_haversine
from hutilscpp
:
library(hutilscpp)
houses[, c("station_number", "dist") := match_nrst_haversine(lat_decimal,
lon_decimal,
addresses_lat = stations$lat_decimal,
addresses_lon = stations$lon_decimal,
Index = stations$station_numer,
close_enough = 0,
cartesian_R = 5)]
houses
#> house_number lat_decimal lon_decimal station_number dist
#> 1: 1 1.1 1.4 1 67.62617
#> 2: 2 1.2 1.5 1 59.87076
#> 3: 3 1.3 1.6 1 55.59026
You may want to adjust close_enough
and cartesian_R
if your data are numerous (i.e. over a million points to match) for performance.
`cartesian_R`
The maximum radius of any address from the points to be geocoded. Used to accelerate the detection of minimum distances. Note, as the argument name suggests, the distance is in cartesian coordinates, so a small number is likely.
`close_enough`
The distance, in metres, below which a match will be considered to have occurred. (The distance that is considered "close enough" to be a match.)
For example, close_enough = 10 means the first location within ten metres will be matched, even if a closer match occurs later.
May be provided as a string to emphasize the units, e.g. close_enough = "0.25km". Only km and m are permitted.
Upvotes: 4
Reputation: 11957
Your question is a bit more complicated than a simple merge, and outer
is somewhat ill-suited for the purpose. To be as thorough as possible, we want to calculate the distance between all combinations of houses and stations, then keep only the closest station per house. We'll need two packages:
library(tidyverse)
library(geosphere)
First, a bit of prep. distm
expects coordinates to be ordered as longitude first, latitude second (you have the opposite), so let's fix that, name the columns better, and correct a typo while we're at it:
houses <- data.frame(house_number = c(1:3),
lon_house = seq(1.4, 1.6, by = 0.1),
lat_house = seq(1.1, 1.3, by = 0.1)
)
stations <- data.frame(station_number = c(1:11),
lon_station = seq(2, 3, by = 0.1),
lat_station = seq(1, 2, by = 0.1)
)
We'll create "nested" data frames so that it's easier to keep coordinates together:
house_nest <- nest(houses, -house_number, .key = 'house_coords')
station_nest <- nest(stations, -station_number, .key = 'station_coords')
house_number house_coords
<int> <list>
1 1 <data.frame [1 × 2]>
2 2 <data.frame [1 × 2]>
3 3 <data.frame [1 × 2]>
station_number station_coords
<int> <list>
1 1 <data.frame [1 × 2]>
2 2 <data.frame [1 × 2]>
3 3 <data.frame [1 × 2]>
4 4 <data.frame [1 × 2]>
5 5 <data.frame [1 × 2]>
6 6 <data.frame [1 × 2]>
7 7 <data.frame [1 × 2]>
8 8 <data.frame [1 × 2]>
9 9 <data.frame [1 × 2]>
10 10 <data.frame [1 × 2]>
11 11 <data.frame [1 × 2]>
Use dplyr::crossing
to combine every row from both data frames:
data.master <- crossing(house_nest, station_nest)
house_number house_coords station_number station_coords
<int> <list> <int> <list>
1 1 <data.frame [1 × 2]> 1 <data.frame [1 × 2]>
2 1 <data.frame [1 × 2]> 2 <data.frame [1 × 2]>
3 1 <data.frame [1 × 2]> 3 <data.frame [1 × 2]>
4 1 <data.frame [1 × 2]> 4 <data.frame [1 × 2]>
5 1 <data.frame [1 × 2]> 5 <data.frame [1 × 2]>
6 1 <data.frame [1 × 2]> 6 <data.frame [1 × 2]>
7 1 <data.frame [1 × 2]> 7 <data.frame [1 × 2]>
8 1 <data.frame [1 × 2]> 8 <data.frame [1 × 2]>
9 1 <data.frame [1 × 2]> 9 <data.frame [1 × 2]>
10 1 <data.frame [1 × 2]> 10 <data.frame [1 × 2]>
# ... with 23 more rows
With all this now in place, we can use distm
on each row to calculate a distance, and keep the shortest distance per house:
data.dist <- data.master %>%
mutate(dist = map2_dbl(house_coords, station_coords, distm)) %>%
group_by(house_number) %>%
filter(dist == min(dist))
house_number house_coords station_number station_coords dist
<int> <list> <int> <list> <dbl>
1 1 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 67690.
2 2 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 59883.
3 3 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 55519.
Upvotes: 2