Reputation: 28169
I have a data.frame (see below) of airport codes. I'm trying to run airportr::airport_distance to get the distance between each origin/destination pair. I get an error when I run it on the full data frame (see code below). Any ideas why this won't work?
df1 <- structure(list(orig_station = c("LAX", "BUF", "ATL", "DEN", "ORD",
"DEN", "MEM", "TYS", "IAH", "CID"), dest_station = c("SFO", "MIA",
"CAE", "DEN", "IND", "DEN", "MEM", "TPA", "IAH", "PDX")), row.names = c(NA,
10L), class = "data.frame")
df1$dist <- airport_distance(df1$orig_station, df1$dest_station)
Upvotes: 0
Views: 74
Reputation: 79318
I looked into the airport_distance function and noticed that it is not vectorized. That is a problem on a large dataset, because computing the distances one pair at a time becomes impractically slow. You should probably consider writing a vectorized function. A simple case would be:
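library(airportr)  # provides the `airports` lookup table used below

vec_dist <- function(df){
  # df: two columns of IATA codes (origin, destination) and nothing else
  air <- unlist(df)  # origins stacked on top of destinations
  match1 <- dplyr::filter(airports, IATA %in% unique(air))
  point <- match(air, match1$IATA)  # row of each code in the lookup table
  # convert to radians; ncol = 2 puts origins in column 1, destinations in column 2
  lon <- matrix((match1$Longitude * pi/180)[point], ncol = 2)
  lat <- matrix((match1$Latitude * pi/180)[point], ncol = 2)
  radius <- 6373  # Earth radius in km
  dlon <- lon[,2] - lon[,1]
  dlat <- lat[,2] - lat[,1]
  # haversine formula
  a <- (sin(dlat/2))^2 + cos(lat[,1]) * cos(lat[,2]) * (sin(dlon/2))^2
  b <- 2 * atan2(sqrt(a), sqrt(1 - a))
  cbind(df, dist = radius * b)
}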
vec_dist(df1)
   orig_station dest_station      dist
1           LAX          SFO  543.3598
2           BUF          MIA 1912.5540
3           ATL          CAE  307.6851
4           DEN          DEN    0.0000
5           ORD          IND  285.6848
6           DEN          DEN    0.0000
7           MEM          MEM    0.0000
8           TYS          TPA  882.3557
9           IAH          IAH    0.0000
10          CID          PDX 2500.2793
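The least obvious step in the function is how unlist, match and matrix(..., ncol = 2) work together. Here is a minimal base R sketch of that idea. It assumes a small made-up lookup table (toy_airports, with approximate coordinates typed in for illustration) in place of the package's airports data, and the names pairs, codes and idx are hypothetical.

```r
# Toy lookup table; coordinates are approximate and for illustration only.
toy_airports <- data.frame(
  IATA      = c("LAX", "SFO", "DEN"),
  Latitude  = c(33.9425, 37.6190, 39.8617),
  Longitude = c(-118.4080, -122.3750, -104.6730),
  stringsAsFactors = FALSE
)

pairs <- data.frame(orig = c("LAX", "DEN"), dest = c("SFO", "DEN"),
                    stringsAsFactors = FALSE)

# unlist() stacks the columns: all origins first, then all destinations.
codes <- unlist(pairs, use.names = FALSE)

# A single match() call finds the table row for every code at once.
idx <- match(codes, toy_airports$IATA)

# matrix(..., ncol = 2) fills column-wise, so column 1 holds the origins
# and column 2 the destinations, in the original row order.
to_rad <- pi / 180
lat <- matrix(toy_airports$Latitude[idx]  * to_rad, ncol = 2)
lon <- matrix(toy_airports$Longitude[idx] * to_rad, ncol = 2)

# Haversine formula applied to whole columns; 6373 km is the Earth radius
# used in the function above.
a <- sin((lat[, 2] - lat[, 1]) / 2)^2 +
  cos(lat[, 1]) * cos(lat[, 2]) * sin((lon[, 2] - lon[, 1]) / 2)^2
dist_km <- 6373 * 2 * atan2(sqrt(a), sqrt(1 - a))
dist_km
```

Every step operates on whole vectors, so the cost is one table lookup plus a handful of arithmetic passes no matter how many rows there are. With these rounded coordinates the LAX to SFO value should land close to the figure in the output above, and DEN to DEN is exactly 0.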
Why would you consider writing your own function? A quick benchmark gives you an idea:
microbenchmark::microbenchmark(vec_dist(df1),
unlist_Map=unlist(Map(airport_distance, df1$orig_station, df1$dest_station)),
apply_=apply(df1[c('orig_station', 'dest_station')], 1, function(x) airport_distance(x[1], x[2])),
vectorize=Vectorize(airport_distance)(df1$orig_station, df1$dest_station), times=2)
Unit: milliseconds
          expr        min         lq       mean     median         uq        max neval
 vec_dist(df1)   3.176101   3.176101   3.536051   3.536051   3.896001   3.896001     2
    unlist_Map 431.611700 431.611700 498.710251 498.710251 565.808801 565.808801     2
        apply_ 572.807201 572.807201 577.864401 577.864401 582.921601 582.921601     2
     vectorize 483.825801 483.825801 528.993851 528.993851 574.161900 574.161900     2
And that is on data with only 10 rows. What happens when the data grows, even if it is just the same points repeated? Rerunning the same benchmark on 1,000 rows:
df1 <- df1[rep(1:10, each=100), ]
Unit: milliseconds
          expr          min           lq         mean       median         uq        max neval
 vec_dist(df1)     7.084901     7.084901     8.564601     8.564601    10.0443    10.0443     2
    unlist_Map 45161.593601 45161.593601 45229.421051 45229.421051 45297.2485 45297.2485     2
        apply_ 45536.644800 45536.644800 53869.454001 53869.454001 62202.2632 62202.2632     2
     vectorize 45286.505601 45286.505601 51775.855502 51775.855502 58265.2054 58265.2054     2
Upvotes: 3
Reputation: 887651
We can use Map or mapply, as the function is not vectorized.
library(airportr)
df1$dist <- unlist(Map(airport_distance, df1$orig_station, df1$dest_station))
Or with apply
df1$dist <- apply(df1[c('orig_station', 'dest_station')], 1,
function(x) airport_distance(x[1], x[2]))
Or another option is to Vectorize
Vectorize(airport_distance)(df1$orig_station, df1$dest_station)
#       LAX       BUF       ATL       DEN       ORD       DEN       MEM       TYS       IAH       CID
#  543.3598 1912.5540  307.6851    0.0000  285.6848    0.0000    0.0000  882.3557    0.0000 2500.2793
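Note that Vectorize does not make the underlying function any faster: it returns a wrapper that calls mapply, so the original function still runs once per pair. Here is a minimal sketch of that, which also shows the mapply form mentioned at the top. It uses a hypothetical stand-in, scalar_fn, that accepts only one pair at a time and returns a placeholder value rather than a distance.

```r
# Hypothetical stand-in for a function that handles one pair at a time.
calls <- 0
scalar_fn <- function(a, b) {
  stopifnot(length(a) == 1, length(b) == 1)  # rejects longer inputs
  calls <<- calls + 1                        # count how often we are called
  nchar(a) + nchar(b)                        # placeholder result, not a distance
}

orig <- c("LAX", "BUF", "ATL")
dest <- c("SFO", "MIA", "CAE")

# mapply walks both vectors in parallel, one call per pair.
via_mapply <- mapply(scalar_fn, orig, dest)

# Vectorize returns a wrapper that does the same thing internally.
via_vectorize <- Vectorize(scalar_fn)(orig, dest)

all(via_mapply == via_vectorize)  # both approaches give the same values
calls                             # total calls: one per pair, per approach
```

With the real function, the equivalent mapply call would be mapply(airport_distance, df1$orig_station, df1$dest_station). All of these options are convenient for a handful of rows, but the per-pair cost stays the same, which matters on larger data.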
Or using tidyverse
library(dplyr)
library(purrr)
df1 %>%
mutate(dist = map2_dbl(orig_station, dest_station, airport_distance))
-output
#   orig_station dest_station      dist
#1           LAX          SFO  543.3598
#2           BUF          MIA 1912.5540
#3           ATL          CAE  307.6851
#4           DEN          DEN    0.0000
#5           ORD          IND  285.6848
#6           DEN          DEN    0.0000
#7           MEM          MEM    0.0000
#8           TYS          TPA  882.3557
#9           IAH          IAH    0.0000
#10          CID          PDX 2500.2793
Or using rowwise
df1 %>%
rowwise %>%
mutate(dist = airport_distance(orig_station, dest_station)) %>%
ungroup
Upvotes: 2