R Apply distance function on all rows of data.frame

Question

I have a data.frame (see below) of airport codes. I'm trying to run airportr::airport_distance to get the distance between each origin/destination pair, but I get an error when I run it on the full data frame (see code below). Any ideas why this won't work?

df1 <- structure(list(orig_station = c("LAX", "BUF", "ATL", "DEN", "ORD", 
"DEN", "MEM", "TYS", "IAH", "CID"), dest_station = c("SFO", "MIA", 
"CAE", "DEN", "IND", "DEN", "MEM", "TPA", "IAH", "PDX")), row.names = c(NA, 
10L), class = "data.frame")

library(airportr)

# This is the call that errors: whole columns are passed in at once
df1$dist <- airport_distance(df1$orig_station, df1$dest_station)

Onyambu · Accepted Answer

I looked into the airport_distance function and noticed that it is not vectorized: it handles one origin/destination pair per call. That is a problem with a large dataset, because computing the distances one row at a time becomes impractically slow. You should probably consider writing a vectorized function. A simple case would be:

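vec_dist <- function(df){
  # df must hold exactly two columns: origin and destination IATA codes
  air <- unlist(df)
  # look up each distinct airport once (airports is the airportr dataset,
  # so this assumes library(airportr) has been loaded)
  match1 <- dplyr::filter(airports, IATA %in% unique(air))
  point <- match(air, match1$IATA)
  # coordinates in radians; column 1 = origin, column 2 = destination
  lon <- matrix((match1$Longitude * pi/180)[point], ncol = 2)
  lat <- matrix((match1$Latitude * pi/180)[point], ncol = 2)
  radius <- 6373  # Earth radius in km
  dlon <- lon[,2] - lon[,1]
  dlat <- lat[,2] - lat[,1]
  # haversine formula, applied to all rows at once
  a <- (sin(dlat/2))^2 + cos(lat[,1]) * cos(lat[,2]) * (sin(dlon/2))^2
  b <- 2 * atan2(sqrt(a), sqrt(1 - a))
  cbind(df, dist = radius * b)
}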

vec_dist(df1)
   orig_station dest_station      dist
1           LAX          SFO  543.3598
2           BUF          MIA 1912.5540
3           ATL          CAE  307.6851
4           DEN          DEN    0.0000
5           ORD          IND  285.6848
6           DEN          DEN    0.0000
7           MEM          MEM    0.0000
8           TYS          TPA  882.3557
9           IAH          IAH    0.0000
10          CID          PDX 2500.2793
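The distance step inside vec_dist is the haversine formula applied to whole vectors at once. As a minimal sketch that runs without airportr, the same arithmetic can be isolated in a helper; the name haversine_km and the hand-entered, approximate coordinates below are assumptions for illustration only and are not taken from the airports table:

```r
# Vectorized haversine distance in km. Inputs are in degrees and may be
# vectors of equal length; the radius matches the one used in vec_dist.
haversine_km <- function(lat1, lon1, lat2, lon2, radius = 6373) {
  to_rad <- pi / 180
  lat1 <- lat1 * to_rad
  lon1 <- lon1 * to_rad
  lat2 <- lat2 * to_rad
  lon2 <- lon2 * to_rad
  dlat <- lat2 - lat1
  dlon <- lon2 - lon1
  a <- sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
  2 * radius * atan2(sqrt(a), sqrt(1 - a))
}

# Approximate coordinates typed in by hand (assumption, not package data):
# row 1 is roughly LAX to SFO, row 2 is a point to itself,
# row 3 is a quarter of the way around the equator.
lat_from <- c(33.9425, 39.8617, 0)
lon_from <- c(-118.4081, -104.6731, 0)
lat_to   <- c(37.6190, 39.8617, 0)
lon_to   <- c(-122.3750, -104.6731, 90)

# One call computes all three distances; no loop over rows is needed.
d <- haversine_km(lat_from, lon_from, lat_to, lon_to)
print(d)
```

The first value should land close to the LAX/SFO figure in the output above, and the identical pair gives zero, just like the DEN/DEN row.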

Why would you consider writing your own function? A quick benchmark gives you the idea:

microbenchmark::microbenchmark(vec_dist(df1),
   unlist_Map=unlist(Map(airport_distance, df1$orig_station, df1$dest_station)),
   apply_=apply(df1[c('orig_station', 'dest_station')], 1, function(x) airport_distance(x[1], x[2])),
   vectorize=Vectorize(airport_distance)(df1$orig_station, df1$dest_station), times=2)
Unit: milliseconds
          expr        min         lq       mean     median         uq        max neval
 vec_dist(df1)   3.176101   3.176101   3.536051   3.536051   3.896001   3.896001     2
    unlist_Map 431.611700 431.611700 498.710251 498.710251 565.808801 565.808801     2
        apply_ 572.807201 572.807201 577.864401 577.864401 582.921601 582.921601     2
     vectorize 483.825801 483.825801 528.993851 528.993851 574.161900 574.161900     2

And that is with only 10 rows. What happens when the data grows, even if it mostly repeats the same points? Repeating each row 100 times gives 1,000 rows:

df1 <- df1[rep(1:10, each=100), ]

Unit: milliseconds
          expr          min           lq         mean       median         uq        max neval
 vec_dist(df1)     7.084901     7.084901     8.564601     8.564601    10.0443    10.0443     2
    unlist_Map 45161.593601 45161.593601 45229.421051 45229.421051 45297.2485 45297.2485     2
        apply_ 45536.644800 45536.644800 53869.454001 53869.454001 62202.2632 62202.2632     2
     vectorize 45286.505601 45286.505601 51775.855502 51775.855502 58265.2054 58265.2054     2
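For reference, the three slower entries in the benchmarks (Map, apply and Vectorize) all do the same thing: they call a scalar function once per row. The sketch below shows that pattern with a hypothetical stand-in, route_label, which rejects anything but a single pair. It is not airport_distance and makes no claim about how that function fails internally; it only illustrates why passing whole columns breaks a scalar function and how the row-wise wrappers get around it:

```r
# Hypothetical scalar-only function (a stand-in for illustration):
# it accepts exactly one origin and one destination.
route_label <- function(orig, dest) {
  stopifnot(length(orig) == 1L, length(dest) == 1L)
  paste0(orig, "->", dest)
}

routes <- data.frame(
  orig = c("LAX", "BUF", "ATL"),
  dest = c("SFO", "MIA", "CAE"),
  stringsAsFactors = FALSE
)

# Passing whole columns fails, because the function expects scalars.
direct <- tryCatch(route_label(routes$orig, routes$dest),
                   error = function(e) "error")

# Three equivalent ways to call the function once per row.
via_map       <- unlist(Map(route_label, routes$orig, routes$dest))
via_apply     <- apply(routes, 1, function(x) route_label(x[1], x[2]))
via_vectorize <- Vectorize(route_label)(routes$orig, routes$dest)

print(direct)
print(unname(via_map))
```

These wrappers fix the error, but each still makes one function call per row, which is why the vectorized vec_dist scales so much better in the timings above.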
