Angie

Reputation: 327

Finding distance between lat-long coordinates taking a long time in R

I currently have a data frame (lang.py) that holds pairs of latitude and longitude coordinates, and I want the distance between the two points on each row. I'm using the distHaversine() function from the geosphere package to compute it.

This is a sample of my data (which has 25200 rows): [image: sample rows of lang.py]

Originally I tried:

lang.py$distance = with(lang.py, distm(cbind(lon_x, lat_x), cbind(lon_y, lat_y), distHaversine))

But this was taking a long time to run, so I looked at the output for just the first 4 rows. That produced a 4x4 matrix of values rather than a single column of distances, so I assume that for the entire dataset my code was producing a 25200x25200 matrix of distance values.

For example, this is what the first 4 rows were outputting:

with(lang.py[1:4,], distm(cbind(lon_x, lat_x), cbind(lon_y, lat_y), distHaversine))

[image: the resulting 4x4 matrix]

To fix this, I attempted to take the diagonal of the matrix to get a single column of values:

lang.py$distance = diag(with(lang.py, distm(cbind(lon_x, lat_x), cbind(lon_y, lat_y), distHaversine)))

But this was also taking an extremely long time to run. Any ideas on how to make this more efficient? I am trying to find the distance between (lat_x, lon_x) and (lat_y, lon_y). Thanks

Upvotes: 0

Views: 278

Answers (1)

r2evans

Reputation: 160447

  1. Running diag on a long-running process still runs all of the long-running process and then filters out all except the diagonal elements. There's nothing to "inform" the inner code to only operate on specific elements.

  2. Taking the first row would have been simpler than diag, since all numbers in each column (in this example) are identical.

  3. If you look at the source code for geosphere::distm, you'll see that it calculates the distance between the first row of the first argument (effectively cbind(lon_x, lat_x)[1,]) and all rows of the second argument, then between the second row of the first argument and all rows of the second argument, and so on. The reason you are seeing a matrix of identical values within each column is that in the sample above, your lon_x/lat_x are all the same. This is producing a distance matrix, not the distance between two points at a time.
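To make the difference in output shape concrete, here is a small sketch that uses the same three points as the sample data. The variable names p1, p2, m, and d are only for illustration.

```r
library(geosphere)

# Origin and destination points as (lon, lat) matrices, same values as the sample data
p1 <- cbind(lon = c(66, 66, 66), lat = c(35, 35, 35))
p2 <- cbind(lon = c(20, 5, -171.83), lat = c(41, 36.5, -13.92))

# distm: every row of p1 against every row of p2, so n x n values
m <- distm(p1, p2, fun = distHaversine)
dim(m)

# distHaversine: row i of p1 against row i of p2 only, so n values
d <- distHaversine(p1, p2)
length(d)

# The diagonal of the matrix holds the same row-wise distances
all.equal(diag(m), d)
```

With 25200 rows, the matrix form computes 25200 x 25200 distances where only 25200 are needed.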

It seems that you don't need a distance matrix, you just need one distance per row.

with(lang.py, distHaversine(cbind(lon_x, lat_x), cbind(lon_y, lat_y)))
# [1]  4042785  5417756 13819986

This will calculate one distance for each row; there is no comparison of lat/lon on one row with lat/lon on another row ... and since you're looking at a data.frame, that makes sense to me.
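For anyone curious what one distance per row amounts to, below is a minimal base-R sketch of the haversine formula, vectorized over rows. The helper name haversine_m is hypothetical and this is not geosphere's own source. It only assumes the default sphere radius that distHaversine documents (6378137 m), so results may differ slightly from the package in edge cases.

```r
# Vectorized haversine distance; inputs in degrees, result in metres.
# r = 6378137 is assumed to match the default radius of geosphere::distHaversine.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# Same sample data as in this answer
lang.py <- data.frame(
  lat_x = c(35, 35, 35), lon_x = c(66, 66, 66),
  lat_y = c(41, 36.5, -13.92), lon_y = c(20, 5, -171.83)
)

# One value per row, stored directly as a new column
lang.py$distance <- with(lang.py, haversine_m(lon_x, lat_x, lon_y, lat_y))
```

Because every operation is element-wise, the work grows linearly with the number of rows rather than quadratically.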


Data, so that anybody else can try to work your code (this should be you providing it, not me):

lang.py <- structure(list(lat_x = c(35, 35, 35), lon_x = c(66, 66, 66), lat_y = c(41, 36.5, -13.92), lon_y = c(20, 5, -171.83)), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 1
