Reputation: 327
I currently have a data frame (lang.py) containing pairs of latitude and longitude coordinates, and I want to calculate the distance between the two points on each row. I'm using the distHaversine() function from the geosphere package to do so.
This is a sample of my data (which has 25200 rows):
Originally I tried:
lang.py$distance = with(lang.py, distm(cbind(lon_x, lat_x), cbind(lon_y, lat_y), distHaversine))
But this was taking a long time to run, so I looked at the output for just the first 4 rows. That produced a 4x4 matrix of values rather than a single column of distances, so I assume that for the entire dataset my code was building a 25200x25200 matrix of distance values.
For example, this is what the first 4 rows were outputting:
with(lang.py[1:4,], distm(cbind(lon_x, lat_x), cbind(lon_y, lat_y), distHaversine))
To fix this, I attempted to take the diagonal of the matrix to get a single column of values:
lang.py$distance = diag(with(lang.py, distm(cbind(lon_x, lat_x), cbind(lon_y, lat_y), distHaversine)))
But this was also taking an extremely long time to run. Any ideas on how to make this more efficient? I am trying to find the distance between (lat_x, lon_x) and (lat_y, lon_y). Thanks
Upvotes: 0
Views: 278
Reputation: 160447
Running diag on a long-running process still runs the whole process and only then discards everything except the diagonal elements. Nothing "informs" the inner code to operate only on specific elements.
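To put a rough number on that cost, here is a back-of-the-envelope sketch in base R. It assumes the 25200 rows mentioned in the question and 8 bytes per stored double; the variable names are only illustrative.

```r
# Number of rows in the question's data frame
n <- 25200

# distm() on two n-row inputs fills an n x n matrix
pairwise_entries <- n^2   # distances computed for the full matrix
needed_entries   <- n     # distances actually kept by diag()

# Approximate size of the full matrix, assuming 8 bytes per double
matrix_gib <- pairwise_entries * 8 / 1024^3

pairwise_entries / needed_entries  # work done relative to work needed
round(matrix_gib, 2)               # approximate size in GiB
```

So the diag approach computes tens of thousands of times more distances than it keeps, and has to hold several GiB in memory before anything is thrown away.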
Simpler than diag would have been taking the first row, since all numbers in each column (in this example) are identical.
If you look at the source code for geosphere::distm, you'll see that it calculates the distance between the first row of the first argument (effectively cbind(lon_x, lat_x)[1,]) and all rows of the second argument, then between the second row of the first argument and all rows of the second argument, and so on. The reason you are seeing a matrix of identical values within each column is that in the sample above, your lon_x/lat_x are all the same. This produces a distance matrix, not the distance between two points at a time.
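The difference between the two shapes of computation can be sketched in base R without geosphere. The haversine() helper below is a hypothetical stand-in written for illustration, not geosphere's own code, and the radius of 6378137 m is an assumption chosen to match what I understand the package default to be.

```r
# Hypothetical helper: vectorised haversine distance in metres.
# r is the assumed Earth radius in metres.
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Same three-row sample as used in this answer
lang.py <- data.frame(
  lat_x = c(35, 35, 35), lon_x = c(66, 66, 66),
  lat_y = c(41, 36.5, -13.92), lon_y = c(20, 5, -171.83)
)
n <- nrow(lang.py)

# All-pairs: row i of the x points against every row of the y points
all_pairs <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
  all_pairs[i, ] <- with(lang.py, haversine(lon_x[i], lat_x[i], lon_y, lat_y))
}

# Row-wise: row i of the x points against row i of the y points only
row_wise <- with(lang.py, haversine(lon_x, lat_x, lon_y, lat_y))

dim(all_pairs)                        # n x n matrix
all.equal(diag(all_pairs), row_wise)  # the diagonal is the row-wise result
```

The all-pairs version does n * n distance calculations to fill the matrix, while the row-wise version does n, and the diagonal of the former is exactly the latter.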
It seems that you don't need a distance matrix, you just need distance.
with(lang.py, distHaversine(cbind(lon_x, lat_x), cbind(lon_y, lat_y)))
# [1] 4042785 5417756 13819986
This will calculate one distance for each row; there is no comparison of lat/lon on one row with lat/lon on another row ... and since you're working with a data.frame, that makes sense to me.
Data, so that anybody else can try to work your code (this should be you providing it, not me):
lang.py <- structure(list(lat_x = c(35, 35, 35), lon_x = c(66, 66, 66), lat_y = c(41, 36.5, -13.92), lon_y = c(20, 5, -171.83)), class = "data.frame", row.names = c(NA, -3L))
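Since the question was trying to store the result as a column, here is a minimal sketch of that step using the sample data. It assumes geosphere is installed; the distance_km column is just an illustrative extra and not something the question asked for.

```r
library(geosphere)  # assumed to be installed

lang.py <- structure(list(lat_x = c(35, 35, 35), lon_x = c(66, 66, 66),
                          lat_y = c(41, 36.5, -13.92),
                          lon_y = c(20, 5, -171.83)),
                     class = "data.frame", row.names = c(NA, -3L))

# One distance (in metres) per row: point x against point y on the same row
lang.py$distance <- with(lang.py,
                         distHaversine(cbind(lon_x, lat_x), cbind(lon_y, lat_y)))

# Illustrative extra column: the same distance in kilometres
lang.py$distance_km <- lang.py$distance / 1000

lang.py
```

Because distHaversine is vectorised over rows, the result has one value per row and can be assigned to the data frame directly, with no matrix built along the way.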
Upvotes: 1