andy

Reputation: 2181

How to quickly calculate distances from longitude and latitude data in pandas?

I have a pandas table with position data. For example, New York City has many schools, hospitals, shops, etc., each with its own longitude and latitude.

I have 40,000 rows and two columns (longitude and latitude). I want to calculate the distance between every pair of rows (40,000 × 40,000 in total).

I used the haversine formula (Haversine Formula in Python (Bearing and Distance between two GPS points)) to do that with pandas.

A simplified version of the code:

results = df.apply(lambda x: haversine(x["lon"], x["lat"], test_lon, test_lat), axis=1)

I use every row in turn as test_lon, test_lat, and the whole calculation takes 10 hours. I don't understand why it takes so long.

Does anyone have a good idea for doing this faster?

Upvotes: 0

Views: 673

Answers (1)

I have a workaround that I have been using because I had the same issue with the Swedish transportation system in Stockholm. It is ugly, but it works quite well and might be useful. I make a copy of my original data:

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_A = pd.DataFrame({
    'Stopp_A' :     ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'latitude_A':  [ 56.75,56.19,56.08,51.28,52.36,51.29,51.87,52.61],
    'longitude_A': [18.39,18.82, 18.65,18.74,18.06,18.61,18.27,18.20]
})

locations_B = pd.DataFrame({
    'Stopp_B' :     ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'latitude_B':  [ 56.75,56.19,56.08,51.28,52.36,51.29,51.87,52.61],
    'longitude_B': [18.39,18.82,18.65,18.74,18.06,18.61,18.27,18.20]
})

As you can see, the copy uses the suffix _B instead of _A in its column names (for example, Stopp_A becomes Stopp_B). After this, I convert the coordinates to radians and create a distance matrix:

locations_A[['lat_radians_A','long_radians_A']] = (
    np.radians(locations_A.loc[:,['latitude_A','longitude_A']])
)
locations_B[['lat_radians_B','long_radians_B']] = (
    np.radians(locations_B.loc[:,['latitude_B','longitude_B']])
)

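# The haversine metric expects [latitude, longitude] pairs in radians.
# Note: newer scikit-learn releases expose DistanceMetric under sklearn.metrics instead.
dist = sklearn.neighbors.DistanceMetric.get_metric('haversine')
dist_matrix = dist.pairwise(
    locations_A[['lat_radians_A','long_radians_A']],
    locations_B[['lat_radians_B','long_radians_B']]
) * 6371  # Earth's mean radius in kilometers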

df_dist_matrix = (
    pd.DataFrame(dist_matrix,index=locations_A['Stopp_A'], 
                 columns=locations_B['Stopp_B'])
)

df_dist = (
    pd.melt(df_dist_matrix.reset_index(),id_vars='Stopp_A')
)
df_dist = df_dist.rename(columns={'value':'Kilometers'})

which returns:

   Stopp_A Stopp_B   Kilometers
0        A       A     0.000000
1        B       A  2088.626114
2        C       A  2043.060585
3        D       A   950.191543
4        E       A  1506.375876
..     ...     ...          ...
59       D       H  3051.681403
60       E       H  3990.191284
61       F       H  3737.181244
62       G       H  1083.053543
63       H       H     0.000000
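If you want to see what the haversine metric is doing, or avoid the scikit-learn dependency, the same pairwise calculation can be sketched with plain NumPy broadcasting. This is only a minimal sketch, not the code I ran: the function name haversine_matrix and the constant EARTH_RADIUS_KM are my own, and it assumes coordinates in degrees and the same 6371 km mean radius as above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, same value as above


def haversine_matrix(lat_a, lon_a, lat_b, lon_b):
    """Pairwise great-circle distances in km between two sets of points.

    Inputs are 1-D sequences in degrees; the result has shape (len(a), len(b)).
    """
    # Column vectors for set A, row vectors for set B, so that
    # subtraction broadcasts to a full (len(a), len(b)) grid.
    lat_a = np.radians(np.asarray(lat_a, dtype=float))[:, None]
    lon_a = np.radians(np.asarray(lon_a, dtype=float))[:, None]
    lat_b = np.radians(np.asarray(lat_b, dtype=float))[None, :]
    lon_b = np.radians(np.asarray(lon_b, dtype=float))[None, :]

    h = (np.sin((lat_b - lat_a) / 2) ** 2
         + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
    # Clip guards against rounding pushing h marginally outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))


lat = np.array([56.75, 56.19, 56.08])
lon = np.array([18.39, 18.82, 18.65])
km = haversine_matrix(lat, lon, lat, lon)  # 3 x 3 matrix, zeros on the diagonal
```

There is no Python-level loop here, which is where the speed-up over a row-by-row apply comes from.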

Computing the whole distance matrix in one vectorized call with DistanceMetric reduced my computation time significantly.
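One caveat for the 40,000-row case in the question: a full 40,000 × 40,000 matrix of float64 values is 1.6 billion entries, roughly 12.8 GB, so it may not fit in memory. A possible way around that is to process the rows in chunks and keep only what you need from each block. The sketch below is an assumption about what you might want (the distance to the nearest other point); the function names, chunk_size and the random test coordinates are all made up for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_block(lat_rows, lon_rows, lat_all, lon_all):
    """Distances in km from a block of rows to all points (inputs in radians)."""
    dlat = lat_all[None, :] - lat_rows[:, None]
    dlon = lon_all[None, :] - lon_rows[:, None]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(lat_rows)[:, None] * np.cos(lat_all)[None, :]
         * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))


def nearest_neighbour_km(lat_deg, lon_deg, chunk_size=1000):
    """Distance from every point to its nearest other point, computed in chunks."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    n = lat.size
    nearest = np.empty(n)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (chunk_size x n) block is held in memory at a time.
        block = haversine_block(lat[start:stop], lon[start:stop], lat, lon)
        # Mask each point's distance to itself so it is not picked as nearest.
        block[np.arange(stop - start), np.arange(start, stop)] = np.inf
        nearest[start:stop] = block.min(axis=1)
    return nearest


# Hypothetical data: 5,000 random points in a box roughly around New York City.
rng = np.random.default_rng(0)
lat = rng.uniform(40.5, 40.9, size=5000)
lon = rng.uniform(-74.3, -73.7, size=5000)
nearest = nearest_neighbour_km(lat, lon, chunk_size=1000)
```

The chunk size trades memory for the number of loop iterations; if you really need every pairwise distance, the same loop could write each block to disk instead of reducing it.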

Upvotes: 1
