Reputation: 85
I have two data frames. One holds user ids with latitude/longitude data, and the other holds store codes with store latitude/longitude data. There are around 89M rows. For each user's lat/lon, I want the store code of the nearest store (the one at minimum distance).
df1 -
id user_lat user_lon
1 13.031885 80.235574
2 19.099819 72.915288
3 22.226980 84.836070
df2 -
store_no s_lat s_lon
22 29.91 73.88
23 28.57 77.33
24 26.86 80.95
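What I have done so far:
import numpy as np
import pandas as pd
# In newer scikit-learn releases DistanceMetric is imported from sklearn.metrics instead
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
df1 = df1[['user_lat','user_lon']]
df2 = df2[['s_lat','s_lon']]
# Cross join: pair every user with every store through a constant key
x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
    .drop(columns='k')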
x.head(20)
user_lat user_lon s_lat s_lon
0 13.031885 80.235574 29.91 73.88
1 13.031885 80.235574 28.57 77.33
2 13.031885 80.235574 26.86 80.95
3 19.099819 72.915288 29.91 73.88
4 19.099819 72.915288 28.57 77.33
5 19.099819 72.915288 26.86 80.95
6 22.226980 84.836070 29.91 73.88
7 22.226980 84.836070 28.57 77.33
8 22.226980 84.836070 26.86 80.95
# x is ordered user-major (each user paired with every store in turn),
# so the users must be the first argument to pairwise() for ravel() to line up
x['dist'] = np.ravel(dist.pairwise(np.radians(df1[['user_lat', 'user_lon']]), np.radians(df2[['s_lat', 's_lon']])) * 6367)
user_lat user_lon s_lat s_lon dist
0 13.031885 80.235574 29.91 73.88 1986.237557
1 13.031885 80.235574 28.57 77.33 1752.628427
2 13.031885 80.235574 26.86 80.95 1538.449674
3 19.099819 72.915288 29.91 73.88 1205.217610
4 19.099819 72.915288 28.57 77.33 1143.731258
5 19.099819 72.915288 26.86 80.95 1190.620278
6 22.226980 84.836070 29.91 73.88 1386.069611
7 22.226980 84.836070 28.57 77.33 1031.246453
8 22.226980 84.836070 26.86 80.95 647.477461
But I want the data frame to look like this, with the nearest store's number for each user:
user_lat user_lon s_lat s_lon dist store_no
0 13.031885 80.235574 29.91 73.88 1986.237557 24
1 13.031885 80.235574 28.57 77.33 1752.628427 24
2 13.031885 80.235574 26.86 80.95 1538.449674 24
3 19.099819 72.915288 29.91 73.88 1205.217610 23
4 19.099819 72.915288 28.57 77.33 1143.731258 23
5 19.099819 72.915288 26.86 80.95 1190.620278 23
6 22.226980 84.836070 29.91 73.88 1386.069611 24
7 22.226980 84.836070 28.57 77.33 1031.246453 24
8 22.226980 84.836070 26.86 80.95 647.477461 24
Upvotes: 0
Views: 806
Reputation: 11105
Finding the nearest store for each user is a classic use case for the k-d tree or ball tree data structures. Scikit-learn implements both, but only BallTree accepts the haversine distance metric, so we'll use that.
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree
# Set up example data
df1 = pd.DataFrame({'id': [1, 2, 3],
                    'user_lat': [13.031885, 19.099819, 22.22698],
                    'user_lon': [80.235574, 72.915288, 84.83607]})
df2 = pd.DataFrame({'store_no': [22, 23, 24],
                    's_lat': [29.91, 28.57, 26.86],
                    's_lon': [73.88, 77.33, 80.95]})
# Build a ball tree with the haversine distance metric, which expects
# (lat, lon) in radians and returns distances in radians
tree = BallTree(np.radians(df2[['s_lat', 's_lon']]), metric='haversine')
coords = np.radians(df1[['user_lat', 'user_lon']])
dists, ilocs = tree.query(coords)
# dists is in rad; convert to km
df1['dist'] = dists.flatten() * 6367
df1['nearest_store'] = df2.iloc[ilocs.flatten()]['store_no'].values
# Result:
df1
id user_lat user_lon dist nearest_store
0 1 13.031885 80.235574 1538.449674 24
1 2 19.099819 72.915288 1143.731258 23
2 3 22.226980 84.836070 647.477461 24
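To sanity-check the tree's output on a small sample, the same lookup can be done by brute force with the haversine formula. The following is only a minimal sketch using the standard library: `haversine_km` and `nearest_store` are names made up for illustration, and it assumes the same 6367 km Earth radius used above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6367  # assumption: same radius as in the snippets above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Example data from the question: (id, lat, lon) and (store_no, lat, lon)
users = [(1, 13.031885, 80.235574),
         (2, 19.099819, 72.915288),
         (3, 22.226980, 84.836070)]
stores = [(22, 29.91, 73.88),
          (23, 28.57, 77.33),
          (24, 26.86, 80.95)]

def nearest_store(user_lat, user_lon):
    """Return (store_no, distance_km) for the closest store (hypothetical helper)."""
    candidates = ((no, haversine_km(user_lat, user_lon, s_lat, s_lon))
                  for no, s_lat, s_lon in stores)
    return min(candidates, key=lambda pair: pair[1])

# Map each user id to its nearest store and the distance to it
results = {uid: nearest_store(lat, lon) for uid, lat, lon in users}
for uid, (store_no, km) in results.items():
    print(uid, store_no, round(km, 2))
```

This compares every user with every store, so it is only practical for spot-checking a handful of rows; the tree query is what scales to the full 89M rows.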
Upvotes: 2