Reputation: 17359
I have a DataFrame containing objects and their coordinates:
id lat lng
0 3816 18.384001 -66.114799
1 5922 20.766100 -156.434998
2 1527 21.291394 -157.843085
3 1419 21.291394 -157.843085
4 1651 21.291394 -157.843085
Multiple objects can have the same coordinates. The dataframe is big (millions of records). I have a target point with coordinates (target_lat, target_lng). My goal is to find objects in the dataframe which are within X miles of the target point, as efficiently as possible.
I'm using a haversine_np function adapted from this question. It takes parameters (lat_series, lng_series, lat, lng) and efficiently computes all distances between lat_series, lng_series (two Series) and (lat, lng) (two numbers).
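For reference, a minimal sketch of such a function, following the usual vectorized NumPy haversine pattern with the parameter order described above (beware: adapted versions differ in argument order, and the linked original takes longitudes first; the constant 3956 approximates the Earth's radius in miles, i.e. the 6367 km used by the original):
import numpy as np

def haversine_np(lat_series, lng_series, lat, lng):
    # np.radians and the trig ufuncs broadcast over Series and scalars alike
    lat1, lng1, lat2, lng2 = map(np.radians, [lat_series, lng_series, lat, lng])
    a = np.sin((lat2 - lat1) / 2.0) ** 2 \
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2.0) ** 2
    return 3956 * 2 * np.arcsin(np.sqrt(a))  # great-circle distance in miles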
Now my question is how to use it to filter the distances and select objects in the original dataframe.
This is my current solution:
grouper = df.groupby(['lat', 'lng'], sort=False).grouper
lat_series = grouper.result_index.get_level_values(0) # lats of unique (lat, lng) pairs
lng_series = grouper.result_index.get_level_values(1) # lngs of unique (lat, lng) pairs
df['location_index'] = grouper.group_info[0] # assign index of group back to df
distances = haversine_np(lat_series, lng_series, target_lat, target_lng)
mask = distances <= 50 # let's say 50 miles; boolean mask of size = ngroups
loc_indexes = pd.Series(range(grouper.ngroups))[mask] # select group indexes by mask
df[df.location_index.isin(loc_indexes)] # select original records by group indexes
It seems to work, although it doesn't look reliable: when I select the relevant group indexes using pd.Series(range(grouper.ngroups))[mask], I assume that the level values of the grouper are naturally indexed (from 0 to ngroups-1). In other words, I'm relying on the i-th element of grouper.result_index.get_level_values() corresponding to the group with label i in grouper.group_info[0]. I couldn't find a more explicit way to get that mapping.
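A grouper-free equivalent built only from documented pandas operations would look something like this (a sketch; note that merge discards the original index):
uniq = df[['lat', 'lng']].drop_duplicates()  # one row per unique coordinate pair
distances = haversine_np(uniq.lat, uniq.lng, target_lat, target_lng)
near = uniq[distances <= 50]                 # unique pairs within the threshold
result = df.merge(near, on=['lat', 'lng'])   # back to the original records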
Questions: is this assumption safe, and is there a more explicit (or more efficient) way to map the group labels back to the original records?
Upvotes: 1
Views: 814
Reputation: 210972
UPDATE: @DennisGolomazov found that this "prefiltering" does not work properly for longitudes and provided a very good counterexample. Here is a small demo:
In [115]: df
Out[115]:
id lat lng
5 4444 40.0 -121.0
0 1111 40.0 -120.0
In [116]: %paste
threshold = 60
max_lng_factor = 69.17
max_lat_factor = 69.41
target_lat, target_lng = 40, -120
mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor) \
& \
df.lng.sub(target_lng).abs().le(threshold/max_lng_factor)
x = df.loc[mask, ['lat','lng']].drop_duplicates()
## -- End pasted text --
In [117]: x
Out[117]:
lat lng
0 40.0 -120.0
where the distance between these two coordinates is less than our threshold (60 miles):
In [119]: haversine_np(-120, 40, -121, 40)
Out[119]: 52.895043596886239
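The reason is geometric: one degree of longitude spans roughly 69.17 * cos(latitude) miles, so at latitude 40 a whole degree is only about 53 miles, comfortably under the 60-mile threshold, while the prefilter window of 60/69.17 degrees rejects it. A quick check of the numbers (sketch):
import numpy as np
69.17 * np.cos(np.radians(40))  # ~52.99 miles per degree of longitude at latitude 40
60 / 69.17                      # ~0.867 degrees -- the widest window the prefilter allows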
Conclusion: we can safely prefilter latitudes, but not longitudes:
In [131]: df
Out[131]:
id lat lng
5 4444 40.0 -121.0
0 1111 40.0 -120.0
1 2222 42.0 -121.0
Correct prefiltering:
In [132]: mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor)
...: x = df.loc[mask, ['lat','lng']].drop_duplicates()
...:
In [133]: x
Out[133]:
lat lng
5 40.0 -121.0
0 40.0 -120.0
Check:
In [135]: df.reset_index() \
...: .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
...: .query("distance <= @threshold"),
...: on=['lat','lng'])
...:
Out[135]:
index id lat lng distance
0 5 4444 40.0 -121.0 52.895044
1 0 1111 40.0 -120.0 0.000000
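Putting the corrected approach together as a reusable helper (a sketch, assuming the same haversine_np and the latitude factor from the GIS link in the old answer below):
def find_nearby(df, target_lat, target_lng, threshold=60, max_lat_factor=69.41):
    # safe prefilter on latitude only: miles per degree of latitude is
    # nearly constant, so this cannot drop points within the threshold
    mask = df.lat.sub(target_lat).abs().le(threshold / max_lat_factor)
    # exact haversine on the surviving unique coordinate pairs
    x = df.loc[mask, ['lat', 'lng']].drop_duplicates()
    x = x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
    return df.reset_index().merge(x.query("distance <= @threshold"),
                                  on=['lat', 'lng'])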
Old, partially incorrect answer:
I would try to do prefiltering in order to optimize the calculations. For example, you can cheaply filter out the points that are definitely outside your "rectangle of interest".
Demo:
threshold = 100
# http://gis.stackexchange.com/questions/142326/calculating-longitude-length-in-miles/142327#142327
max_lng_factor = 69.17
max_lat_factor = 69.41
target_lat, target_lng = 21.29, -157.84
mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor) \
& \
df.lng.sub(target_lng).abs().le(threshold/max_lng_factor)
x = df.loc[mask, ['lat','lng']].drop_duplicates()
df.reset_index() \
.merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
.query("distance <= @threshold"),
on=['lat','lng']) \
.drop('distance', axis=1) \
.set_index('index')
Result:
id lat lng
index
1 5922 20.766100 -156.434998
2 1527 21.291394 -157.843085
3 1419 21.291394 -157.843085
4 1651 21.291394 -157.843085
Upvotes: 1
Reputation: 1145
Maybe I'm missing something on the efficiency, but I don't understand why you are using the .grouper method. To get the lat and lng series, just reference them directly, i.e. df['lat'] or df.lat; then you can compute all the distances with
distances = haversine_np(df.lat, df.lng, target_lat, target_lng)
and create a mask with
mask = distances <= 50
The mask is aligned with the dataframe's index, so df[mask] will return only the rows where the mask is True.
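Put together as a sketch (same haversine_np assumed, 50-mile threshold); note this computes one distance per row rather than one per unique coordinate pair:
distances = haversine_np(df.lat, df.lng, target_lat, target_lng)  # one distance per row
result = df[distances <= 50]  # rows within 50 miles of the target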
Upvotes: 0