Dennis Golomazov

Reputation: 17359

Pandas apply function to groups, and filter the original dataframe

I have a DataFrame containing objects and their coordinates:

      id        lat         lng
0   3816  18.384001  -66.114799
1   5922  20.766100 -156.434998
2   1527  21.291394 -157.843085
3   1419  21.291394 -157.843085
4   1651  21.291394 -157.843085

Multiple objects can have the same coordinates. The dataframe is large (millions of records). I have a target point with coordinates (target_lat, target_lng). My goal is to find the objects in the dataframe that are within X miles of the target point, as efficiently as possible.

I'm using a haversine_np function adapted from this question. It takes the parameters (lat_series, lng_series, lat, lng) and efficiently computes all distances between (lat_series, lng_series) (two Series) and (lat, lng) (two numbers).

Now my question is how to use it to filter the distances and select objects in the original dataframe.

This is my current solution:

grouper = df.groupby(['lat', 'lng'], sort=False).grouper
lat_series = grouper.result_index.get_level_values(0)  # lats of unique (lat, lng) pairs
lng_series = grouper.result_index.get_level_values(1)  # lngs of unique (lat, lng) pairs
df['location_index'] = grouper.group_info[0]  # assign index of group back to df
distances = haversine_np(lat_series, lng_series, target_lat, target_lng)
mask = distances <= 50  # let's say 50 miles; boolean mask of size = ngroups
loc_indexes = pd.Series(range(grouper.ngroups))[mask]  # select group indexes by mask
df[df.location_index.isin(loc_indexes)]  # select original records by group indexes

It seems to work, but it doesn't look reliable: when I select the relevant group indexes with pd.Series(range(grouper.ngroups))[mask], I assume that the groups are naturally indexed (from 0 to ngroups - 1). In other words, I'm relying on the i-th element of grouper.result_index.get_level_values() corresponding to the group labelled i in grouper.group_info[0]. I couldn't find a more explicit way to get that mapping.
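For comparison, the same labelling can be obtained without touching internal attributes. This is a sketch using the public `GroupBy.ngroup()` method, which numbers groups 0..ngroups-1 in order of first appearance when `sort=False`, so it lines up with the order `drop_duplicates` preserves:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'id':  [3816, 5922, 1527, 1419, 1651],
    'lat': [18.384001, 20.766100, 21.291394, 21.291394, 21.291394],
    'lng': [-66.114799, -156.434998, -157.843085, -157.843085, -157.843085],
})

# public-API group labels: each row gets its group's number,
# assigned in order of first appearance because sort=False
df['location_index'] = df.groupby(['lat', 'lng'], sort=False).ngroup()

# unique (lat, lng) pairs, also in order of first appearance,
# so row i here corresponds to label i above
unique_coords = df[['lat', 'lng']].drop_duplicates().reset_index(drop=True)
```

Both structures are ordered by first appearance, which is what makes the i-th unique pair correspond to label i.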

Questions:

  1. Is the method I'm using reliable?
  2. Is there a better (safer / more concise / more efficient) method?

Upvotes: 1

Views: 814

Answers (2)

MaxU - stand with Ukraine

Reputation: 210972

UPDATE: @DennisGolomazov found out that this "prefiltering" does not work properly for longitudes and provided a very good counterexample. Here is a small demo:

In [115]: df
Out[115]:
     id   lat    lng
5  4444  40.0 -121.0
0  1111  40.0 -120.0

In [116]: %paste
threshold = 60
max_lng_factor = 69.17
max_lat_factor = 69.41
target_lat, target_lng = 40, -120
mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor) \
       & \
       df.lng.sub(target_lng).abs().le(threshold/max_lng_factor)
x = df.loc[mask, ['lat','lng']].drop_duplicates()
## -- End pasted text --

In [117]: x
Out[117]:
    lat    lng
0  40.0 -120.0

where the distance between these two coordinates is less than our threshold (60 miles):

In [119]: haversine_np(-120, 40, -121, 40)
Out[119]: 52.895043596886239
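The reason is that a degree of longitude spans fewer miles the further you get from the equator: it shrinks by cos(latitude). A quick check (using the 69.17 miles-per-degree equatorial figure from the prefilter above) shows why the filter wrongly dropped the point at latitude 40:

```python
import math

# a degree of longitude covers ~69.17 miles at the equator,
# but only 69.17 * cos(latitude) miles elsewhere
miles_per_degree = 69.17 * math.cos(math.radians(40.0))
print(miles_per_degree)  # ~53 miles at latitude 40
```

So at latitude 40 a one-degree longitude gap is only about 53 miles, well under the 60-mile threshold, while the prefilter assumed a full degree was 69.17 miles and excluded it.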

Conclusion: we can safely prefilter on latitude, but not on longitude:

In [131]: df
Out[131]:
     id   lat    lng
5  4444  40.0 -121.0
0  1111  40.0 -120.0
1  2222  42.0 -121.0

Correct prefiltering:

In [132]: mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor)
     ...: x = df.loc[mask, ['lat','lng']].drop_duplicates()
     ...:

In [133]: x
Out[133]:
    lat    lng
5  40.0 -121.0
0  40.0 -120.0

Check:

In [135]: df.reset_index() \
     ...:   .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
     ...:           .query("distance <= @threshold"),
     ...:          on=['lat','lng'])
     ...:
Out[135]:
   index    id   lat    lng   distance
0      5  4444  40.0 -121.0  52.895044
1      0  1111  40.0 -120.0   0.000000
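Putting the corrected approach together as one runnable sketch. Note that haversine_np is not shown in the question, so the implementation below is an assumption: a common vectorized haversine with an Earth radius of roughly 3956 miles, taking longitudes first to match the calls in this answer:

```python
import numpy as np
import pandas as pd

def haversine_np(lon1, lat1, lon2, lat2):
    """Vectorized great-circle distance in miles (assumed implementation)."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3956 * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({'id':  [4444, 1111, 2222],
                   'lat': [40.0, 40.0, 42.0],
                   'lng': [-121.0, -120.0, -121.0]}, index=[5, 0, 1])

threshold = 60
max_lat_factor = 69.41
target_lat, target_lng = 40, -120

# prefilter on latitude only: a degree of latitude is ~69 miles everywhere,
# so this never wrongly excludes a point
mask = df.lat.sub(target_lat).abs().le(threshold / max_lat_factor)
x = df.loc[mask, ['lat', 'lng']].drop_duplicates()

# exact haversine distance on the surviving unique pairs, merged back
result = (df.reset_index()
            .merge(x.assign(distance=haversine_np(x.lng, x.lat,
                                                  target_lng, target_lat))
                    .query("distance <= @threshold"),
                   on=['lat', 'lng']))
print(result)
```

The latitude prefilter discards the row at latitude 42 cheaply; the exact distance check then correctly keeps both remaining points, including the one at longitude -121 that the two-sided prefilter had wrongly dropped.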

Old, partially incorrect answer:

I would try to do some prefiltering in order to optimize the calculations. For example, you can easily filter out the points that are definitely outside your "rectangle of interest".

Demo:

threshold = 100

# http://gis.stackexchange.com/questions/142326/calculating-longitude-length-in-miles/142327#142327
max_lng_factor = 69.17
max_lat_factor = 69.41

target_lat, target_lng = 21.29, -157.84

mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor) \
       & \
       df.lng.sub(target_lng).abs().le(threshold/max_lng_factor)

x = df.loc[mask, ['lat','lng']].drop_duplicates()

df.reset_index() \
  .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
          .query("distance <= @threshold"),
         on=['lat','lng']) \
  .drop('distance', axis=1) \
  .set_index('index')

Result:

In [142]: df.reset_index() \
     ...:   .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
     ...:           .query("distance <= @threshold"),
     ...:          on=['lat','lng']) \
     ...:   .drop('distance', axis=1) \
     ...:   .set_index('index')
     ...:
Out[142]:
         id        lat         lng
index
1      5922  20.766100 -156.434998
2      1527  21.291394 -157.843085
3      1419  21.291394 -157.843085
4      1651  21.291394 -157.843085

Upvotes: 1

Jim Parker

Reputation: 1145

Maybe I'm missing something on the efficiency, but I don't understand why you are using the .grouper attribute. To get the lat and lng series, just reference them directly, i.e. df['lat'] or df.lat; then you can compute the distances with

distances = haversine_np(df.lat, df.lng, target_lat, target_lng)

and create a mask with

mask = distances <= 50

The mask is now indexed to the dataframe.

df[mask]

will return only the rows where the mask is True.
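Over the question's sample data, this direct approach looks like the sketch below. Since haversine_np isn't shown in the question, the implementation here is an assumption, following the (lat_series, lng_series, lat, lng) argument order the question describes:

```python
import numpy as np
import pandas as pd

def haversine_np(lat1, lng1, lat2, lng2):
    """Vectorized great-circle distance in miles (assumed implementation);
    argument order follows the signature described in the question."""
    lat1, lng1, lat2, lng2 = map(np.radians, [lat1, lng1, lat2, lng2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 3956 * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    'id':  [3816, 5922, 1527, 1419, 1651],
    'lat': [18.384001, 20.766100, 21.291394, 21.291394, 21.291394],
    'lng': [-66.114799, -156.434998, -157.843085, -157.843085, -157.843085],
})
target_lat, target_lng = 21.29, -157.84

# one distance per row; the result is a Series aligned to df's index
distances = haversine_np(df.lat, df.lng, target_lat, target_lng)
mask = distances <= 50
print(df[mask])  # only the rows within 50 miles of the target
```

This computes one distance per row rather than per unique coordinate pair, which is simpler but does redundant work when many rows share coordinates.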

Upvotes: 0
