Wboy

Reputation: 2542

Efficient way to find which row a pair of GPS coordinates belong to

I have a dataframe (Call it A) with a set of GPS lat/long coordinates

Lat | Long
28.6752213, 77.09311140000001

I have another CSV with many rows, over a million (call it B). It is basically a grid: each row holds the lat/long coordinates of the 4 corners of one box.

The Problem

For every row in A, I need to find which (non-unique) row in B bounds it; that is, the GPS coordinates lie inside the box described by that row in B. I have a function that returns True/False when given the coordinates from A and a row in B.

Right now I'm using a brute-force approach: iterating through the whole B dataframe and checking, for every row, whether the point falls inside that box. However, this is incredibly inefficient and very slow.

I'm sure there must be a better way to do this, as it's a common problem. Can anyone point me to one?

Thank you! :)

Edit:

Code for the function I'm using to check whether a particular gps_coord falls inside the box defined by a row:

import matplotlib.path as path
def find_if_point_in_bounding_box(row,gps_coords):
    top_left_lat = row['top_left_lat']
    top_left_long = row['top_left_long']
    top_right_lat = row['top_right_lat']
    top_right_long = row['top_right_long']
    bottom_left_lat = row['bottom_left_lat']
    bottom_left_long = row['bottom_left_long']
    bottom_right_lat = row['bottom_right_lat']
    bottom_right_long = row['bottom_right_long']

    lat,long = gps_coords
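    # create box; corners must be listed in perimeter order
    # (top-left, top-right, bottom-right, bottom-left), otherwise the
    # path crosses itself and part of the box is treated as outside
    p = path.Path([(top_left_lat, top_left_long),(top_right_lat,top_right_long),(bottom_right_lat,bottom_right_long),(bottom_left_lat,bottom_left_long)])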
    res = p.contains_points([(lat,long)])[0]
    return res

Upvotes: 0

Views: 827

Answers (1)

Martijn Pieters

Reputation: 1122022

Your 8 coordinates contain only 4 unique values per box: two latitudes (the top and bottom boundaries of the box, i.e. its northerly and southerly bounds) and two longitudes (the left and right boundaries, westerly and easterly). The 4 (lat, lon) pairs for the 4 corners simply repeat these values. So you only need to compare your positions with the 4 boundaries: the latitude should fall between (or on) the two latitude bounds, and the longitude should fall between (or on) the two longitude bounds.
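Whether every box really reduces to four boundaries can be checked on the data itself. The following is a minimal sketch, assuming the eight column names used in the question's function; the helper name is_axis_aligned and the sample values are made up for illustration.

```python
import pandas as pd

def is_axis_aligned(boxes: pd.DataFrame) -> pd.Series:
    """True for rows whose top corners share a latitude, bottom corners
    share a latitude, left corners share a longitude, and right corners
    share a longitude."""
    return (
        (boxes["top_left_lat"] == boxes["top_right_lat"])
        & (boxes["bottom_left_lat"] == boxes["bottom_right_lat"])
        & (boxes["top_left_long"] == boxes["bottom_left_long"])
        & (boxes["top_right_long"] == boxes["bottom_right_long"])
    )

# Made-up sample: the first box is axis-aligned, the second is skewed.
sample = pd.DataFrame({
    "top_left_lat": [29.0, 29.0],
    "top_left_long": [77.0, 77.0],
    "top_right_lat": [29.0, 29.5],
    "top_right_long": [78.0, 78.0],
    "bottom_left_lat": [28.0, 28.0],
    "bottom_left_long": [77.0, 77.0],
    "bottom_right_lat": [28.0, 28.5],
    "bottom_right_long": [78.0, 78.0],
})
aligned = is_axis_aligned(sample)
```

If some rows come back False, the simple boundary comparison does not apply to them as-is. The check uses exact equality; for coordinates that went through rounding, a tolerance-based comparison such as numpy.isclose may be more appropriate.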

So you can simply ask for rows that have matching bounding boxes based on one each of top_*_lat and bottom_*_lat for the latitude, and one each of *_left_long and *_right_long for the longitudes:

lat, long = <latitude>, <longitude>
matching_rows = df.query(
    # top and bottom latitudes, top lat > bottom lat, north to south
    "top_left_lat >= @lat >= bottom_right_lat and "
    # left and right longitudes, left long < right long, west to east
    "top_left_long <= @long <= bottom_right_long"
)

The above pandas.DataFrame.query() expression just does a simple geometric point containment test and assumes that your bounding boxes do not cross the anti-meridian (international dateline) nor overlap with either pole.
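As a small worked example of the query, here is a sketch run against a made-up two-box grid. Only the four columns the query needs are included; the box values are invented, and the position is the sample one from the question.

```python
import pandas as pd

# Hypothetical grid of two stacked boxes, using column names from the question.
boxes = pd.DataFrame({
    "top_left_lat": [29.0, 28.0],
    "top_left_long": [77.0, 77.0],
    "bottom_right_lat": [28.0, 27.0],
    "bottom_right_long": [78.0, 78.0],
})

# Sample position from the question.
lat, long = 28.6752213, 77.09311140000001

matching_rows = boxes.query(
    # north bound >= lat >= south bound
    "top_left_lat >= @lat >= bottom_right_lat and "
    # west bound <= long <= east bound
    "top_left_long <= @long <= bottom_right_long"
)
```

Here only the first box (index 0) contains the position, since its latitude range is 28.0 to 29.0 and its longitude range is 77.0 to 78.0.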

You'll have to do this for each position in your input dataframe; Pandas can't merge dataframes based on arbitrary expressions (yet). You could group your inputs by one of the two coordinates to produce a subset of rows that match that one coordinate, then further filter them on the second coordinate for each group.
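One way to run the test for each position is sketched below. It assumes the input dataframe has the Lat and Long columns shown in the question; the function name match_points_to_boxes and the sample data are hypothetical. It swaps the query() call for plain NumPy boolean masks so the bounds of B are extracted only once, but it is still one full pass over B per position.

```python
import pandas as pd

def match_points_to_boxes(points: pd.DataFrame, boxes: pd.DataFrame) -> list:
    """For each row of `points`, list the index labels of the boxes containing it."""
    # Pull the four boundaries out of B once, as NumPy arrays.
    north = boxes["top_left_lat"].to_numpy()
    south = boxes["bottom_right_lat"].to_numpy()
    west = boxes["top_left_long"].to_numpy()
    east = boxes["bottom_right_long"].to_numpy()

    matches = []
    for lat, long in zip(points["Lat"].to_numpy(), points["Long"].to_numpy()):
        # Vectorised containment test of one position against every box.
        inside = (south <= lat) & (lat <= north) & (west <= long) & (long <= east)
        matches.append(boxes.index[inside].tolist())
    return matches

# Made-up data: two stacked boxes and three positions.
boxes = pd.DataFrame({
    "top_left_lat": [29.0, 28.0],
    "top_left_long": [77.0, 77.0],
    "bottom_right_lat": [28.0, 27.0],
    "bottom_right_long": [78.0, 78.0],
})
points = pd.DataFrame({
    "Lat": [28.6752213, 27.5, 40.0],
    "Long": [77.09311140000001, 77.5, 77.5],
})
result = match_points_to_boxes(points, boxes)
```

A position outside every box gets an empty list, and a position on a shared edge is reported for both neighbouring boxes, because the comparisons are inclusive.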

If your input dataframe is also very large, then it may be better to use a database for such a join.
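As an illustration of the database route, here is a minimal sketch using an in-memory SQLite database from the standard library. The table names, the point_id and box_id labels, and the sample data are assumptions made for the example; any database that supports range conditions in a join would serve.

```python
import sqlite3
import pandas as pd

# Made-up data: two positions and two stacked boxes.
points = pd.DataFrame({"Lat": [28.6752213, 27.5], "Long": [77.0931114, 77.5]})
boxes = pd.DataFrame({
    "top_left_lat": [29.0, 28.0],
    "top_left_long": [77.0, 77.0],
    "bottom_right_lat": [28.0, 27.0],
    "bottom_right_long": [78.0, 78.0],
})

conn = sqlite3.connect(":memory:")
# Store each dataframe as a table, keeping its index as an id column.
points.to_sql("points", conn, index_label="point_id")
boxes.to_sql("boxes", conn, index_label="box_id")

# Join every position to the boxes whose bounds contain it.
joined = pd.read_sql_query(
    """
    SELECT p.point_id, b.box_id
    FROM points AS p
    JOIN boxes AS b
      ON p."Lat" BETWEEN b.bottom_right_lat AND b.top_left_lat
     AND p."Long" BETWEEN b.top_left_long AND b.bottom_right_long
    ORDER BY p.point_id, b.box_id
    """,
    conn,
)
conn.close()
```

Without an index the database still scans every box for every position, so for a million-row grid the gain would have to come from indexing the bounds, for example with a spatial index where the database offers one.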

Upvotes: 2
