I need to compute the distances between points (stored as latitude and longitude) held in two separate dataframes.
Specifically, one has 15,000 building locations as lat/long, and the other has 50,000 doctors' locations as lat/long.
The goal would be to get the doctors within a radius of X km of each building.
My first thought was to use vectorized functions. After some research, I selected GeoPandas, and by using:
...this works marvels and runs fast.
The only problem is that the radius (X) is variable. The only solution I have found is to create multiple GeoDataFrames, one per value of X, which is inefficient. Instead, I'd like to calculate the distance (in meters) between each pair of points in (dataframe1, dataframe2), and then filter the resulting dataframe on the distance column.
I assume there is a vectorized function for fast distance calculation between two series, but the documentation for "distance()" and the other gpd functions did not turn up anything that runs in a vectorized manner the way sjoin() does. There are functions to find the nearest point, but that defeats the purpose of finding all doctors within the X km radius.
The only code I have for the distance calculation builds a Cartesian product, which is inefficient.
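For illustration, one way to get all pairwise distances in a vectorized manner is to skip GeoPandas for this step and apply the haversine formula with NumPy broadcasting. This is only a sketch under stated assumptions: it treats the Earth as a sphere with a mean radius of 6,371,008.8 m, so results differ slightly from an ellipsoidal geodesic distance, and the function name `haversine_matrix` is made up for this example.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in meters (spherical model)

def haversine_matrix(lat1, lon1, lat2, lon2):
    """Return a (len(lat1), len(lat2)) array of great-circle distances in meters."""
    # Column vectors for the first set, row vectors for the second set,
    # so that arithmetic broadcasts to every pair.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoots before the square root.
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0, 1)))

# Two buildings against one doctor, using coordinates like those in the test data
dist = haversine_matrix([48.8566, 30.7128], [2.3522, 2.3522], [48.8588], [2.2944])
within_10km = dist < 10_000  # boolean mask, one row per building
```

Once the matrix exists, changing X is just a different comparison on `dist`. Note that a full 15,000 by 50,000 matrix holds 750 million values (about 6 GB as float64), so at that size it would likely have to be computed in chunks.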
import geopandas as gpd
from geopy.distance import geodesic

# Test data with 6000 pdv (buildings) and 200 pds (doctors)
data_pdv = {'pdv': range(1, 6001),
            'latitude': [48.8566] * 3000 + [30.7128] * 3000,
            'longitude': [2.3522] * 6000}
data_pds = {'pds': range(1, 201),
            'latitude': [48.8588] * 200,
            'longitude': [2.2944] * 200}
# Convert data to GeoDataFrames
gdf_pdv = gpd.GeoDataFrame(data_pdv, geometry=gpd.points_from_xy(data_pdv['longitude'], data_pdv['latitude']))
gdf_pds = gpd.GeoDataFrame(data_pds, geometry=gpd.points_from_xy(data_pds['longitude'], data_pds['latitude']))
# Create a Cartesian product of my pdv and pds (all combinations of rows)
cartesian_product = gdf_pdv.assign(key=1).merge(gdf_pds.assign(key=1), on='key').drop(columns='key')
# Calculate the geodesic distances
cartesian_product['distance'] = cartesian_product.apply(
    lambda row: geodesic((row['latitude_x'], row['longitude_x']), (row['latitude_y'], row['longitude_y'])).meters,
    axis=1
)
# Filter distances < 10km
output_df = cartesian_product[cartesian_product['distance'] < 10000]
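As a possible alternative to the row-wise `apply` above, the sketch below processes buildings in chunks, computes haversine distances with NumPy, and keeps only the pairs within a maximum radius. It is an illustration under assumptions, not a drop-in replacement: it uses a spherical Earth (mean radius 6,371,008.8 m) rather than geopy's ellipsoidal geodesic, and the function name `pairs_within` and the `chunk` size are hypothetical choices for this example.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in meters (spherical model)

def pairs_within(b_lat, b_lon, d_lat, d_lon, max_radius_m, chunk=1000):
    """Return (building_idx, doctor_idx, distance_m) for all pairs within max_radius_m."""
    b_lat = np.radians(np.asarray(b_lat, dtype=float))
    b_lon = np.radians(np.asarray(b_lon, dtype=float))
    d_lat = np.radians(np.asarray(d_lat, dtype=float))[None, :]
    d_lon = np.radians(np.asarray(d_lon, dtype=float))[None, :]
    out_b, out_d, out_dist = [], [], []
    # Only one (chunk x n_doctors) block of distances is held in memory at a time.
    for start in range(0, len(b_lat), chunk):
        lat = b_lat[start:start + chunk, None]
        lon = b_lon[start:start + chunk, None]
        a = (np.sin((d_lat - lat) / 2) ** 2
             + np.cos(lat) * np.cos(d_lat) * np.sin((d_lon - lon) / 2) ** 2)
        dist = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0, 1)))
        rows, cols = np.nonzero(dist <= max_radius_m)
        out_b.append(rows + start)  # shift back to positions in the full building array
        out_d.append(cols)
        out_dist.append(dist[rows, cols])
    return np.concatenate(out_b), np.concatenate(out_d), np.concatenate(out_dist)

# Scaled-down version of the test data: 6 buildings, 2 doctors
b_lat = [48.8566] * 3 + [30.7128] * 3
b_lon = [2.3522] * 6
d_lat = [48.8588] * 2
d_lon = [2.2944] * 2
b_idx, d_idx, dist_m = pairs_within(b_lat, b_lon, d_lat, d_lon, max_radius_m=50_000, chunk=4)

# Any radius X up to the maximum is now a cheap mask on the kept pairs
mask_5km = dist_m < 5_000
```

The returned indices are positions in the input arrays, so they can be mapped back to the pdv and pds identifiers. If each building had its own radius stored in an array (say a hypothetical `radius_m`), the filter could be written as `dist_m < radius_m[b_idx]`.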