sdave

Reputation: 568

Efficiently Finding Nearest Locations from Large Datasets in Python

I am working on a project where I need to find the three nearest geographical locations (from a dataset of 1,000 entries) for each entry in another large dataset of 500,000 addresses. Both datasets include latitude and longitude coordinates. I've been using Python with geopy and pandas, but given the size of the datasets I'm not sure what the best approach would be.

Here is a simplified version of my code (reproducible):

import pandas as pd
from geopy.distance import geodesic

# Sample data setup
data = {
    "Zip Code": ["10115", "20095", "50667", "80331", "70173", "40210", "41460", "45127", "47051", "40474"],
    "City": ["Berlin", "Hamburg", "Cologne", "Munich", "Stuttgart", "Düsseldorf", "Neuss", "Essen", "Duisburg", "Düsseldorf Airport"],
    "Latitude": [52.5323, 53.5503, 50.9367, 48.1372, 48.7833, 51.2217, 51.1981, 51.4556, 51.4344, 51.2895],
    "Longitude": [13.3846, 9.9930, 6.9540, 11.5755, 9.1815, 6.7763, 6.6913, 7.0116, 6.7623, 6.7668]
}
df = pd.DataFrame(data)

data_df2 = {
    "Address": ["my_location"],
    "ZipCode": ["40468"],
    "Latitude": [51.28472232436951],
    "Longitude": [6.7865234073914005]
}
df2 = pd.DataFrame(data_df2)

# Distance calculation
def calculate_distance(lat1, lon1, lat2, lon2):
    return geodesic((lat1, lon1), (lat2, lon2)).kilometers

def find_nearest_spots(address_lat, address_lon):
    distances = df.apply(lambda row: calculate_distance(address_lat, address_lon, row['Latitude'], row['Longitude']), axis=1)
    nearest_indices = distances.nsmallest(3).index
    nearest_spots = df.loc[nearest_indices]
    return pd.Series({
        'Nearest_1_Spot': nearest_spots.iloc[0]['City'],
        'Nearest_1_Dist': distances.loc[nearest_indices[0]],
        'Nearest_2_Spot': nearest_spots.iloc[1]['City'],
        'Nearest_2_Dist': distances.loc[nearest_indices[1]],
        'Nearest_3_Spot': nearest_spots.iloc[2]['City'],
        'Nearest_3_Dist': distances.loc[nearest_indices[2]]
    })

df2[['Nearest_1_Spot', 'Nearest_1_Dist', 'Nearest_2_Spot', 'Nearest_2_Dist', 'Nearest_3_Spot', 'Nearest_3_Dist']] = df2.apply(
    lambda row: find_nearest_spots(row['Latitude'], row['Longitude']), axis=1)

If needed, I can also leverage free GPU libraries (if any exist).

Thanks for any guidance.

Upvotes: 1

Views: 89

Answers (1)

raphael

Reputation: 2980

As mentioned in the comment, I'd suggest using a KDTree to identify nearest neighbours.

Here's a quick example of how you can query the 3 nearest neighbours of 1,000 points from a list of 500k points:

from scipy.spatial import KDTree
import numpy as np

# setup a search-tree for locations
lons, lats = np.random.normal(0, 50, 500000), np.random.normal(0, 50, 500000)
tree = KDTree(np.column_stack([lons, lats]))

# query points
q_lons, q_lats = np.random.normal(0, 50, 1000), np.random.normal(0, 50, 1000)
dist, idx = tree.query(np.column_stack([q_lons, q_lats]), k=3, distance_upper_bound=1)

# get mask for points without valid neighbours
mask = idx == tree.n

# extract neighbour coordinates
# (use mode="wrap" because points without a valid neighbour are flagged with an out-of-range index)
neighbour_lons = np.take(lons, idx, mode="wrap")
neighbour_lons[mask] = np.nan
neighbour_lats = np.take(lats, idx, mode="wrap")
neighbour_lats[mask] = np.nan

Plot results

import matplotlib.pyplot as plt

# to color points
c = q_lons**2 + q_lats**2
c = plt.cm.tab20(c/c.max())

f, ax = plt.subplots()
ax.scatter(lons, lats, c="k", s=5)
ax.scatter(q_lons, q_lats, marker="o", s=70, c=c)
ax.scatter(neighbour_lons.ravel(), neighbour_lats.ravel(), 
           marker="x", s=70, c=c.repeat(3, axis=0))

(plot of the query points, their three nearest neighbours, and the full point set)
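
Applied to your setup, a minimal sketch could look like the following (assuming, as in your example, that df holds the ~1,000 reference locations and df2 holds the addresses to look up; note that KDTree measures plain Euclidean distance on the raw latitude/longitude values, so the returned distances are in degrees and only approximate the geodesic ordering):

from scipy.spatial import KDTree
import numpy as np

# build the tree once from the reference locations (df from the question)
tree = KDTree(df[["Latitude", "Longitude"]].to_numpy())

# query the 3 nearest reference locations for every address in df2 at once
dist, idx = tree.query(df2[["Latitude", "Longitude"]].to_numpy(), k=3)

# attach results; dist is in degrees, not kilometres
for i in range(3):
    df2[f"Nearest_{i+1}_Spot"] = df["City"].to_numpy()[idx[:, i]]
    df2[f"Nearest_{i+1}_Dist_deg"] = dist[:, i]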

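If you want the ranking and distances to be based on great-circle distance rather than Euclidean distance on degrees, one alternative (a sketch, assuming scikit-learn is installed) is a BallTree with the haversine metric, which expects coordinates in radians and returns angular distances you can scale by the Earth's radius:

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

# haversine expects [latitude, longitude] in radians
ref_rad = np.radians(df[["Latitude", "Longitude"]].to_numpy())
query_rad = np.radians(df2[["Latitude", "Longitude"]].to_numpy())

tree = BallTree(ref_rad, metric="haversine")
dist, idx = tree.query(query_rad, k=3)

# convert angular distances to kilometres
dist_km = dist * EARTH_RADIUS_KM

Haversine assumes a spherical Earth, so the distances differ slightly (by a fraction of a percent) from geopy's geodesic values, but the neighbour ranking is usually the same.
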
Upvotes: 1
