Get a coordinate distance matrix using Pandas without loops

Question

I am currently getting a distance matrix of coordinates from two data frames (ref_df and comp_df) using a nested for-loop over rows in both data frames, as shown below.

import geopy.distance
import pandas as pd

ref_df = pd.DataFrame({"grp_id":['M-00353','M-00353','M-00353','M-00538','M-00538','M-00160','M-00160','M-00160',
                  'M-00509','M-00509','M-00509','M-00509'],"name": ['B1','IIS','IISB I','BK',
                  'MM - BK','H(SL)','H(PKS SL)','PTH','ASSM 1','PKS SSM','SSM',
                  'Sukajadi Sawit Mekar 1'],"lat": [0.43462,0.43462,0.43462,1.74887222,1.74887222,-2.6081,
                   -2.6081,-2.6081, -2.378258,-2.378258,-2.378258,-2.378258],"long":[101.822603,101.822603,101.822603,101.3710944,101.3710944,
                   104.12525,104.12525,104.12525,112.542356,112.542356,112.542356,112.542356]})


comp_df = pd.DataFrame({"uml_id": ['PO1000000021','PO1000000054','PO1000000058','PO1000000106'],
                   "mill_name": ['PT IIS-BI','PT MM-BK','HL','PT SSM'],
                   "Latitude": [0.4344444,0.077043,-2.6081,-2.381111],"Longitude":[101.825,102.030838,104.12525,112.539722]})

matched_coords = []
for row in ref_df.index:
    mill_id = ref_df.get_value(row, "grp_id")
    mill_lat = ref_df.get_value(row, "lat")
    mill_long = ref_df.get_value(row, "long")

    for columns in comp_df.index:
        gm_id = comp_df.get_value(columns, "uml_id")
        gm_lat = comp_df.get_value(columns, "Latitude")
        gm_long = comp_df.get_value(columns, "Longitude")

        dist = geopy.distance.distance(
            (mill_lat, mill_long),
            (gm_lat, gm_long)).km 

        matched_coords.append([
            mill_id, mill_lat, mill_long,
            gm_id, gm_lat, gm_long, dist
        ])

# Convert to data frame        
mc_df = pd.DataFrame(matched_coords)
mc_df.columns = [
    'grp_id', 'grp_lat', 'grp_long',
    'match_id', 'match_lat', 'match_long', 'dist'
]

# Pivot to create wide data frame (matrix of distances)
mc_wide_df = mc_df.pivot_table(
    values="dist",
    index=["grp_id", "grp_lat","grp_long"],
    columns="match_id").reset_index()

However, I'd like to simplify the process and code by just creating a helper function using an apply on the data frames. My attempt below is not working. Is anybody able to help me figure out what's going wrong here.

# Test apply!
def get_coords_dist(x):
    dist = geopy.distance.distance((x['lat'],x['long']),(comp_df['Latitude'],comp_df['Longitude'])).km
    return pd.Series({comp_df.iloc[i[2]]['uml_id']: i for i in dist})

mc_df = ref_df.merge(ref_df.sort_values('grp_id').apply(get_coords_dist, axis=1), left_index=True, right_index=True)

Igor Raush · Accepted Answer

You're looking to perform a cross join between the two data frames ref_df and comp_df. One way to do this is to pd.merge on a dummy column.

def distance_km(x, y):
    return geopy.distance.distance(x, y).km

# it looks like your coordinates depend only on grp_id
ref_df_dd = ref_df.drop_duplicates(['grp_id', 'lat', 'long'])

# assign a dummy "_" column in both data frames, merge, and drop the dummy
# column afterwards
merged_df = pd.merge(
    ref_df_dd.assign(_=1),
    comp_df.assign(_=1),
).drop('_', axis=1)

# apply your distance function on (lat, long) tuples in the Cartesian product
merged_df['distance'] = list(
    map(distance_km,
        merged_df[['lat', 'long']].apply(tuple, 1),
        merged_df[['Latitude', 'Longitude']].apply(tuple, 1)))

# pivot table
merged_df.set_index(['grp_id', 'uml_id']).distance.unstack()

At this point merged_df looks like

uml_id   PO1000000021  PO1000000054  PO1000000058  PO1000000106
grp_id                                                         
M-00160    422.745678    377.461999      0.000000    936.147322
M-00353      0.267531     45.832819    422.922708   1232.700696
M-00509   1232.642382   1200.904305    936.449658      0.430525
M-00538    153.871840    198.911938    571.009484   1324.234511

which is pretty close to what you want.

Another solution (which is more transparent and 2x faster than the approach above) makes use of itertools.product.

from itertools import product

# create a data frame by iterating over row pairs in the Cartesian product
merged_df = pd.DataFrame([{
    'grp_id': r.grp_id,
    'uml_id': c.uml_id,
    'distance': distance_km((r.lat, r.long), (c.Latitude, c.Longitude))
} for r, c in product(ref_df_dd.itertuples(), comp_df.itertuples())])

# pivot table
merged_df.set_index(['grp_id', 'uml_id']).distance.unstack()

This gives the same merged_df as above.

Get a coordinate distance matrix using Pandas without loops

Answers (1)

Related Questions