Max Ghenis
Max Ghenis

Reputation: 15783

Nearest record and corresponding distance between each record in two DataFrames

Suppose I have two DataFrames: XA and XB, for example each with 3 rows and 2 columns:

import pandas as pd

XA = pd.DataFrame({
    'x1': [1, 2, 3],
    'x2': [4, 5, 6]
})

XB = pd.DataFrame({
    'x1': [8, 7, 6],
    'x2': [5, 4, 3]
})

For each record in XA, I want to find the nearest record (e.g. based on Euclidean distance) in XB, and also the corresponding distance. For example this might return a DataFrame indexed on id_A, and with columns for id_B and distance.

How can I do this most efficiently?

Upvotes: 1

Views: 66

Answers (2)

Max Ghenis
Max Ghenis

Reputation: 15783

Modifying this answer to avoid a full distance matrix, you can find the nearest record and distance for each row in XA (nearest_record1()), then call apply to run through it on every row (nearest_record()). This cuts run time by ~85% in a test.

from scipy.spatial.distance import cdist

def nearest_record1(XA1, XB):
    """Get the nearest record between XA1 and XB.

    Args:
        XA: Series.
        XB: DataFrame.

    Returns:
        DataFrame with columns for id_B (from XB) and dist.
    """
    dist = cdist(XA1.values.reshape(1, -1), XB)[0]
    return pd.Series({'dist': np.amin(dist), 'id_B': np.argmin(dist)})

def nearest_record(XA, XB):
    """Get the nearest record in XA for each record in XB.

    Args:
        XA: DataFrame. Each record is matched against the nearest in XB.
        XB: DataFrame.

    Returns:
        DataFrame with columns for id_A (from XA), id_B (from XB), and dist.
        Each id_A maps to a single id_B, which is the nearest record from XB.
    """
    res = XA.apply(lambda x: nearest_record1(x, XB), axis=1)
    res['id_A'] = XA.index
    # id_B is sometimes returned as an object.
    res['id_B'] = res.id_B.astype(int)
    # Reorder columns.
    return res[['id_A', 'id_B', 'dist']]

This also returns the correct result:

nearest_record(XA, XB)
    id_A    id_B        dist
0      0       2    5.099020
1      1       2    4.472136
2      2       2    4.242641

Upvotes: 0

Max Ghenis
Max Ghenis

Reputation: 15783

One way is to compute the full distance matrix, then melt it and aggregate using nsmallest, which returns the index along with the value:

from scipy.spatial.distance import cdist

def nearest_record(XA, XB):
    """Get the nearest record in XA for each record in XB.

    Args:
        XA: DataFrame. Each record is matched against the nearest in XB.
        XB: DataFrame.

    Returns:
        DataFrame with columns for id_A (from XA), id_B (from XB), and dist.
        Each id_A maps to a single id_B, which is the nearest record from XB.
    """
    dist = pd.DataFrame(cdist(XA, XB)).reset_index().melt('index')
    dist.columns = ['id_A', 'id_B', 'dist']
    # id_B is sometimes returned as an object.
    dist['id_B'] = dist.id_B.astype(int)
    dist.reset_index(drop=True, inplace=True)
    nearest = dist.groupby('id_A').dist.nsmallest(1).reset_index()
    return nearest.set_index('level_1').join(dist.id_B).reset_index(drop=True)

This shows that id_B 2 is the nearest record to each of the three records in XA:

nearest_record(XA, XB)

 id_A       dist id_B
0   0   5.099020    2
1   1   4.472136    2
2   2   4.242641    2

However, since this involves computing the full distance matrix, it will be slow or fail when XA and XB are large. An alternative that computes the nearest for each row would likely be faster.

Upvotes: 1

Related Questions