Reputation: 15783
Suppose I have two DataFrame
s: XA
and XB
, for example each with 3 rows and 2 columns:
import pandas as pd
XA = pd.DataFrame({
'x1': [1, 2, 3],
'x2': [4, 5, 6]
})
XB = pd.DataFrame({
'x1': [8, 7, 6],
'x2': [5, 4, 3]
})
For each record in XA
, I want to find the nearest record (e.g. based on Euclidean distance) in XB
, and also the corresponding distance. For example this might return a DataFrame
indexed on id_A
, and with columns for id_B
and distance
.
How can I do this most efficiently?
Upvotes: 1
Views: 66
Reputation: 15783
Modifying this answer to avoid a full distance matrix, you can find the nearest record and distance for each row in XA
(nearest_record1()
), then call apply
to run through it on every row (nearest_record()
). This cuts run time by ~85% in a test.
from scipy.spatial.distance import cdist
def nearest_record1(XA1, XB):
"""Get the nearest record between XA1 and XB.
Args:
XA: Series.
XB: DataFrame.
Returns:
DataFrame with columns for id_B (from XB) and dist.
"""
dist = cdist(XA1.values.reshape(1, -1), XB)[0]
return pd.Series({'dist': np.amin(dist), 'id_B': np.argmin(dist)})
def nearest_record(XA, XB):
"""Get the nearest record in XA for each record in XB.
Args:
XA: DataFrame. Each record is matched against the nearest in XB.
XB: DataFrame.
Returns:
DataFrame with columns for id_A (from XA), id_B (from XB), and dist.
Each id_A maps to a single id_B, which is the nearest record from XB.
"""
res = XA.apply(lambda x: nearest_record1(x, XB), axis=1)
res['id_A'] = XA.index
# id_B is sometimes returned as an object.
res['id_B'] = res.id_B.astype(int)
# Reorder columns.
return res[['id_A', 'id_B', 'dist']]
This also returns the correct result:
nearest_record(XA, XB)
id_A id_B dist
0 0 2 5.099020
1 1 2 4.472136
2 2 2 4.242641
Upvotes: 0
Reputation: 15783
One way is to compute the full distance matrix, then melt
it and aggregate using nsmallest
, which returns the index along with the value:
from scipy.spatial.distance import cdist
def nearest_record(XA, XB):
"""Get the nearest record in XA for each record in XB.
Args:
XA: DataFrame. Each record is matched against the nearest in XB.
XB: DataFrame.
Returns:
DataFrame with columns for id_A (from XA), id_B (from XB), and dist.
Each id_A maps to a single id_B, which is the nearest record from XB.
"""
dist = pd.DataFrame(cdist(XA, XB)).reset_index().melt('index')
dist.columns = ['id_A', 'id_B', 'dist']
# id_B is sometimes returned as an object.
dist['id_B'] = dist.id_B.astype(int)
dist.reset_index(drop=True, inplace=True)
nearest = dist.groupby('id_A').dist.nsmallest(1).reset_index()
return nearest.set_index('level_1').join(dist.id_B).reset_index(drop=True)
This shows that id_B
2 is the nearest record to each of the three records in XA
:
nearest_record(XA, XB)
id_A dist id_B
0 0 5.099020 2
1 1 4.472136 2
2 2 4.242641 2
However, since this involves computing the full distance matrix, it will be slow or fail when XA
and XB
are large. An alternative that computes the nearest for each row would likely be faster.
Upvotes: 1