Reputation: 47
I'm looking for an efficient way to apply a formula using variables from a single row of one dataframe (df1) against every row in another datframe (df2), then find the minimum value of this operation and store the row from df2 in which this minimum value occured as a new dataframe (df3). example input/output is given.
df1
Index X1 Y1 Z1
1 3 6 4
2 7 2 1
3 4 7 3
df2
Index X2 Y2 Z2
1 2 4 1
2 5 3 2
3 7 1 5
Formuala to apply:
d = math.sqrt((X2-X1)**2 + (Y2-Y1)**2 + (Z2-Z1)**2)
If this formula was to be applied iteratively to df2, where (X1, Y1, Z1) is from df1 row1 and (X2, Y2, Z2) are taken from each row in df2 to give.
[out]
Index d
1 3.741
2 4.123
3 6.481
Since (X2, Y2, Z2) in df2 row 1 provided the lowest d value, this row would be saved into df3 and the process then repeated for each row in df1.
df3
Index X2 Y2 Z2
1 2 4 1
*Note, df1 and df2 are of different length. Apologies if this question seems long-winded, I'm just trying to be as clear as possible.
Upvotes: 1
Views: 220
Reputation: 18306
scipy.spatial.distance.cdist
can be used for this with its default Euclidean distance metric:
from scipy.spatial.distance import cdist
df3 = df2.iloc[cdist(df1, df2).argmin(axis=1)]
cdist
gives back a (n1, n2)
shaped array where n1
and n2
are number of rows in df1
and df2
respectively. Then we look at the index of the minimum distance for each row to see which row of df2
gave rise to it. iloc
then selects these from df2
,
to get
>>> df3
X2 Y2 Z2
Index
1 2 4 1
2 5 3 2
1 2 4 1
the intermediate results:
>>> cdist(df1, df2)
# first row is your calculations in the question, for example
array([[3.74165739, 4.12310563, 6.4807407 ],
[5.38516481, 2.44948974, 4.12310563],
[4.12310563, 4.24264069, 7. ]])
>>> cdist(df1, df2).argmin(axis=1)
array([0, 1, 0], dtype=int64)
i.e., for 0th and 2nd row of df1
, 0th row of df2
is selected; for 1st row of df1
, 1st row of df2
is selected (0 indexed).
# will keep the minimum distance rows' indices
min_inds = []
# foreach row of `df1`...
for row1 in df1.values:
# these will keep track of min so-far
min_dist = np.inf
min_ind = None
# foreach row of `df2`...
for j, row2 in enumerate(df2.values):
# squared distance
dist = ((row1 - row2) ** 2).sum()
# is less than minimum so far?
if dist < min_dist:
# then update min distance and index
min_dist = dist
min_ind = j
# one row of `df1` finished; save its corresponding row's index
min_inds.append(min_ind)
# Now we form `df3` with `iloc` as before
df3 = df2.iloc[min_inds]
which gives the same result but could be more memory-efficient.
Upvotes: 1