Pandas apply formula to each row and find minimum

Question

I'm looking for an efficient way to apply a formula using variables from a single row of one dataframe (df1) against every row in another datframe (df2), then find the minimum value of this operation and store the row from df2 in which this minimum value occured as a new dataframe (df3). example input/output is given.

df1                
Index   X1  Y1  Z1      
1       3   6   4        
2       7   2   1       
3       4   7   3       

df2
Index   X2  Y2  Z2
1       2   4   1
2       5   3   2
3       7   1   5

Formuala to apply:

d = math.sqrt((X2-X1)**2 + (Y2-Y1)**2 + (Z2-Z1)**2)

If this formula was to be applied iteratively to df2, where (X1, Y1, Z1) is from df1 row1 and (X2, Y2, Z2) are taken from each row in df2 to give.

[out]
Index  d
1      3.741
2      4.123
3      6.481

Since (X2, Y2, Z2) in df2 row 1 provided the lowest d value, this row would be saved into df3 and the process then repeated for each row in df1.

df3
Index  X2  Y2  Z2  
1      2   4   1

*Note, df1 and df2 are of different length. Apologies if this question seems long-winded, I'm just trying to be as clear as possible.

Mustafa Aydın · Accepted Answer

scipy.spatial.distance.cdist can be used for this with its default Euclidean distance metric:

from scipy.spatial.distance import cdist

df3 = df2.iloc[cdist(df1, df2).argmin(axis=1)]

cdist gives back a (n1, n2) shaped array where n1 and n2 are number of rows in df1 and df2 respectively. Then we look at the index of the minimum distance for each row to see which row of df2 gave rise to it. iloc then selects these from df2,

to get

>>> df3

       X2  Y2  Z2
Index
1       2   4   1
2       5   3   2
1       2   4   1

the intermediate results:

>>> cdist(df1, df2)

# first row is your calculations in the question, for example
array([[3.74165739, 4.12310563, 6.4807407 ],
       [5.38516481, 2.44948974, 4.12310563],
       [4.12310563, 4.24264069, 7.        ]])

>>> cdist(df1, df2).argmin(axis=1)

array([0, 1, 0], dtype=int64)

i.e., for 0th and 2nd row of df1, 0th row of df2 is selected; for 1st row of df1, 1st row of df2 is selected (0 indexed).

As you remind the time-memory tradeoff, here is a for loop implementation:

# will keep the minimum distance rows' indices
min_inds = []

# foreach row of `df1`...
for row1 in df1.values:
    # these will keep track of min so-far
    min_dist = np.inf
    min_ind = None
    # foreach row of `df2`...
    for j, row2 in enumerate(df2.values):
        # squared distance
        dist = ((row1 - row2) ** 2).sum()
        # is less than minimum so far?
        if dist < min_dist:
            # then update min distance and index
            min_dist = dist
            min_ind = j
    # one row of `df1` finished; save its corresponding row's index
    min_inds.append(min_ind)

# Now we form `df3` with `iloc` as before
df3 = df2.iloc[min_inds]

which gives the same result but could be more memory-efficient.

Pandas apply formula to each row and find minimum

Answers (1)

Related Questions