Euclidean Distance over 2 dataframes

Question

I have 2 Dataframes

DF1-

             Name        X        Y
0      Astonished    0.430    0.890
1         Excited    0.700    0.720
2       Expectant    0.320    0.067
3      Passionate    0.333    0.127

[47 rows * 3 columns]

DF2-

           Id        X         Y
0          1 -0.288453  0.076105
1          4 -0.563453 -0.498895
2          5 -0.788453 -0.673895
3          6 -0.063453 -0.373895
4          7  0.311547  0.376105

[767 rows * 3 columns]

Now what I want to achieve is - Take the X,Y from first entry from DF2, iterate it over DF1, calculate Euclidean Distance between each value of X,Y in DF2.

Find the minimum of all the Euclidean Distance obtained between the two points, save the minimum result somewhere along with the corresponding entry under the name column.

Example- Say for any tuple of X,Y in DF2, the minimum Euclidean distance is corresponding to the X,Y value in the row 0 of DF1, then the result should be, the distance and name Astonished.

My Attempt-

import pandas as pd
import numpy as np
import csv

mood = pd.read_csv("C:/Users/Desktop/DF1.csv")
song_value = pd.read_csv("C:/Users/Desktop/DF2.csv")

df_temp = mood.loc[:, ['Arousal','Valence']]

df_temp1 = song_value.loc[:, ['Arousal','Valence']]


import scipy
from scipy import spatial
ary = scipy.spatial.distance.cdist(mood.loc[:, ['Arousal','Valence']], song_value.loc[:, ['Arousal','Valence']], metric='euclidean')
print (ary)

Result Obtained -

[[1.08563344 1.70762362 1.98252253 ... 0.64569366 0.47426051 0.83656989]
 [1.17967807 1.75556794 2.03922435 ... 0.59326275 0.2469077  0.79334076]
 [0.60852124 1.04915517 1.33326431 ... 0.1848471  0.53293637 0.08394834]
 ...
 [1.26151359 1.5500629  1.81168766 ... 0.74070027 0.70209658 0.75277205]
 [0.69085994 1.03764923 1.31608627 ... 0.33265268 0.61928227 0.21397822]
 [0.84484398 1.11428893 1.38222899 ... 0.48330291 0.69288125 0.3886008 ]]

I have no clue how I should proceed now. Please suggest something.

EDIT - 1

I converted the array in another data frame using

new_series = pd.DataFrame(ary)
print (new_series)

Result -

         0         1         2    ...       764       765       766
0   1.085633  1.707624  1.982523  ...  0.645694  0.474261  0.836570
1   1.179678  1.755568  2.039224  ...  0.593263  0.246908  0.793341
2   0.608521  1.049155  1.333264  ...  0.184847  0.532936  0.083948
3   0.623534  1.093331  1.378075  ...  0.124156  0.479393  0.109057
4   0.791926  1.352785  1.636748  ...  0.197403  0.245908  0.398619
5   0.740038  1.260768  1.545785  ...  0.092072  0.304926  0.281791
6   0.923284  1.523395  1.803676  ...  0.415540  0.293217  0.611312
7   1.202447  1.679660  1.962823  ...  0.554256  0.247391  0.703298
8   0.824898  1.343684  1.628727  ...  0.177560  0.222666  0.360980
9   1.191411  1.604942  1.883150  ...  0.570771  0.395957  0.668736
10  0.822236  1.456863  1.708469  ...  0.706252  0.787271  0.823542
11  0.741683  1.371996  1.618916  ...  0.704496  0.835235  0.798964
12  0.346244  0.967891  1.240839  ...  0.376504  0.715617  0.359700
13  0.526096  1.163209  1.421820  ...  0.520190  0.748265  0.579333
14  0.435992  0.890291  1.083229  ...  0.937048  1.254437  0.884499
15  0.600338  1.162469  1.375755  ...  0.876228  1.116301  0.891714
16  0.634254  1.059083  1.226407  ...  1.088393  1.373536  1.058550
17  0.712227  1.284502  1.498187  ...  0.917272  1.117806  0.956957
18  0.194387  0.799728  1.045745  ...  0.666713  1.013563  0.597524
19  0.456000  0.708741  0.865870  ...  1.068296  1.420654  0.973234
20  0.633776  0.632060  0.709202  ...  1.277083  1.645173  1.157765
21  0.192291  0.597749  0.826602  ...  0.831713  1.204117  0.716746
22  0.522033  0.526969  0.645998  ...  1.170316  1.546040  1.041762
23  0.668148  0.504480  0.547920  ...  1.316602  1.698041  1.176933
24  0.718440  0.285718  0.280984  ...  1.334008  1.727796  1.166364
25  0.759187  0.265412  0.217165  ...  1.362786  1.757580  1.190132
26  0.598326  0.113459  0.380513  ...  1.087573  1.479296  0.896239
27  0.676841  0.263613  0.474246  ...  1.074911  1.456515  0.875707
28  0.865641  0.365394  0.462742  ...  1.239941  1.612779  1.038790
29  0.463623  0.511737  0.786284  ...  0.719525  1.099122  0.519226
30  0.780386  0.550483  0.750532  ...  0.987863  1.336760  0.788449
31  1.077559  0.711697  0.814205  ...  1.274933  1.602953  1.079529
32  1.020408  0.497152  0.522999  ...  1.372444  1.736938  1.170889
33  0.963911  0.367018  0.336035  ...  1.398444  1.778496  1.198905
34  1.092763  0.759612  0.873457  ...  1.256086  1.574565  1.063570
35  0.903631  0.810449  1.018501  ...  0.921287  1.219046  0.740134
36  0.728728  0.795942  1.045868  ...  0.695317  1.009043  0.512147
37  0.738314  0.600405  0.822742  ...  0.895225  1.239125  0.697393
38  1.206901  1.151385  1.343654  ...  1.064721  1.273002  0.922962
39  1.248530  1.293525  1.508517  ...  0.988508  1.137608  0.880669
40  0.988777  1.205968  1.463036  ...  0.622495  0.776919  0.541414
41  0.941001  1.043940  1.285215  ...  0.732293  0.960420  0.595174
42  1.242508  1.321327  1.544222  ...  0.947970  1.080069  0.851396
43  1.262534  1.399453  1.633948  ...  0.900340  0.989603  0.830024
44  1.261514  1.550063  1.811688  ...  0.740700  0.702097  0.752772
45  0.690860  1.037649  1.316086  ...  0.332653  0.619282  0.213978
46  0.844844  1.114289  1.382229  ...  0.483303  0.692881  0.388601

[47 rows x 767 columns]

Moreover, is this the best approach? Sorry, but am not sure, that's why am putting this up.

nimbous · Accepted Answer

Say df_1 and df_2 are your dataframes, first extract your pairs as shown below:

pairs_1 = list(tuple(zip(df_1.X, df_1.Y)))
pairs_2 = list(tuple(zip(df_2.X, df_2.Y)))

Then iterate over pairs as per your use case and get the index of minimum distance for the iterated points:

from scipy import spatial
min_distances = []
closest_pairs = []
names = []
for i in pairs_2:
    min_dist = scipy.spatial.distance.cdist([i], pairs_1, metric='euclidean').min()
    index_min = scipy.spatial.distance.cdist([i], pairs_1, metric='euclidean').argmin()
    min_distances.append(min_dist)
    closest_pairs.append(df_1.loc[index_min, ['X', 'Y']])
    names.append(df_1.loc[index_min, 'Name'])

Insert results to df_2:

df_2['min_distance'] = min_distances
df_2['closest_pairs'] = [tuple(i.values) for i in closest_pairs]
df_2['name'] = names

df_2

Output:

Id  X   Y   min_distance    closest_pairs   name
0   1   -0.288453   0.076105    0.608521    (0.32, 0.067)   Expectant
1   4   -0.563453   -0.498895   1.049155    (0.32, 0.067)   Expectant
2   5   -0.788453   -0.673895   1.333264    (0.32, 0.067)   Expectant
3   6   -0.063453   -0.373895   0.584316    (0.32, 0.067)   Expectant
4   7   0.311547    0.376105    0.250027    (0.33, 0.127)   Passionate

I have added min_distance and closest_pairs as well, you can exclude these columns if you want to.

Euclidean Distance over 2 dataframes

Answers (1)

Related Questions