Reputation: 91

how to calculate distance from a data frame compared to another data frame?

let's say I have a data frame that consists of points :

df1:

x   y  z     label
1.1 2.1 3.1     2
4.1 5.1 6.1     1
7.1 8.1 9.1     0

and I also have another dataframe of points:

df2:

x  y  z   label
4  5  6    0
7  8  9    1
1  2  3    2

is there anyway to run through df one and see which point its the closest to inside of df2 and replace the label to the label of the point its closest to so..

the result I would like:

x   y    z  label
1.1 2.1 3.1   2
4.1 5.1 6.1   0
7.1 8.1 9.1   1

thanks for reading my question!

Upvotes: 2

Answers (4)

bart cubrich

Reputation: 1254

My first answer addresses the question as asked, but the OP wanted a generalized solution for any number of dimensions, not just three.

import numpy as np
import pandas as pd
from  sklearn.neighbors import KDTree


np.random.seed(0)
#since you have df1 and df2, you will want to convert the dfs to array here with
#X=df1['x'.'y','z'].to_numpy()
#Y=df2['x','y','z'.to_numpy()
n=11    #n=number of dimensions in your sample
X = np.random.random((10, n))  # 10 points in n dimensions
Y = np.random.random((10, n))
tree = KDTree(Y, leaf_size=2)  

indices=[]
#for i in range(len(X)):
    #loop though the x array and find the closest point in y to each x       
dist, ind = tree.query(X, k=1) 
#indices.append(ind)     
df1=pd.DataFrame(X)  
##set the labels to the closest point to each neighbor
df1['label']=ind

The result you want is now in df1, but you can't plot it easily, or interpret it without having a crazy brain. Proof of success based on 3d version also posted herein.

Upvotes: 0

bart cubrich

Reputation: 1254

Here is a version using kd-trees, which may be much faster for large datasets.

import numpy as np
import pandas as pd
from  sklearn.neighbors import KDTree
np.random.seed(0)
#since you have df1 and df2, you will want to convert the dfs to array here with
#X=df1['x'.'y','z'].to_numpy()
#Y=df2['x','y','z'].to_numpy()
X = np.random.random((10, 3))  # 10 points in 3 dimensions
Y = np.random.random((10, 3))
tree = KDTree(Y, leaf_size=2)  


#loop though the x array and find the closest point in y to each x  
#note the you can find as many as k nearest neighbors by this method
#though yours only calls for the k=1 case
dist, ind = tree.query(X, k=1) 

df1=pd.DataFrame(X, columns=['x','y','z']) 

#set the labels to the closest point to each neighbor
df1['label']=ind 

#this is cheesy, but it removes the list brackets 
#get rid of the following line if you want more than k=1 nearest neighbors
df1['label']=df1['label'].str.get(0).str.get(0)  
print(df1)

df1:
          x         y         z
0  0.548814  0.715189  0.602763
1  0.544883  0.423655  0.645894
2  0.437587  0.891773  0.963663
3  0.383442  0.791725  0.528895
4  0.568045  0.925597  0.071036
5  0.087129  0.020218  0.832620
6  0.778157  0.870012  0.978618
7  0.799159  0.461479  0.780529
8  0.118274  0.639921  0.143353
9  0.944669  0.521848  0.414662
df2:
          x         y         z
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096  0.616934  0.943748
3  0.681820  0.359508  0.437032
4  0.697631  0.060225  0.666767
5  0.670638  0.210383  0.128926
6  0.315428  0.363711  0.570197
7  0.438602  0.988374  0.102045
8  0.208877  0.161310  0.653108
9  0.253292  0.466311  0.244426

Out:
          x         y         z  label
0  0.548814  0.715189  0.602763      0
1  0.544883  0.423655  0.645894      6
2  0.437587  0.891773  0.963663      2
3  0.383442  0.791725  0.528895      0
4  0.568045  0.925597  0.071036      7
5  0.087129  0.020218  0.832620      8
6  0.778157  0.870012  0.978618      2
7  0.799159  0.461479  0.780529      2
8  0.118274  0.639921  0.143353      9
9  0.944669  0.521848  0.414662      3

Here is an image you can use to vet the results. The blue points are x points, and the orange are y points.

here is the code for the plot, using matplotlib version 3.0.2

fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:,0],X[:,1],X[:,2])
ax.scatter(Y[:,0],Y[:,1],Y[:,2])
for i in range(len(X)): #plot each point + it's index as text above
    ax.text(X[i,0],X[i,1],X[i,2],  '%s' % (str(i)), size=20, zorder=1, color='blue') 
for i in range(len(Y)): #plot each point + it's index as text above
    ax.text(Y[i,0],Y[i,1],Y[i,2],  '%s' % (str(i)), size=20, zorder=1, color='orange')

Upvotes: 2

Esteban Luques

Reputation: 43

SELECT ABS($df1 - $df2) as nearest, ...
FROM yourtable
ORDER BY nearest ASC
LIMIT 1

order them by 'X' index and then compare the $result arrays this would look for the nearest number between the tables.

https://www.w3schools.com/sql/func_sqlserver_abs.asp the ABS function returns an absolute number so it will be a good solution as long as you have entire numbers on df2.

hope it helps.

Upvotes: 0

BENY

Reputation: 323226

I can only think of distance from scipy

from scipy.spatial import distance
df1['label']=df2.label.iloc[distance.cdist(df1.iloc[:,:-1], df2.iloc[:,:-1], metric='euclidean').argmin(1)].values
df1
Out[446]: 
     x    y    z  label
0  1.1  2.1  3.1      2
1  4.1  5.1  6.1      0
2  7.1  8.1  9.1      1

Upvotes: 1

how to calculate distance from a data frame compared to another data frame?

Answers (4)

Related Questions