Reputation: 91
let's say I have a data frame that consists of points :
df1:
x y z label
1.1 2.1 3.1 2
4.1 5.1 6.1 1
7.1 8.1 9.1 0
and I also have another dataframe of points:
df2:
x y z label
4 5 6 0
7 8 9 1
1 2 3 2
is there anyway to run through df one and see which point its the closest to inside of df2 and replace the label to the label of the point its closest to so..
the result I would like:
x y z label
1.1 2.1 3.1 2
4.1 5.1 6.1 0
7.1 8.1 9.1 1
thanks for reading my question!
Upvotes: 2
Views: 732
Reputation: 1254
My first answer addresses the question as asked, but the OP wanted a generalized solution for any number of dimensions, not just three.
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree
np.random.seed(0)
#since you have df1 and df2, you will want to convert the dfs to array here with
#X=df1['x'.'y','z'].to_numpy()
#Y=df2['x','y','z'.to_numpy()
n=11 #n=number of dimensions in your sample
X = np.random.random((10, n)) # 10 points in n dimensions
Y = np.random.random((10, n))
tree = KDTree(Y, leaf_size=2)
indices=[]
#for i in range(len(X)):
#loop though the x array and find the closest point in y to each x
dist, ind = tree.query(X, k=1)
#indices.append(ind)
df1=pd.DataFrame(X)
##set the labels to the closest point to each neighbor
df1['label']=ind
The result you want is now in df1, but you can't plot it easily, or interpret it without having a crazy brain. Proof of success based on 3d version also posted herein.
Upvotes: 0
Reputation: 1254
Here is a version using kd-trees, which may be much faster for large datasets.
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree
np.random.seed(0)
#since you have df1 and df2, you will want to convert the dfs to array here with
#X=df1['x'.'y','z'].to_numpy()
#Y=df2['x','y','z'].to_numpy()
X = np.random.random((10, 3)) # 10 points in 3 dimensions
Y = np.random.random((10, 3))
tree = KDTree(Y, leaf_size=2)
#loop though the x array and find the closest point in y to each x
#note the you can find as many as k nearest neighbors by this method
#though yours only calls for the k=1 case
dist, ind = tree.query(X, k=1)
df1=pd.DataFrame(X, columns=['x','y','z'])
#set the labels to the closest point to each neighbor
df1['label']=ind
#this is cheesy, but it removes the list brackets
#get rid of the following line if you want more than k=1 nearest neighbors
df1['label']=df1['label'].str.get(0).str.get(0)
print(df1)
df1:
x y z
0 0.548814 0.715189 0.602763
1 0.544883 0.423655 0.645894
2 0.437587 0.891773 0.963663
3 0.383442 0.791725 0.528895
4 0.568045 0.925597 0.071036
5 0.087129 0.020218 0.832620
6 0.778157 0.870012 0.978618
7 0.799159 0.461479 0.780529
8 0.118274 0.639921 0.143353
9 0.944669 0.521848 0.414662
df2:
x y z
0 0.264556 0.774234 0.456150
1 0.568434 0.018790 0.617635
2 0.612096 0.616934 0.943748
3 0.681820 0.359508 0.437032
4 0.697631 0.060225 0.666767
5 0.670638 0.210383 0.128926
6 0.315428 0.363711 0.570197
7 0.438602 0.988374 0.102045
8 0.208877 0.161310 0.653108
9 0.253292 0.466311 0.244426
Out:
x y z label
0 0.548814 0.715189 0.602763 0
1 0.544883 0.423655 0.645894 6
2 0.437587 0.891773 0.963663 2
3 0.383442 0.791725 0.528895 0
4 0.568045 0.925597 0.071036 7
5 0.087129 0.020218 0.832620 8
6 0.778157 0.870012 0.978618 2
7 0.799159 0.461479 0.780529 2
8 0.118274 0.639921 0.143353 9
9 0.944669 0.521848 0.414662 3
Here is an image you can use to vet the results. The blue points are x points, and the orange are y points.
here is the code for the plot, using matplotlib version 3.0.2
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:,0],X[:,1],X[:,2])
ax.scatter(Y[:,0],Y[:,1],Y[:,2])
for i in range(len(X)): #plot each point + it's index as text above
ax.text(X[i,0],X[i,1],X[i,2], '%s' % (str(i)), size=20, zorder=1, color='blue')
for i in range(len(Y)): #plot each point + it's index as text above
ax.text(Y[i,0],Y[i,1],Y[i,2], '%s' % (str(i)), size=20, zorder=1, color='orange')
Upvotes: 2
Reputation: 43
SELECT ABS($df1 - $df2) as nearest, ...
FROM yourtable
ORDER BY nearest ASC
LIMIT 1
order them by 'X' index and then compare the $result arrays this would look for the nearest number between the tables.
https://www.w3schools.com/sql/func_sqlserver_abs.asp the ABS function returns an absolute number so it will be a good solution as long as you have entire numbers on df2.
hope it helps.
Upvotes: 0
Reputation: 323226
I can only think of distance
from scipy
from scipy.spatial import distance
df1['label']=df2.label.iloc[distance.cdist(df1.iloc[:,:-1], df2.iloc[:,:-1], metric='euclidean').argmin(1)].values
df1
Out[446]:
x y z label
0 1.1 2.1 3.1 2
1 4.1 5.1 6.1 0
2 7.1 8.1 9.1 1
Upvotes: 1