Reputation: 935
I have these two DataFrames:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Points':[0,1,2,3],'Axis1':[1,2,2,3], 'Axis2':[4,2,3,0],'ClusterId':[1,2,2,3]})
df
   Points  Axis1  Axis2  ClusterId
0       0      1      4          1
1       1      2      2          2
2       2      2      3          2
3       3      3      0          3
Neighbour = pd.DataFrame()
Neighbour['Points'] = df['Points']
Neighbour['Closest'] = np.nan
Neighbour['Distance'] = np.nan
Neighbour
   Points  Closest  Distance
0       0      NaN       NaN
1       1      NaN       NaN
2       2      NaN       NaN
3       3      NaN       NaN
I would like the Closest column to contain the closest point that is NOT in the same cluster (ClusterId in df), based on the following distance function, applied to Axis1 and Axis2:
from math import sqrt

def distance(x1, y1, x2, y2):
    dist = sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist
And I would like the Distance column to contain the distance between each point and its closest point.
The following script works, but I think it is really not the best way to do it in Python:
for i in range(len(Neighbour['Points'])):
    bestD = -1  # best distance
    # bestP for best point
    for ii in range(len(Neighbour['Points'])):
        if df.loc[i, "ClusterId"] != df.loc[ii, "ClusterId"]:  # if not in the same cluster
            dist = distance(df.iloc[i, 1], df.iloc[i, 2], df.iloc[ii, 1], df.iloc[ii, 2])
            if dist < bestD or bestD == -1:
                bestD = dist
                bestP = Neighbour.iloc[ii, 0]
    Neighbour.loc[i, 'Closest'] = bestP
    Neighbour.loc[i, 'Distance'] = bestD
Neighbour
   Points  Closest  Distance
0       0      2.0  1.414214
1       1      0.0  2.236068
2       2      0.0  1.414214
3       3      1.0  2.236068
Is there a more efficient way to fill the Closest and Distance columns (in particular, without the for loops)? It might be an appropriate occasion to use map and reduce, but I don't really see how.
Upvotes: 3
Views: 336
Reputation: 935
To complete @politinsa's answer, the following script compares the performance of both methods:
from sklearn.datasets import make_moons
from sklearn.utils import check_random_state
import numpy as np
import timeit
import pandas as pd
from math import sqrt
from scipy.spatial.distance import cdist
def distance(x1, y1, x2, y2):
    dist = sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist
X,y = make_moons(n_samples=1000, noise=0.1)
W = list(range(1000))
rs = check_random_state(0)
Z = rs.randint(0, 10, size=(1000,))
df = pd.DataFrame(dict(Points=W, Axis1=X[:,0], Axis2=X[:,1],ClusterId=Z))
Neighbour = pd.DataFrame()
Neighbour['Points'] = df['Points']
Neighbour['Closest'] = np.nan
Neighbour['Distance'] = np.nan
start = timeit.default_timer()
for i in range(len(Neighbour['Points'])):
    bestD = -1  # best distance
    for ii in range(len(Neighbour['Points'])):
        if df.loc[i, "ClusterId"] != df.loc[ii, "ClusterId"]:  # if not in the same cluster
            dist = distance(df.iloc[i, 1], df.iloc[i, 2], df.iloc[ii, 1], df.iloc[ii, 2])
            if dist < bestD or bestD == -1:
                bestD = dist
                bestP = Neighbour.iloc[ii, 0]
    Neighbour.loc[i, 'Closest'] = int(bestP)
    Neighbour.loc[i, 'Distance'] = bestD
stop = timeit.default_timer()
print('Time initial script: ', stop - start)
start = timeit.default_timer()
distance_matrix = cdist(df.values[:, 1:3], df.values[:, 1:3])
np.fill_diagonal(distance_matrix, np.inf) # set diagonal to inf so minimum isn't distance(x, x) = 0
for i in range(len(Neighbour['Points'])):
    same_cluster = True
    while same_cluster:
        index_min = np.argmin(distance_matrix[i])
        same_cluster = (df.loc[i, "ClusterId"] == df.loc[index_min, "ClusterId"])
        if same_cluster:
            distance_matrix[i][index_min] = np.inf
    Neighbour.loc[i, 'Closest'] = index_min
    Neighbour.loc[i, 'Distance'] = distance_matrix[i][index_min]
stop = timeit.default_timer()
print('Time @politinsa\'s script: ', stop - start)
Out (in seconds):
Time initial script: 70.62462342600003
Time @politinsa's script: 0.6489833670000235
Upvotes: 1
Reputation: 3720
To compute the distances you can use scipy.spatial.distance.cdist
on the underlying ndarray of your DataFrame. This might be faster than your double loop.
>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> distance_matrix = cdist(df.values[:, 1:3], df.values[:, 1:3], 'euclidean')
>>> distance_matrix
array([[0.        , 2.23606798, 1.41421356, 4.47213595],
       [2.23606798, 0.        , 1.        , 2.23606798],
       [1.41421356, 1.        , 0.        , 3.16227766],
       [4.47213595, 2.23606798, 3.16227766, 0.        ]])
>>> np.fill_diagonal(distance_matrix, np.inf) # set diagonal to inf so minimum isn't distance(x, x) = 0
>>> distance_matrix
array([[       inf, 2.23606798, 1.41421356, 4.47213595],
       [2.23606798,        inf, 1.        , 2.23606798],
       [1.41421356, 1.        ,        inf, 3.16227766],
       [4.47213595, 2.23606798, 3.16227766,        inf]])
To speed things up you could also check the pdist
function instead of cdist: it takes far less memory, which matters once you have 50_000 rows.
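For example (a sketch, untested), pdist returns the condensed distance matrix, which squareform can expand back into the square matrix used above:

from scipy.spatial.distance import pdist, squareform

# pdist stores only the n*(n-1)/2 pairwise distances, not the full n*n matrix
condensed = pdist(df.values[:, 1:3])
distance_matrix = squareform(condensed)  # same matrix as the cdist call
np.fill_diagonal(distance_matrix, np.inf)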
There is also KDTree,
which aims to find the nearest neighbours of a point.
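A rough sketch of the KDTree route (untested; note that it does not handle the cluster constraint by itself):

from scipy.spatial import KDTree

tree = KDTree(df[['Axis1', 'Axis2']].values)
# k=2 because the nearest neighbour of each point is the point itself;
# filtering out same-cluster hits would still need extra handling
dists, idx = tree.query(df[['Axis1', 'Axis2']].values, k=2)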
Then, coming back to the distance matrix, you can use np.argmin
to get the closest distance, and check whether that closest point is in the same cluster, like this (I didn't try it though):
for i in range(len(Neighbour['Points'])):
    same_cluster = True
    while same_cluster:
        index_min = np.argmin(distance_matrix[i])
        same_cluster = (df.loc[i, "ClusterId"] == df.loc[index_min, "ClusterId"])
        if same_cluster:
            distance_matrix[i][index_min] = np.inf
    Neighbour.loc[i, 'Closest'] = index_min
    Neighbour.loc[i, 'Distance'] = distance_matrix[i][index_min]
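For what it's worth, the same-cluster check can also be vectorised so the while loop disappears entirely (again a sketch, untested, assuming a fresh distance_matrix straight from cdist): mask every same-cluster pair with inf up front, then take the row-wise minimum.

# boolean (n, n) mask: True where two points share a cluster (incl. the diagonal)
same = df['ClusterId'].values[:, None] == df['ClusterId'].values[None, :]
distance_matrix[same] = np.inf
Neighbour['Closest'] = np.argmin(distance_matrix, axis=1)
Neighbour['Distance'] = np.min(distance_matrix, axis=1)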
Upvotes: 2
Reputation: 63
You can create the Cartesian product first, then compute a new distance column by applying the distance function below row-wise:
import math

def distance(row):
    x1 = row.Axis1_x
    y1 = row.Axis2_x
    x2 = row.Axis1_y
    y2 = row.Axis2_y
    dist = math.sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist
df = pd.DataFrame({'Points':[0,1,2,3],'Axis1':[1,2,2,3], 'Axis2':[4,2,3,0],'ClusterId':[1,2,2,3]})
df['join_key'] = '12345'
df = df.merge(df, how='outer', on='join_key')
df['distance'] = df.apply(distance, axis=1)
df = df.drop(columns=['join_key'])
This gives you a Cartesian DataFrame in which each row holds the distance from one point to another (the original columns get _x and _y suffixes after the merge). I assume that this is the hardest part. Please let me know if it helps.
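If it helps, here is one way to finish from there (a sketch, untested, assuming the default _x/_y merge suffixes): drop the same-cluster pairs, then keep the per-point minimum.

other = df[df['ClusterId_x'] != df['ClusterId_y']]    # exclude same-cluster pairs
idx = other.groupby('Points_x')['distance'].idxmin()  # row of the minimum per point
Neighbour = other.loc[idx, ['Points_x', 'Points_y', 'distance']]
Neighbour.columns = ['Points', 'Closest', 'Distance']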
Upvotes: 0