Ewdlam

Reputation: 935

How to optimize a double for loop with condition on pandas dataframe?

I have these two DataFrames:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Points':[0,1,2,3],'Axis1':[1,2,2,3], 'Axis2':[4,2,3,0],'ClusterId':[1,2,2,3]})
df
   Points  Axis1  Axis2  ClusterId
0       0      1      4          1
1       1      2      2          2
2       2      2      3          2
3       3      3      0          3

Neighbour = pd.DataFrame()
Neighbour['Points'] = df['Points']
Neighbour['Closest'] = np.nan
Neighbour['Distance'] = np.nan

Neighbour
   Points  Closest  Distance
0       0      NaN       NaN
1       1      NaN       NaN
2       2      NaN       NaN
3       3      NaN       NaN

I would like the Closest column to contain the closest point that is NOT in the same cluster (ClusterId in df), based on the following distance function applied to Axis1 and Axis2:

from math import sqrt

def distance(x1,y1,x2,y2):
    dist = sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist

And I would like the Distance column to contain the distance between each point and its closest point.
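For instance, point 0 at (1, 4) is in cluster 1; the closest point outside cluster 1 is point 2 at (2, 3), so its Closest should be 2 and its Distance should be sqrt((1-2)**2 + (4-3)**2) = sqrt(2) ≈ 1.414214.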

The following script works, but I don't think it is the best way to do this in Python:

for i in range(len(Neighbour['Points'])): 
    bestD = -1 #best distance
    #bestP for best point
    for ii in range(len(Neighbour['Points'])): 
        if df.loc[i,"ClusterId"] != df.loc[ii,"ClusterId"]: #if not share the same cluster
            dist = distance(df.iloc[i,1],df.iloc[i,2],df.iloc[ii,1],df.iloc[ii,2])
            if dist < bestD or bestD == -1:
                bestD = dist
                bestP = Neighbour.iloc[ii,0]
    Neighbour.loc[i,'Closest'] = bestP
    Neighbour.loc[i,'Distance'] = bestD

Neighbour
   Points  Closest  Distance
0       0      2.0  1.414214
1       1      0.0  2.236068
2       2      0.0  1.414214
3       3      1.0  2.236068

Is there a more efficient way to fill the Closest and Distance columns (in particular, without the for loops)? This might be an appropriate occasion to use map and reduce, but I don't really see how.

Upvotes: 3

Views: 336

Answers (3)

Ewdlam

Reputation: 935

To complement @politinsa's answer, the following script compares the performance of both methods:

from sklearn.datasets import make_moons
from sklearn.utils import check_random_state
import numpy as np
import timeit
import pandas as pd
from math import sqrt
from scipy.spatial.distance import cdist

def distance(x1,y1,x2,y2):
    dist = sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist 

X,y = make_moons(n_samples=1000, noise=0.1)
W = list(range(1000))
rs = check_random_state(0)
Z = rs.randint(0, 10, size=(1000,))
df = pd.DataFrame(dict(Points=W, Axis1=X[:,0], Axis2=X[:,1],ClusterId=Z))
Neighbour = pd.DataFrame()
Neighbour['Points'] = df['Points']
Neighbour['Closest'] = np.nan
Neighbour['Distance'] = np.nan

start = timeit.default_timer()

for i in range(len(Neighbour['Points'])): 
    bestD = -1 #best distance
    for ii in range(len(Neighbour['Points'])): 
        if df.loc[i,"ClusterId"] != df.loc[ii,"ClusterId"]: #if not share the same cluster
            dist = distance(df.iloc[i,1],df.iloc[i,2],df.iloc[ii,1],df.iloc[ii,2])
            if dist < bestD or bestD == -1:
                bestD = dist
                bestP = Neighbour.iloc[ii,0]
    Neighbour.loc[i,'Closest'] = int(bestP)
    Neighbour.loc[i,'Distance'] = bestD

stop = timeit.default_timer()
print('Time initial script: ', stop - start)

start = timeit.default_timer()

distance_matrix = cdist(df.values[:, 1:3], df.values[:, 1:3])
np.fill_diagonal(distance_matrix, np.inf) # set diagonal to inf so minimum isn't distance(x, x) = 0
for i in range(len(Neighbour['Points'])):
    same_cluster = True
    while same_cluster:
        index_min = np.argmin(distance_matrix[i])
        same_cluster = (df.loc[i,"ClusterId"] == df.loc[index_min,"ClusterId"])
        if same_cluster:
            distance_matrix[i][index_min] = np.inf
    Neighbour.loc[i,'Closest'] = index_min
    Neighbour.loc[i,'Distance'] = distance_matrix[i][index_min]
stop = timeit.default_timer()
print('Time @politinsa\'s script: ', stop - start) 

Out (in seconds):

Time initial script:  70.62462342600003
Time @politinsa's script:  0.6489833670000235
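
Most of the gap comes from the initial script doing O(n²) Python-level iterations with per-element .loc/.iloc lookups, whereas cdist computes the full distance matrix in a single vectorized call.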

Upvotes: 1

politinsa

Reputation: 3720

To compute the distances you can use scipy.spatial.distance.cdist on the underlying ndarray of your DataFrame. This should be faster than your double loop.

>>> import numpy as np
>>> from scipy.spatial.distance import cdist

>>> distance_matrix = cdist(df.values[:, 1:3], df.values[:, 1:3], 'euclidean')
>>> distance_matrix
array([[0.        , 2.23606798, 1.41421356, 4.47213595],
       [2.23606798, 0.        , 1.        , 2.23606798],
       [1.41421356, 1.        , 0.        , 3.16227766],
       [4.47213595, 2.23606798, 3.16227766, 0.        ]])
>>> np.fill_diagonal(distance_matrix, np.inf) # set diagonal to inf so minimum isn't distance(x, x) = 0
>>> distance_matrix
array([[       inf, 2.23606798, 1.41421356, 4.47213595],
       [2.23606798,        inf, 1.        , 2.23606798],
       [1.41421356, 1.        ,        inf, 3.16227766],
       [4.47213595, 2.23606798, 3.16227766,        inf]])

To speed things up you could also look at the pdist function instead of cdist; it uses far less memory, which matters once you have 50_000 rows.
There is also KDTree, which is designed for finding the nearest neighbours of a point.
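
For example, a rough sketch of the KDTree route, assuming the df and Neighbour frames from the question and a default integer index (untested): build one tree per cluster on the points outside that cluster, then query all of that cluster's points at once.

from scipy.spatial import cKDTree

coords = df[['Axis1', 'Axis2']].values
for cluster_id, idx in df.groupby('ClusterId').groups.items():
    other = (df['ClusterId'] != cluster_id).values   # points in other clusters (assumes every cluster has outsiders)
    other_index = df.index[other]
    tree = cKDTree(coords[other])
    dist, pos = tree.query(coords[idx])              # nearest outside-cluster point for each point of this cluster
    Neighbour.loc[idx, 'Closest'] = df.loc[other_index[pos], 'Points'].values
    Neighbour.loc[idx, 'Distance'] = dist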

Then you can use np.argmin to get the closest distance and check whether the closest point is in the same cluster, like this (I didn't try it though):

for i in range(len(Neighbour['Points'])):
    same_cluster = True
    while same_cluster:
        index_min = np.argmin(distance_matrix[i])
        same_cluster = (df.loc[i,"ClusterId"] == df.loc[index_min,"ClusterId"])
        if same_cluster:
            distance_matrix[i][index_min] = np.inf
    Neighbour.loc[i,'Closest'] = index_min
    Neighbour.loc[i,'Distance'] = distance_matrix[i][index_min]
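
Alternatively, the while loop can be avoided entirely by masking every same-cluster pair with inf up front and taking a single argmin per row; a sketch under the same assumptions (untested):

cluster = df['ClusterId'].values
distance_matrix[cluster[:, None] == cluster[None, :]] = np.inf  # masks same-cluster pairs, including the diagonal
closest = distance_matrix.argmin(axis=1)                        # index of the nearest point in another cluster
Neighbour['Closest'] = df['Points'].values[closest]
Neighbour['Distance'] = distance_matrix[np.arange(len(df)), closest]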

Upvotes: 2

baranbartu

Reputation: 63

You can create the Cartesian product first and then add a distance column using the distance function below:

import math

def distance(row):
    x1 = row.Axis1_x
    y1 = row.Axis2_x
    x2 = row.Axis1_y
    y2 = row.Axis2_y
    dist = math.sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist


df = pd.DataFrame({'Points':[0,1,2,3],'Axis1':[1,2,2,3], 'Axis2':[4,2,3,0],'ClusterId':[1,2,2,3]})
df['join_key'] = '12345'                       # constant key so the self-merge becomes a cross join
df = df.merge(df, how='outer', on='join_key')  # Cartesian product of the points with themselves
df['distance'] = df.apply(distance, axis=1)
df = df.drop(columns=['join_key'])

You will then see a Cartesian DataFrame like the one below.

[screenshot of the cross-joined DataFrame with the distance column]

From there you have the distance from each point to every other point, which I assume is the hardest part. Please let me know if it helps.
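
One possible way to finish from there (a sketch, assuming the _x/_y suffixes pandas adds in the merge above, untested): drop the same-cluster pairs, then keep the minimum-distance row per point with groupby(...).idxmin().

pairs = df[df['ClusterId_x'] != df['ClusterId_y']]            # keep different-cluster pairs only (also drops self-pairs)
best = pairs.loc[pairs.groupby('Points_x')['distance'].idxmin()]
Neighbour = best[['Points_x', 'Points_y', 'distance']].rename(
    columns={'Points_x': 'Points', 'Points_y': 'Closest', 'distance': 'Distance'}).reset_index(drop=True)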

Upvotes: 0
