Ernesto Lopez Fune

Reputation: 583

Find nearest neighbors

I have a large dataframe of the form:

    user_id  time_interval  A      B       C       D       E       F       G       H    ... Z
0   12166    2.0            3.0    1.0     1.0     1.0     3.0     1.0     1.0     1.0  ... 0.0
1   12167    0.0            0.0    1.0     0.0     0.0     1.0     0.0     0.0     1.0  ... 0.0
2   12168    0.0            0.0    1.0     0.0     0.0     1.0     0.0     0.0     1.0  ... 0.0
3   12169    0.0            0.0    1.0     0.0     0.0     1.0     0.0     0.0     1.0  ... 0.0
4   12170    0.0            0.0    1.0     0.0     0.0     1.0     0.0     0.0     1.0  ... 0.0
... ...      ...            ...    ...     ...     ...     ...     ...     ...     ...  ... ...

For each user_id, I would like to find the nearest neighbors within a radius r, using the columns A-Z as coordinates. For example, for r=0.1 the output should look like:

user_id    neighbors
12166      [12251,12345, ...]
12167      [12168, 12169,12170, ...]
...        ...

I tried looping over the user_id list, but it takes ages. I did something like this:

from scipy.spatial import distance  # a bare "import scipy" may not load scipy.spatial

coords = dataframe.values[:, 2:]  # columns A-Z as coordinates
user_ids = dataframe["user_id"].values
neighbors = []
for i in range(len(dataframe)):
    user_neighbors = [user_ids[j] for j in range(i + 1, len(dataframe))
                      if distance.euclidean(coords[i], coords[j]) < 0.1]
    neighbors.append([user_ids[i], user_neighbors])

It has been running for hours. Is there a faster, more Pythonic way to do this?

Upvotes: 1

Views: 542

Answers (1)

Rm4n

Reputation: 868

Here's how I've done it using the apply method. The dummy data consists of columns A-D plus an added column for the neighbors:

print(df)
   user_id  time_interval  A  B  C  D  neighbors
0    12166              2  3  2  2  3        NaN
1    12167              0  1  4  3  3        NaN
2    12168              0  4  3  3  1        NaN
3    12169              0  2  2  3  2        NaN
4    12170              0  3  3  1  1        NaN

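The custom function:

def func(row, r=2.5):  # r is the distance threshold
    # Euclidean distance from this row to every row, over the coordinate columns A-D
    dist = ((df.iloc[:, 2:-1] - row.iloc[2:-1]) ** 2).sum(axis=1) ** 0.5
    out = df.loc[dist.le(r), 'user_id'].to_list()
    out.remove(row['user_id'])  # a user is not its own neighbor
    return out

# Assign the result once instead of writing into df while apply is iterating over it
df['neighbors'] = df.apply(func, axis=1)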

The output:

print(df)
   user_id  time_interval  A  B  C  D              neighbors
0    12166              2  3  2  2  3         [12169, 12170]
1    12167              0  1  4  3  3                [12169]
2    12168              0  4  3  3  1         [12169, 12170]
3    12169              0  2  2  3  2  [12166, 12167, 12168]
4    12170              0  3  3  1  1         [12166, 12168]

Let me know if it outperforms the for-loop approach.
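
If apply is still too slow on the full dataframe, a spatial index might be worth a try. The following is only a minimal sketch, assuming scipy is installed and using scipy.spatial.cKDTree with its query_ball_point method. The helper name radius_neighbors and its coord_cols parameter are made up for illustration, and the data is the same dummy data as above. I have not timed it on your data.

```python
import pandas as pd
from scipy.spatial import cKDTree


def radius_neighbors(df, coord_cols, r):
    """Map each user_id to the user_ids whose coordinates lie within distance r.

    Hypothetical helper: the name and signature are illustrative only.
    """
    coords = df[coord_cols].to_numpy(dtype=float)
    ids = df["user_id"].to_numpy()
    tree = cKDTree(coords)
    # For every point, the indices of all points within Euclidean distance r
    # (the point itself is always included, so it is filtered out below).
    hits = tree.query_ball_point(coords, r)
    neighbors = [
        sorted(int(ids[j]) for j in idx if j != i)
        for i, idx in enumerate(hits)
    ]
    return pd.DataFrame({"user_id": ids, "neighbors": neighbors})


# Same dummy data as in the answer above
df = pd.DataFrame({
    "user_id": [12166, 12167, 12168, 12169, 12170],
    "time_interval": [2, 0, 0, 0, 0],
    "A": [3, 1, 4, 2, 3],
    "B": [2, 4, 3, 2, 3],
    "C": [2, 3, 3, 3, 1],
    "D": [3, 3, 1, 2, 1],
})

result = radius_neighbors(df, ["A", "B", "C", "D"], r=2.5)
print(result)
```

On the dummy data this gives the same neighbor lists as the apply version. One caveat: KD-trees tend to lose their advantage as the number of dimensions grows, so with 26 coordinate columns (A-Z) the gain over a brute-force search may be smaller than in low dimensions, although it should still avoid the Python-level double loop.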

Upvotes: 1
