Reputation: 583
I have a large dataframe of the form:
user_id time_interval A B C D E F G H ... Z
0 12166 2.0 3.0 1.0 1.0 1.0 3.0 1.0 1.0 1.0 ... 0.0
1 12167 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
2 12168 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
3 12169 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
4 12170 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
I would like to find, for each user_id, based on the columns A-Z as coordinates,the closest neighbors within a 'radius' distance r. The output should look like, for example, for r=0.1:
user_id neighbors
12166 [12251,12345, ...]
12167 [12168, 12169,12170, ...]
... ...
I tried for-looping throughout the user_id list but it takes ages. I did something like this:
import scipy
neighbors = []
for i in range(len(dataframe)):
user_neighbors = [dataframe["user_id"][j] for j in range(i+1,len(dataframe)) if scipy.spatial.distance.euclidean(dataframe.values[i][2:],dataframe.values[j][2:])<0.1]
neighbors.append([dataframe["user_id"][i],user_neighbors])
and I have been waiting for hours. Is there a pythonic way to improve this?
Upvotes: 1
Views: 542
Reputation: 868
Here's how I've done it using apply
method.
The dummy data consisting of columns A-D with an added column for neighbors:
print(df)
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 NaN
1 12167 0 1 4 3 3 NaN
2 12168 0 4 3 3 1 NaN
3 12169 0 2 2 3 2 NaN
4 12170 0 3 3 1 1 NaN
the custom function:
def func(row):
r = 2.5 # the threshold
out = df[(((df.iloc[:, 2:-1] - row[2:-1])**2).sum(axis=1)**0.5).le(r)]['user_id'].to_list()
out.remove(row['user_id'])
df.loc[row.name, ['neighbors']] = str(out)
df.apply(func, axis=1)
the output:
print(df):
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 [12169, 12170]
1 12167 0 1 4 3 3 [12169]
2 12168 0 4 3 3 1 [12169, 12170]
3 12169 0 2 2 3 2 [12166, 12167, 12168]
4 12170 0 3 3 1 1 [12166, 12168]
Let me know if it outperforms the for-loop approach.
Upvotes: 1