Reputation: 813
My current dataset of users looks like this:
For any new user who joins, I need to find the most similar existing user in the dataset.
Earlier I tried building a sparse matrix with the pivot command on one particular feature and then using the corrwith method.
How can I do it taking into account all the features at once?
All I need is the id of an existing user.
Upvotes: 2
Views: 5413
Reputation: 4264
You could compute the Euclidean distance between your new user and every existing user in the data frame and use it as a measure of dissimilarity, then return the user with the minimum dissimilarity. However, you should first make sure that all your features are scaled to comparable ranges, since we don't want features measured over wide ranges to overpower the ones measured over small ranges.
import pandas as pd
import numpy as np
from sklearn import preprocessing
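# StandardScaler rescales each feature (column); Normalizer would rescale each row to unit norm instead
scaler = preprocessing.StandardScaler()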
df = df_original.drop(["id"], axis=1) # we don't want `id` to participate in dissimilarity measure
scaled_data = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_data, columns=df.columns, index=df.index) # keep the original row labels
new_user_original = np.array([999999, 50, 1, 72, 160, 4, 2, 5])
new_user = new_user_original[1:len(new_user_original)]
new_user_scaled = scaler.transform(np.expand_dims(new_user, axis=0))
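# Euclidean distance from the new user to every existing user
dist_df = pd.DataFrame(columns=["index", "distance"])
for idx, row in df_scaled.iterrows():
    dist = np.linalg.norm(row - np.squeeze(new_user_scaled, 0))
    dist_df.loc[idx, :] = [idx, dist]
# look up the closest user before the new user is added to df_original
index_most_similar = df_original.loc[dist_df["distance"].astype(float).idxmin(), "id"]
df_original.loc[len(df_original)] = new_user_original # append the new user as a new row (assumes the default 0..n-1 index)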
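If the data frame gets large, the same idea can probably be done without the explicit loop. Below is a minimal sketch that uses only NumPy and pandas: it z-scores each feature by hand, computes all the Euclidean distances in one go, and returns the `id` of the closest existing user. The function name `most_similar_id` and the toy columns (`age`, `weight`, `height`) are made up for illustration, since the real columns are not shown in the question.

```python
import numpy as np
import pandas as pd


def most_similar_id(users: pd.DataFrame, new_user: pd.Series, id_col: str = "id"):
    """Return the id of the existing user closest to new_user.

    Distance is Euclidean, computed on z-scored features.
    """
    features = users.drop(columns=[id_col]).astype(float)
    mean = features.mean()
    # Constant columns have zero spread; replace 0 with 1 to avoid dividing by zero
    std = features.std(ddof=0).replace(0, 1.0)
    scaled = (features - mean) / std
    # Scale the new user with the statistics of the existing users
    new_scaled = (new_user[features.columns].astype(float) - mean) / std
    # One distance per existing user, no explicit loop
    distances = np.sqrt(((scaled - new_scaled) ** 2).sum(axis=1))
    return users.loc[distances.idxmin(), id_col]


# Toy data with hypothetical columns, only to show the call
users = pd.DataFrame({
    "id": [101, 102, 103],
    "age": [25, 52, 33],
    "weight": [60, 75, 90],
    "height": [170, 162, 185],
})
new_user = pd.Series({"id": 999999, "age": 50, "weight": 72, "height": 160})

closest = most_similar_id(users, new_user)
print(closest)  # 102
```

The new user is scaled with the mean and standard deviation of the existing users, so adding them to the data frame afterwards does not change the result of this lookup.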
Upvotes: 3