Reputation: 813

Find similar rows based on mixed datatypes in pandas dataframe

My current dataset of users looks like this. For any new user who joins, I need to find the most similar existing user in the dataset.

Earlier I tried building a sparse matrix with the pivot command on one particular feature and then using the corrwith method. How can I do this while taking all the features into account at once? All I need is the id of the most similar existing user.

Upvotes: 2

Answers (1)

Parthasarathy Subburaj

Reputation: 4264

You could compute the Euclidean distance between the new user and every existing user in the data frame and use it as a measure of dissimilarity, then return the existing user with the smallest distance. Before doing so, make sure all features are scaled to comparable ranges, since we don't want features measured on wide ranges to overpower the ones measured on narrow ranges.
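To see why the scaling step matters, here is a small standalone sketch using only the standard library. The two users, the feature meanings (age, height in cm, rating) and the `RANGES` bounds are made-up values for illustration, not taken from the question's dataset.

```python
import math

# Hypothetical (min, max) bounds per feature: age, height in cm, rating.
RANGES = [(18, 80), (140, 200), (1, 5)]

def min_max(row):
    """Rescale every feature of a row to the 0..1 interval using RANGES."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(row, RANGES))

def nearest(users, target):
    """Return the key of the user with the smallest Euclidean distance to target."""
    return min(users, key=lambda k: math.dist(users[k], target))

# Hypothetical existing users and a new user.
existing = {"a": (30, 150, 5), "b": (31, 172, 1)}
new_user = (30, 170, 5)

raw_nearest = nearest(existing, new_user)
scaled_nearest = nearest(
    {k: min_max(v) for k, v in existing.items()}, min_max(new_user)
)
print(raw_nearest, scaled_nearest)  # prints: b a
```

On the raw values the 20 cm height gap dominates, so user "b" looks closest even though its rating is at the opposite end of the scale. After rescaling, the rating difference carries comparable weight and user "a" becomes the nearest one.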

import pandas as pd
import numpy as np
from sklearn import preprocessing

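# StandardScaler rescales each feature (column); Normalizer would rescale each row (user) instead
scaler = preprocessing.StandardScaler()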

df = df_original.drop(["id"], axis=1)          # we don't want `id` to participate in dissimilarity measure
scaled_data = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_data, columns= df.columns)

new_user_original = np.array([999999, 50, 1, 72, 160, 4, 2, 5])   # first entry is the new user's id
new_user = new_user_original[1:]                                  # drop the id, keep only the features
new_user_scaled = scaler.transform(np.expand_dims(new_user, axis=0))

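# Euclidean distance from the new user to every existing user, computed in one vectorized call
distances = pd.Series(
    np.linalg.norm(df_scaled.to_numpy() - new_user_scaled, axis=1),
    index=df_original.index,
)

# id of the existing user with the smallest distance (i.e. the most similar one)
index_most_similar = df_original.loc[distances.idxmin(), "id"]

# append the new user under a fresh label (assumes the default 0..n-1 index)
# rather than overwriting the last existing row
df_original.loc[len(df_original)] = new_user_original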
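Since the snippet above depends on a `df_original` that is not shown, here is a self-contained sketch of the same idea that can be run as is. The column names, the three sample users and the helper name `most_similar_user_id` are all hypothetical, and the standardization is done by hand with pandas rather than with scikit-learn.

```python
import numpy as np
import pandas as pd

def most_similar_user_id(users, new_user, id_col="id"):
    """Return the id of the existing user closest to new_user.

    Features are standardized (zero mean, unit variance) using the
    statistics of the existing users before distances are computed.
    """
    features = users.drop(columns=[id_col]).astype(float)
    mean = features.mean()
    # Replace a zero standard deviation with 1 so constant columns do not divide by zero.
    std = features.std(ddof=0).replace(0, 1.0)

    scaled = (features - mean) / std
    new_scaled = (new_user[features.columns].astype(float) - mean) / std

    # Row-wise Euclidean distance; the Series is aligned on column names.
    distances = np.sqrt(((scaled - new_scaled) ** 2).sum(axis=1))
    return users.loc[distances.idxmin(), id_col]

# Hypothetical data: three existing users and one new user.
users = pd.DataFrame({
    "id": [101, 102, 103],
    "age": [25, 52, 40],
    "height": [180, 158, 170],
    "rating": [2, 5, 3],
})
new_user = pd.Series({"id": 999999, "age": 50, "height": 160, "rating": 5})

print(most_similar_user_id(users, new_user))  # prints: 102
```

Passing the new user as a Series keyed by column name avoids relying on positional order, which is easy to get wrong once the data frame has more than a few features.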

Upvotes: 3
