Reputation: 1066
An easier way has been added at the end of the question.
I have a user-user correlation matrix called corr_of_user
like the one below:
userId 316 320 359 370 910
userId
316 1.000000 0.202133 0.208618 0.176050 0.174035
320 0.202133 1.000000 0.242837 0.019035 0.031737
359 0.208618 0.242837 1.000000 0.357620 0.175914
370 0.176050 0.019035 0.357620 1.000000 0.317371
910 0.174035 0.031737 0.175914 0.317371 1.000000
For every user, I want to keep only the 2 other users most similar to them (the highest correlation values per row, after excluding the diagonal). Like so:
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
I know how to achieve this, but the way I came up with is too complicated. Could anyone suggest a better approach?
I first melt
the matrix:
melted_corr = corr_of_user.reset_index().melt(id_vars ="userId",var_name="corr_user")
melted_corr.head()
Out[23]:
userId corr_user value
0 316 316 1.000000
1 320 316 0.202133
2 359 316 0.208618
3 370 316 0.176050
4 910 316 0.174035
filter
it row by row:
get_secend_third = lambda x: x.sort_values(ascending=False).iloc[1:3]
filted = melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_secend_third)
filted
Out[39]:
corr_user userId
316 359 0.208618
320 0.202133
320 359 0.242837
316 0.202133
359 370 0.357620
320 0.242837
370 359 0.357620
910 0.317371
910 370 0.317371
359 0.175914
and finally reshape
it:
filted.reset_index().pivot_table("value","corr_user","userId")
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
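Put together, the melt → groupby → pivot pipeline above runs end to end; here is a minimal self-contained sketch that rebuilds the matrix by hand from the values in the question:

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
vals = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(vals, index=pd.Index(ids, name="userId"),
                            columns=pd.Index(ids, name="userId"))

# long format: one (userId, corr_user, value) row per matrix cell
melted_corr = corr_of_user.reset_index().melt(id_vars="userId",
                                              var_name="corr_user")

# per corr_user, drop the largest value (the 1.0 self-correlation)
# and keep the next two
top2 = (melted_corr.set_index("userId")
        .groupby("corr_user")["value"]
        .apply(lambda s: s.sort_values(ascending=False).iloc[1:3]))

# back to a wide matrix: NaN everywhere except the two kept neighbours
result = top2.reset_index().pivot_table("value", "corr_user", "userId")
print(result)
```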
I came up with an easier way to do this after seeing the answer of @John Zwinck.
Let's say there is a new matrix df
with some duplicated values and NaN:
userId 316 320 359 370 910
userId
316 1.0 0.500000 0.500000 0.500000 NaN
320 0.5 1.000000 0.242837 0.019035 0.031737
359 0.5 0.242837 1.000000 0.357620 0.175914
370 0.5 0.019035 0.357620 1.000000 0.317371
910 NaN 0.031737 0.175914 0.317371 1.000000
First I get the rank
of each row:
rank = df.rank(1, ascending=False, method="first")
Then I use isin()
on the ranks to get the mask that I want.
mask = rank.isin(list(range(2,4)))
Finally
df.where(mask)
and I get what I want:
userId 316 320 359 370 910
userId
316 NaN 0.5 0.500000 NaN NaN
320 0.5 NaN 0.242837 NaN NaN
359 0.5 NaN NaN 0.357620 NaN
370 0.5 NaN 0.357620 NaN NaN
910 NaN NaN 0.175914 0.317371 NaN
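A self-contained run of this rank-and-mask approach, rebuilding the df with duplicates and NaN shown above (a sketch; isin([2, 3]) is just a spelled-out version of range(2, 4)):

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
df = pd.DataFrame(
    [[1.0,    0.5,      0.5,      0.5,      np.nan],
     [0.5,    1.0,      0.242837, 0.019035, 0.031737],
     [0.5,    0.242837, 1.0,      0.357620, 0.175914],
     [0.5,    0.019035, 0.357620, 1.0,      0.317371],
     [np.nan, 0.031737, 0.175914, 0.317371, 1.0]],
    index=pd.Index(ids, name="userId"),
    columns=pd.Index(ids, name="userId"))

# rank every row in descending order; method="first" gives duplicated
# values distinct ranks, so ties (the 0.5s) cannot select more than 2 cells
rank = df.rank(1, ascending=False, method="first")

# rank 1 is the 1.0 self-correlation; ranks 2 and 3 are the two best neighbours
mask = rank.isin([2, 3])
result = df.where(mask)
print(result)
```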
Upvotes: 5
Views: 296
Reputation: 249183
First, use np.argsort()
to find which locations have the highest values:
sort = np.argsort(df)
This gives a DataFrame whose column names are meaningless, but the second and third columns from the right contain the desired indices within each row:
316 320 359 370 910
userId
316 4 3 1 2 0
320 3 4 0 2 1
359 4 0 1 3 2
370 1 0 4 2 3
910 1 0 2 3 4
Next, construct a boolean mask, set to True at the locations above:
mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort.iloc[:,-2]] = True
mask[rows, sort.iloc[:,-3]] = True
Now you have the mask you need:
array([[False, True, True, False, False],
[ True, False, True, False, False],
[False, True, False, True, False],
[False, False, True, False, True],
[False, False, True, True, False]], dtype=bool)
Finally, df.where(mask)
:
316 320 359 370 910
userId
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
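The same steps as a self-contained sketch. One caveat: np.argsort(df) returned a DataFrame under the pandas version used in the answer; to stay version-proof, this sketch argsorts the underlying ndarray instead (my assumption; the masking logic is unchanged):

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
vals = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
df = pd.DataFrame(vals, index=pd.Index(ids, name="userId"),
                  columns=pd.Index(ids, name="userId"))

# per row, ascending positions: the last entry locates the 1.0 diagonal,
# the two before it locate the top-2 correlations
sort = np.argsort(df.to_numpy(), axis=1)

mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort[:, -2]] = True
mask[rows, sort[:, -3]] = True

result = df.where(mask)
print(result)
```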
Upvotes: 4
Reputation: 4427
Here is my numpy-esque solution:
top_k = 3
top_corr = corr_of_user.copy()
top_ndarray = top_corr.values
np.fill_diagonal(top_ndarray, np.NaN)
rows = np.arange(top_corr.shape[0])[:, np.newaxis]
columns = top_ndarray.argsort()[:, :-top_k]
top_ndarray[rows, columns] = np.NaN
top_corr
And we get
userId 316 320 359 370 910
userId
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
You can replace top_corr = corr_of_user.copy()
with top_corr = corr_of_user
if you don't want a copy but rather an in-place solution.
The idea is pretty much the same as John Zwinck's: get the indices of the necessary fields and use them to index into the array, clearing the values we don't need. A slight advantage of my solution is that K (the number of top results we want) is a parameter rather than hardcoded. It also works when corr_of_user
contains all 1
s.
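A runnable version of this answer. Since mutating a DataFrame through .values is no longer guaranteed to work on newer pandas (copy-on-write), this sketch operates on an explicit ndarray copy and wraps the result back into a frame; that packaging is my assumption, the fill_diagonal/argsort logic is the answer's:

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
vals = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(vals, index=pd.Index(ids, name="userId"),
                            columns=pd.Index(ids, name="userId"))

top_k = 3  # keeps top_k cells per row; the NaN diagonal sorts last,
           # so this leaves the 2 best neighbours visible

a = corr_of_user.to_numpy(copy=True)
np.fill_diagonal(a, np.nan)                 # never select the self-correlation
rows = np.arange(a.shape[0])[:, np.newaxis]
cols = a.argsort()[:, :-top_k]              # everything except the top_k largest
a[rows, cols] = np.nan

top_corr = pd.DataFrame(a, index=corr_of_user.index,
                        columns=corr_of_user.columns)
print(top_corr)
```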
Upvotes: 1
Reputation: 2696
This should work; note that within each group the 1.0 self-correlation takes rank 1, so the two most similar users sit at ranks 2 and 3:
melted_corr['group_rank']=melted_corr.groupby('userId')['value']\
.rank(ascending=False)
then select the top-2 per user with:
melted_corr[melted_corr.group_rank.between(2, 3)]
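A self-contained sketch of this groupby-rank route, starting from the question's melted frame; because the 1.0 self-correlation always ranks first within its group, the filter keeps ranks 2 and 3:

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
vals = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(vals, index=pd.Index(ids, name="userId"),
                            columns=pd.Index(ids, name="userId"))
melted_corr = corr_of_user.reset_index().melt(id_vars="userId",
                                              var_name="corr_user")

# rank values within each row of the original matrix (one userId group
# per row); the 1.0 self-correlation always takes rank 1
melted_corr["group_rank"] = (melted_corr.groupby("userId")["value"]
                             .rank(ascending=False, method="first"))

# ranks 2 and 3 are the two most similar other users
top2 = melted_corr[melted_corr.group_rank.between(2, 3)]
print(top2)
```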
Upvotes: 1