Reputation: 1066
An easier way has been added at the end of the question.
I have a user-user correlation matrix called corr_of_user
like the one below:
userId 316 320 359 370 910
userId
316 1.000000 0.202133 0.208618 0.176050 0.174035
320 0.202133 1.000000 0.242837 0.019035 0.031737
359 0.208618 0.242837 1.000000 0.357620 0.175914
370 0.176050 0.019035 0.357620 1.000000 0.317371
910 0.174035 0.031737 0.175914 0.317371 1.000000
For every user, I want to keep only the 2 other users most similar to them (the highest correlation values per row, after excluding the diagonal). Like so:
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
I know how to achieve this, but the way I came up with is too complicated. Could anyone suggest a better approach?
I first melt
the matrix:
melted_corr = corr_of_user.reset_index().melt(id_vars ="userId",var_name="corr_user")
melted_corr.head()
Out[23]:
userId corr_user value
0 316 316 1.000000
1 320 316 0.202133
2 359 316 0.208618
3 370 316 0.176050
4 910 316 0.174035
filter
it row by row:
get_secend_third = lambda x: x.sort_values(ascending=False).iloc[1:3]
filted = melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_secend_third)
filted
Out[39]:
corr_user userId
316 359 0.208618
320 0.202133
320 359 0.242837
316 0.202133
359 370 0.357620
320 0.242837
370 359 0.357620
910 0.317371
910 370 0.317371
359 0.175914
and finally reshape
it:
filted.reset_index().pivot_table("value","corr_user","userId")
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
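Put together, the melt → groupby → pivot pipeline above runs end to end; here is a minimal self-contained sketch that rebuilds the matrix by hand from the values in the question:

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
vals = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(vals, index=pd.Index(ids, name="userId"),
                            columns=pd.Index(ids, name="userId"))

# long format: one (userId, corr_user, value) row per matrix cell
melted_corr = corr_of_user.reset_index().melt(id_vars="userId",
                                              var_name="corr_user")

# per corr_user, drop the largest value (the 1.0 self-correlation)
# and keep the next two
top2 = (melted_corr.set_index("userId")
        .groupby("corr_user")["value"]
        .apply(lambda s: s.sort_values(ascending=False).iloc[1:3]))

# back to a wide matrix: NaN everywhere except the two kept neighbours
result = top2.reset_index().pivot_table("value", "corr_user", "userId")
print(result)
```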
I came up with an easier way to do this after seeing the answer of @John Zwinck.
Let's say there is a new matrix df
with some duplicated values and NaN:
userId 316 320 359 370 910
userId
316 1.0 0.500000 0.500000 0.500000 NaN
320 0.5 1.000000 0.242837 0.019035 0.031737
359 0.5 0.242837 1.000000 0.357620 0.175914
370 0.5 0.019035 0.357620 1.000000 0.317371
910 NaN 0.031737 0.175914 0.317371 1.000000
First I get the rank
of each row:
rank = df.rank(1, ascending=False, method="first")
Then I use isin()
on the ranks to get the mask that I want.
mask = rank.isin(list(range(2,4)))
Finally
df.where(mask)
and I get what I want:
userId 316 320 359 370 910
userId
316 NaN 0.5 0.500000 NaN NaN
320 0.5 NaN 0.242837 NaN NaN
359 0.5 NaN NaN 0.357620 NaN
370 0.5 NaN 0.357620 NaN NaN
910 NaN NaN 0.175914 0.317371 NaN
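A self-contained run of this rank-and-mask approach, rebuilding the df with duplicates and NaN shown above (a sketch; isin([2, 3]) is just a spelled-out version of range(2, 4)):

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
df = pd.DataFrame(
    [[1.0,    0.5,      0.5,      0.5,      np.nan],
     [0.5,    1.0,      0.242837, 0.019035, 0.031737],
     [0.5,    0.242837, 1.0,      0.357620, 0.175914],
     [0.5,    0.019035, 0.357620, 1.0,      0.317371],
     [np.nan, 0.031737, 0.175914, 0.317371, 1.0]],
    index=pd.Index(ids, name="userId"),
    columns=pd.Index(ids, name="userId"))

# rank every row in descending order; method="first" gives duplicated
# values distinct ranks, so ties (the 0.5s) cannot select more than 2 cells
rank = df.rank(1, ascending=False, method="first")

# rank 1 is the 1.0 self-correlation; ranks 2 and 3 are the two best neighbours
mask = rank.isin([2, 3])
result = df.where(mask)
print(result)
```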
Upvotes: 5
Views: 296
Reputation: 249183
First, use np.argsort()
to find which locations have the highest values:
sort = np.argsort(df)
This gives a DataFrame whose column names are meaningless, but the second and third columns from the right contain the desired indices within each row:
316 320 359 370 910
userId
316 4 3 1 2 0
320 3 4 0 2 1
359 4 0 1 3 2
370 1 0 4 2 3
910 1 0 2 3 4
Next, construct a boolean mask, set to True at the locations above:
mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort.iloc[:,-2]] = True
mask[rows, sort.iloc[:,-3]] = True
Now you have the mask you need:
array([[False, True, True, False, False],
[ True, False, True, False, False],
[False, True, False, True, False],
[False, False, True, False, True],
[False, False, True, True, False]], dtype=bool)
Finally, df.where(mask)
:
316 320 359 370 910
userId
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
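The same steps as a self-contained sketch. One caveat: np.argsort(df) returned a DataFrame under the pandas version used in the answer; to stay version-proof, this sketch argsorts the underlying ndarray instead (my assumption; the masking logic is unchanged):

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
vals = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
df = pd.DataFrame(vals, index=pd.Index(ids, name="userId"),
                  columns=pd.Index(ids, name="userId"))

# per row, ascending positions: the last entry locates the 1.0 diagonal,
# the two before it locate the top-2 correlations
sort = np.argsort(df.to_numpy(), axis=1)

mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort[:, -2]] = True
mask[rows, sort[:, -3]] = True

result = df.where(mask)
print(result)
```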
Upvotes: 4
Reputation: 4427
Here is my numpy-esque solution:
top_k = 3
top_corr = corr_of_user.copy()
top_ndarray = top_corr.values
np.fill_diagonal(top_ndarray, np.NaN)
rows = np.arange(top_corr.shape[0])[:, np.newaxis]
columns = top_ndarray.argsort()[:, :-top_k]
top_ndarray[rows, columns] = np.NaN
top_corr
And we get
userId 316 320 359 370 910
userId
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
You can replace top_corr = corr_of_user.copy()
with top_corr = corr_of_user
if you don't want a copy but rather an in-place solution.
The idea is pretty much the same as John Zwinck's: get the indices of the necessary fields and use them to index into the array, clearing the values we don't need. A slight advantage of my solution is that K (the number of top results we want) is a parameter rather than hardcoded. It also works when corr_of_user
contains all 1
s.
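A runnable version of this answer. Since mutating a DataFrame through .values is no longer guaranteed to work on newer pandas (copy-on-write), this sketch operates on an explicit ndarray copy and wraps the result back into a frame; that packaging is my assumption, the fill_diagonal/argsort logic is the answer's:

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
vals = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(vals, index=pd.Index(ids, name="userId"),
                            columns=pd.Index(ids, name="userId"))

top_k = 3  # keeps top_k cells per row; the NaN diagonal sorts last,
           # so this leaves the 2 best neighbours visible

a = corr_of_user.to_numpy(copy=True)
np.fill_diagonal(a, np.nan)                 # never select the self-correlation
rows = np.arange(a.shape[0])[:, np.newaxis]
cols = a.argsort()[:, :-top_k]              # everything except the top_k largest
a[rows, cols] = np.nan

top_corr = pd.DataFrame(a, index=corr_of_user.index,
                        columns=corr_of_user.columns)
print(top_corr)
```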
Upvotes: 1
Reputation: 2696
This should work; note that within each group the 1.0 self-correlation takes rank 1, so the two most similar users sit at ranks 2 and 3:
melted_corr['group_rank']=melted_corr.groupby('userId')['value']\
.rank(ascending=False)
then select the top-2 per user with:
melted_corr[melted_corr.group_rank.between(2, 3)]
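A self-contained sketch of this groupby-rank route, starting from the question's melted frame; because the 1.0 self-correlation always ranks first within its group, the filter keeps ranks 2 and 3:

```python
import numpy as np
import pandas as pd

ids = [316, 320, 359, 370, 910]
vals = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(vals, index=pd.Index(ids, name="userId"),
                            columns=pd.Index(ids, name="userId"))
melted_corr = corr_of_user.reset_index().melt(id_vars="userId",
                                              var_name="corr_user")

# rank values within each row of the original matrix (one userId group
# per row); the 1.0 self-correlation always takes rank 1
melted_corr["group_rank"] = (melted_corr.groupby("userId")["value"]
                             .rank(ascending=False, method="first"))

# ranks 2 and 3 are the two most similar other users
top2 = melted_corr[melted_corr.group_rank.between(2, 3)]
print(top2)
```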
Upvotes: 1