Reputation: 121
I have a problem with my result.
From my correlation matrix:
dataCorr = data.corr(method='pearson')
I convert this matrix to columns:
dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()
Then I remove the diagonal of the matrix:
dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]
But I still have duplicate pairs:
level_0 level_1 0
LiftPushSpeed RT1EntranceSpeed 0.881714
RT1EntranceSpeed LiftPushSpeed 0.881714
How can I avoid this problem?
Upvotes: 5
Views: 7622
Reputation: 91
As of August 2024, the code in the accepted answer gives two errors:

1. np.bool is deprecated.
2. pandas now internally tries to invert the mask, ~np.tril(np.ones(dataCorr.shape)), which raises: TypeError: ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
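For reference, a minimal sketch that reproduces the second error outside of pandas (the array shape here is arbitrary):

import numpy as np

cond = np.tril(np.ones((3, 3)))  # float array of 0.0/1.0, not booleans
# mask() needs the inverted condition, but bitwise ~ is undefined for floats:
~cond  # raises TypeError: ufunc 'invert' not supported for the input types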
I found that simply passing dtype=bool to np.ones() makes it work again:
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape, dtype=bool)))
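A minimal, self-contained sketch of the fixed version (the example frame and its column names are just for illustration):

import numpy as np
import pandas as pd

# hypothetical example frame, only to demonstrate the fix
data = pd.DataFrame(np.random.randn(8, 3), columns=list('ABC'))
dataCorr = data.corr(method='pearson')

# dtype=bool builds the mask as booleans up front, so pandas can
# invert it internally without hitting the 'invert' TypeError
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape, dtype=bool)))
print(dataCorr)  # lower triangle and diagonal are now NaN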
For some additional info: use df.where() instead of df.mask() if you don't want pandas to implicitly invert your condition. With df.where(), just use np.triu() instead of np.tril(). Though this would matter only if you want to plot a heatmap based on the correlation.
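A minimal sketch of that variant, assuming the same illustrative frame as above (note the k=1 offset so the diagonal of 1.0s is dropped too, matching the mask() result):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(8, 3), columns=list('ABC'))  # illustrative frame
dataCorr = data.corr(method='pearson')

# where() keeps values where the condition is True, so pass the
# strict upper triangle directly instead of masking the lower one
upper = dataCorr.where(np.triu(np.ones(dataCorr.shape, dtype=bool), k=1))
print(upper)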
Upvotes: 0
Reputation: 862581
You can convert the lower triangle of values to NaNs, and stack will then remove them:
import numpy as np
import pandas as pd

np.random.seed(12)
data = pd.DataFrame(np.random.randint(20, size=(5,6)))
print (data)
0 1 2 3 4 5
0 11 6 17 2 3 3
1 12 16 17 5 13 2
2 11 10 0 8 12 13
3 18 3 4 3 1 0
4 18 18 16 6 13 9
dataCorr = data.corr(method='pearson')
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(np.bool))
print (dataCorr)
0 1 2 3 4 5
0 NaN 0.042609 -0.041656 -0.113998 -0.173011 -0.201122
1 NaN NaN 0.486901 0.567216 0.914260 0.403469
2 NaN NaN NaN -0.412853 0.157747 -0.354012
3 NaN NaN NaN NaN 0.823628 0.858918
4 NaN NaN NaN NaN NaN 0.635730
5 NaN NaN NaN NaN NaN NaN
# in your data, change 0.5 to 0.7
dataCorr = dataCorr[abs(dataCorr) >= 0.5].stack().reset_index()
print (dataCorr)
level_0 level_1 0
0 1 3 0.567216
1 1 4 0.914260
2 3 4 0.823628
3 3 5 0.858918
4 4 5 0.635730
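If you want friendlier output afterwards, a possible follow-up (the column names are purely illustrative; sort_values with key= needs pandas >= 1.1):

# name the stacked columns and order the pairs by correlation strength
dataCorr.columns = ['var1', 'var2', 'corr']
dataCorr = dataCorr.sort_values('corr', key=abs, ascending=False)
print(dataCorr)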
Detail:
print (np.tril(np.ones(dataCorr.shape)))
[[ 1. 0. 0. 0. 0. 0.]
[ 1. 1. 0. 0. 0. 0.]
[ 1. 1. 1. 0. 0. 0.]
[ 1. 1. 1. 1. 0. 0.]
[ 1. 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 1. 1.]]
Upvotes: 11
Reputation: 9081
Although you have removed the diagonal elements, I am afraid that's all your code is going to do at the moment.
In order to tackle the duplicate problem, I have concatenated the two columns after sorting them, filtered out the duplicates, and then removed the concatenated helper column.
Here is a complete example -
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
dataCorr = data.corr(method='pearson')
dataCorr = dataCorr[abs(dataCorr) >= 0.01].stack().reset_index()
dataCorr = dataCorr[dataCorr['level_0'].astype(str)!=dataCorr['level_1'].astype(str)]
# filtering out lower/upper triangular duplicates
dataCorr['ordered-cols'] = dataCorr.apply(lambda x: '-'.join(sorted([x['level_0'],x['level_1']])),axis=1)
dataCorr = dataCorr.drop_duplicates(['ordered-cols'])
dataCorr.drop(['ordered-cols'], axis=1, inplace=True)
print(dataCorr)
Upvotes: 2