Reputation: 121
I have a problem with my result.
From my correlation matrix:
dataCorr = data.corr(method='pearson')
I convert this matrix to columns:
dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()
Then I remove the diagonal of the matrix:
dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]
But I still have duplicate pairs:
level_0 level_1 0
LiftPushSpeed RT1EntranceSpeed 0.881714
RT1EntranceSpeed LiftPushSpeed 0.881714
How can I avoid this problem?
Upvotes: 5
Views: 7622
Reputation: 91
As of August 2024, the code in the accepted answer gives two errors:

1. np.bool is deprecated.
2. pandas now internally tries to invert the mask, ~np.tril(np.ones(dataCorr.shape)), which raises: TypeError: ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
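For reference, a minimal sketch that reproduces the second error outside of pandas (the array shape here is arbitrary):

import numpy as np

cond = np.tril(np.ones((3, 3)))  # float array of 0.0/1.0, not booleans
# mask() needs the inverted condition, but bitwise ~ is undefined for floats:
~cond  # raises TypeError: ufunc 'invert' not supported for the input types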
I found that simply passing dtype=bool to np.ones() makes it work again:
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape, dtype=bool)))
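A minimal, self-contained sketch of the fixed version (the example frame and its column names are just for illustration):

import numpy as np
import pandas as pd

# hypothetical example frame, only to demonstrate the fix
data = pd.DataFrame(np.random.randn(8, 3), columns=list('ABC'))
dataCorr = data.corr(method='pearson')

# dtype=bool builds the mask as booleans up front, so pandas can
# invert it internally without hitting the 'invert' TypeError
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape, dtype=bool)))
print(dataCorr)  # lower triangle and diagonal are now NaN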
For some additional info: use df.where() instead of df.mask() if you don't want pandas to implicitly invert your condition. With df.where(), just use np.triu() instead of np.tril(). Though this would matter only if you want to plot a heatmap based on the correlation.
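A minimal sketch of that variant, assuming the same illustrative frame as above (note the k=1 offset so the diagonal of 1.0s is dropped too, matching the mask() result):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(8, 3), columns=list('ABC'))  # illustrative frame
dataCorr = data.corr(method='pearson')

# where() keeps values where the condition is True, so pass the
# strict upper triangle directly instead of masking the lower one
upper = dataCorr.where(np.triu(np.ones(dataCorr.shape, dtype=bool), k=1))
print(upper)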
Upvotes: 0
Reputation: 862581
You can convert the lower triangle of values to NaNs, and stack will then remove them:
import numpy as np
import pandas as pd

np.random.seed(12)
data = pd.DataFrame(np.random.randint(20, size=(5,6)))
print (data)
0 1 2 3 4 5
0 11 6 17 2 3 3
1 12 16 17 5 13 2
2 11 10 0 8 12 13
3 18 3 4 3 1 0
4 18 18 16 6 13 9
dataCorr = data.corr(method='pearson')
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(np.bool))
print (dataCorr)
0 1 2 3 4 5
0 NaN 0.042609 -0.041656 -0.113998 -0.173011 -0.201122
1 NaN NaN 0.486901 0.567216 0.914260 0.403469
2 NaN NaN NaN -0.412853 0.157747 -0.354012
3 NaN NaN NaN NaN 0.823628 0.858918
4 NaN NaN NaN NaN NaN 0.635730
5 NaN NaN NaN NaN NaN NaN
# in your data, change 0.5 to 0.7
dataCorr = dataCorr[abs(dataCorr) >= 0.5].stack().reset_index()
print (dataCorr)
level_0 level_1 0
0 1 3 0.567216
1 1 4 0.914260
2 3 4 0.823628
3 3 5 0.858918
4 4 5 0.635730
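If you want friendlier output afterwards, a possible follow-up (the column names are purely illustrative; sort_values with key= needs pandas >= 1.1):

# name the stacked columns and order the pairs by correlation strength
dataCorr.columns = ['var1', 'var2', 'corr']
dataCorr = dataCorr.sort_values('corr', key=abs, ascending=False)
print(dataCorr)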
Detail:
print (np.tril(np.ones(dataCorr.shape)))
[[ 1. 0. 0. 0. 0. 0.]
[ 1. 1. 0. 0. 0. 0.]
[ 1. 1. 1. 0. 0. 0.]
[ 1. 1. 1. 1. 0. 0.]
[ 1. 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 1. 1.]]
Upvotes: 11
Reputation: 9081
Although you have removed the diagonal elements, I am afraid that's all your code is going to do at the moment.
In order to tackle the duplicate problem, I have concatenated the two columns after sorting them, filtered out the duplicates, and then removed the concatenated helper column.
Here is a complete example -
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
dataCorr = data.corr(method='pearson')
dataCorr = dataCorr[abs(dataCorr) >= 0.01].stack().reset_index()
dataCorr = dataCorr[dataCorr['level_0'].astype(str)!=dataCorr['level_1'].astype(str)]
# filtering out lower/upper triangular duplicates
dataCorr['ordered-cols'] = dataCorr.apply(lambda x: '-'.join(sorted([x['level_0'],x['level_1']])),axis=1)
dataCorr = dataCorr.drop_duplicates(['ordered-cols'])
dataCorr.drop(['ordered-cols'], axis=1, inplace=True)
print(dataCorr)
Upvotes: 2