Reputation: 4482
I have a correlation matrix, which is a pandas DataFrame that looks like this:
import pandas as pd
foo = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                    'col_a': [1, 0.9, 0.04, 0.03],
                    'col_b': [0.9, 1, 0.05, 0.03],
                    'col_c': [0.04, 0.05, 1, -0.04],
                    'col_d': [0.03, 0.03, -0.04, 1]})
I would like to get all the unique "pairs" whose correlation, in absolute value, is above a certain threshold, excluding self-correlation.
So, if the threshold is 0.8, I should get something like this:
[('col_a', 'col_b')]
Any ideas how I could do that?
Upvotes: 3
Views: 2545
Reputation: 1
You can use a loop. First, set the vars column as the index. foo is already a correlation matrix, so there is no need to call .corr() on it again.
foo = foo.set_index('vars')
Then loop over the matrix and keep the correlations between 0.8 and 0.99 (the upper bound excludes self-correlation):
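a = []  # column pairs
b = []  # their correlations
for i in foo.columns:
    for ii in range(len(foo.columns)):
        # Compare the absolute value so strong negative correlations are kept too.
        if 0.8 < abs(foo[i].iloc[ii]) < 0.99:
            a.append((i, foo.columns[ii]))
            b.append(foo[i].iloc[ii])
# Note: each pair appears twice, once as (x, y) and once as (y, x).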
The lists a and b then hold the matching features and their correlations.
If you want to visualize the result:
mask = (foo.abs() > 0.8) & (foo.abs() < 0.99)
# Keep True where the mask holds, drop all-empty rows and columns, show '-' elsewhere.
mask.where(mask).dropna(how='all', axis=1).dropna(how='all').fillna('-')
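As a rough sketch of the same loop idea, itertools.combinations can visit each unordered pair of columns exactly once, so neither self-correlations nor mirrored duplicates show up. This assumes the matrix from the question with 'vars' set as the index; the names corr, threshold and high_pairs are illustrative and not part of the answer above.

```python
from itertools import combinations

import pandas as pd

# Correlation matrix from the question, with 'vars' as the row index.
corr = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                     'col_a': [1, 0.9, 0.04, 0.03],
                     'col_b': [0.9, 1, 0.05, 0.03],
                     'col_c': [0.04, 0.05, 1, -0.04],
                     'col_d': [0.03, 0.03, -0.04, 1]}).set_index('vars')

threshold = 0.8  # illustrative cutoff

# combinations() yields each unordered pair of distinct columns once,
# so (x, x) and the mirrored (y, x) never appear.
high_pairs = [(r, c) for r, c in combinations(corr.columns, 2)
              if abs(corr.loc[r, c]) > threshold]
print(high_pairs)
```

For wide matrices a vectorized approach will likely be faster, but the loop keeps the logic easy to follow.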
Upvotes: 0
Reputation: 260455
You can set 'vars' as the index, stack, and use the output for slicing:
pairs = foo.set_index('vars').stack()
pairs[pairs.abs().gt(0.8)]
output:
vars
col_a  col_a    1.0
       col_b    0.9
col_b  col_a    0.9
       col_b    1.0
col_c  col_c    1.0
col_d  col_d    1.0
As list:
pairs = foo.set_index('vars').stack()
list(pairs[pairs.abs().gt(0.8)].index)
[('col_a', 'col_a'), ('col_a', 'col_b'), ('col_b', 'col_a'), ('col_b', 'col_b'), ('col_c', 'col_c'), ('col_d', 'col_d')]
If you want only the unique pairs (e.g., B vs A == A vs B) and no self-correlation (e.g., A vs A), use this alternative.
np.triu keeps only one of the triangles of the correlation matrix, and its k parameter shifts the diagonal: k=0 keeps the diagonal and k=1 removes it, so use k=0 if you want to keep self-correlation.
import numpy as np
foo = foo.set_index('vars')
pairs = foo.where(np.triu(foo, k=1).astype(bool)).stack()
list(pairs[pairs.abs().gt(0.8)].index)
output:
[('col_a', 'col_b')]
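The same idea can be wrapped in a small helper. This is only a sketch: the function name correlated_pairs and its keep_self flag are hypothetical, and it builds the triangle mask from an array of ones rather than from the matrix values, so an exactly-zero correlation is not silently masked out (harmless for a positive threshold, but it keeps the mask independent of the data).

```python
import numpy as np
import pandas as pd


def correlated_pairs(corr, threshold, keep_self=False):
    """Return unique (row, col) label pairs with |correlation| above threshold.

    Hypothetical helper; `corr` is a square DataFrame whose index and
    columns carry the same labels.
    """
    k = 0 if keep_self else 1  # k=1 drops the diagonal (self-correlation)
    # Boolean upper-triangle mask that does not depend on the matrix values.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=k)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) compare as False and are filtered out here.
    return list(pairs[pairs.abs().gt(threshold)].index)


foo = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                    'col_a': [1, 0.9, 0.04, 0.03],
                    'col_b': [0.9, 1, 0.05, 0.03],
                    'col_c': [0.04, 0.05, 1, -0.04],
                    'col_d': [0.03, 0.03, -0.04, 1]}).set_index('vars')

print(correlated_pairs(foo, 0.8))
```

Passing keep_self=True corresponds to k=0 and also returns the diagonal pairs.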
Upvotes: 3