quant

Reputation: 4482

How to select the pairs of features that have correlation higher than a threshold in pandas?

I have a correlation matrix which is a pandas dataframe that looks like this:

import pandas as pd
foo = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                   'col_a': [1, 0.9, 0.04, 0.03],
                   'col_b': [0.9,1,0.05,0.03],
                   'col_c': [0.04, 0.05, 1, -0.04],
                   'col_d': [0.03, 0.03, -0.04,1]})

I would like to get all the unique pairs whose correlation, in absolute value, is above a certain threshold, excluding self-correlation.

So, if the threshold is 0.8, I should get something like this:

[('col_a', 'col_b')]

Any ideas on how I could do that?

Upvotes: 3

Views: 2545

Answers (2)

Navarra B

Reputation: 1

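You can use a loop. Since foo is already a correlation matrix, first move the vars column to the index instead of calling .corr() again (that would compute correlations of the correlations):

foo = foo.set_index('vars')

Then loop over each pair of columns once and keep the pairs whose absolute correlation is above 0.8. Starting the inner loop at the next column skips self-correlation and mirrored pairs:

pairs = []
corrs = []
cols = list(foo.columns)
for i, col_i in enumerate(cols):
    for col_j in cols[i + 1:]:
        value = foo.loc[col_i, col_j]
        if abs(value) > 0.8:
            pairs.append((col_i, col_j))
            corrs.append(value)

The pairs list holds the feature pairs and the corrs list holds their correlations.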

Then, if you want to visualize which pairs exceed the threshold:

mask = (foo.abs() > 0.8) & (foo.abs() < 1)
mask.where(mask).dropna(how='all', axis=1).dropna(how='all').fillna('-')

Upvotes: 0

mozway

Reputation: 260455

You can set 'vars' as index, stack and use the output for slicing:

pairs = foo.set_index('vars').stack()
pairs[pairs.abs().gt(0.8)]

output:

vars        
col_a  col_a    1.0
       col_b    0.9
col_b  col_a    0.9
       col_b    1.0
col_c  col_c    1.0
col_d  col_d    1.0

As list:

pairs = foo.set_index('vars').stack()
list(pairs[pairs.abs().gt(0.8)].index)

output:

[('col_a', 'col_a'), ('col_a', 'col_b'), ('col_b', 'col_a'), ('col_b', 'col_b'), ('col_c', 'col_c'), ('col_d', 'col_d')]

If you want to get only the unique pairs (e.g., B vs A == A vs B) and drop self correlation (e.g., A vs A), use this alternative.

np.triu keeps only the upper triangle of the correlation matrix, and the k parameter shifts the diagonal: k=0 keeps the diagonal, k=1 excludes it. So if you want to keep self correlation, use k=0.
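As a quick illustration of what k does, here is a minimal sketch on a made-up 3x3 array (the array and variable names are only for illustration, not the question's data):

```python
import numpy as np

# Toy 3x3 array, purely illustrative (not the question's data)
m = np.arange(1, 10).reshape(3, 3)

# k=0 keeps the diagonal and everything above it
upper_with_diag = np.triu(m, k=0)

# k=1 zeros out the diagonal and keeps only what is strictly above it
upper_strict = np.triu(m, k=1)
```

Everything that is not kept is set to 0, which is why the result is used below as a mask rather than as the values themselves.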

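import numpy as np
foo = foo.set_index('vars')
# boolean mask of the strict upper triangle (k=1 excludes the diagonal);
# built from ones so that correlations equal to 0 are not masked by accident
mask = np.triu(np.ones(foo.shape, dtype=bool), k=1)
pairs = foo.where(mask).stack()
list(pairs[pairs.abs().gt(0.8)].index)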

output:

[('col_a', 'col_b')]
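If you need this for several thresholds, the same idea can be wrapped in a small function. This is only a sketch: high_corr_pairs is a hypothetical helper name, and it assumes the correlation matrix is square with the feature names as both index and columns.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(corr, threshold=0.8):
    """Hypothetical helper: unique off-diagonal pairs with |corr| > threshold."""
    # Strict upper triangle as a boolean mask, so each pair appears once
    # and the diagonal (self correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Masked cells are NaN; NaN never passes the comparison, so they drop out here.
    return list(pairs[pairs.abs() > threshold].index)


# Same data as in the question, with 'vars' moved to the index
corr = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                     'col_a': [1, 0.9, 0.04, 0.03],
                     'col_b': [0.9, 1, 0.05, 0.03],
                     'col_c': [0.04, 0.05, 1, -0.04],
                     'col_d': [0.03, 0.03, -0.04, 1]}).set_index('vars')

result = high_corr_pairs(corr, threshold=0.8)
```

Because the comparison is done on the absolute value, strongly negative correlations are returned as well.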

Upvotes: 3
