How to select the pairs of features that have correlation higher than a threshold in pandas?

Question

I have a correlation matrix which is a pandas dataframe that looks like this:

import pandas as pd
foo = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                   'col_a': [1, 0.9, 0.04, 0.03],
                   'col_b': [0.9, 1, 0.05, 0.03],
                   'col_c': [0.04, 0.05, 1, -0.04],
                   'col_d': [0.03, 0.03, -0.04, 1]})

I would like to get all the unique pairs whose correlation, in absolute value, is above a certain threshold, excluding self-correlation.

So, if the threshold is 0.8, I should get something like this:

[('col_a', 'col_b')]

Any ideas how I could do that?

mozway · Accepted Answer

You can set 'vars' as the index, stack the matrix into a Series indexed by (row, column) pairs, and filter that Series with a boolean mask:

pairs = foo.set_index('vars').stack()
pairs[pairs.abs().gt(0.8)]

output:

vars        
col_a  col_a    1.0
       col_b    0.9
col_b  col_a    0.9
       col_b    1.0
col_c  col_c    1.0
col_d  col_d    1.0

As list:

pairs = foo.set_index('vars').stack()
list(pairs[pairs.abs().gt(0.8)].index)

[('col_a', 'col_a'), ('col_a', 'col_b'), ('col_b', 'col_a'), ('col_b', 'col_b'), ('col_c', 'col_c'), ('col_d', 'col_d')]

If you want to get only the unique pairs (e.g., B vs A == A vs B) and drop self correlation (e.g., A vs A), use this alternative.

np.triu keeps only the upper triangle of the correlation matrix, and its k parameter shifts the diagonal: k=0 keeps the diagonal, while k=1 excludes it. So if you want to keep self-correlation, use k=0.

import numpy as np
foo = foo.set_index('vars')
pairs = foo.where(np.triu(foo, k=1).astype(bool)).stack()
list(pairs[pairs.abs().gt(0.8)].index)

output:

[('col_a', 'col_b')]
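
If you need this more than once, the same idea can be wrapped in a small helper. This is only a sketch: the name high_corr_pairs is hypothetical, and it assumes the correlation matrix is square with the variable names already set as both index and columns. It builds the upper-triangle mask from an array of ones rather than from the correlation values themselves, so the mask does not depend on what the matrix contains, and it reads the positions straight from NumPy instead of going through stack.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(corr, threshold):
    """Return unique (row, column) label pairs with |correlation| > threshold.

    Hypothetical helper; assumes `corr` is a square DataFrame whose index
    and columns hold the same variable names in the same order.
    """
    # True strictly above the diagonal: drops self-pairs and mirrored pairs.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # True where the absolute correlation exceeds the threshold.
    strong = corr.abs().to_numpy() > threshold
    rows, cols = np.nonzero(upper & strong)
    return [(corr.index[i], corr.columns[j]) for i, j in zip(rows, cols)]


corr = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
                     'col_a': [1, 0.9, 0.04, 0.03],
                     'col_b': [0.9, 1, 0.05, 0.03],
                     'col_c': [0.04, 0.05, 1, -0.04],
                     'col_d': [0.03, 0.03, -0.04, 1]}).set_index('vars')

result = high_corr_pairs(corr, 0.8)
print(result)  # [('col_a', 'col_b')]
```

Because the comparison is done on absolute values, a strongly negative correlation (for example -0.9) would be returned as well.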
