Daniel

Reputation: 37

Correlation Matrix: Extract Variables with High R Values

How can I get output that lists only the pairs of variables whose correlation is greater than 0.7 in absolute value?

I would like output similar to this:

four: one, three
one: three

Thanks for your time!

Code

import pandas as pd

x = {'one': [1, 2, 3, 4], 'two': [3, 5, 7, 5], 'three': [2, 3, 4, 9], 'four': [4, 3, 1, 0]}
y = pd.DataFrame(x)
print(y.corr())

Output

           four       one     three       two
four   1.000000 -0.989949 -0.880830 -0.670820
one   -0.989949  1.000000  0.913500  0.632456
three -0.880830  0.913500  1.000000  0.262613
two   -0.670820  0.632456  0.262613  1.000000

Upvotes: 2

Views: 2392

Answers (2)

scottlittle

Reputation: 20822

This works for me:

corr = y.corr().unstack().reset_index()  # one row per pair of variables
corr.columns = ['var1', 'var2', 'corr']  # rename columns to something readable
print(corr[corr['corr'].abs() > 0.7])  # keep correlations above 0.7 in absolute value

You could also exclude each variable's correlation with itself (always 1) by changing the last line to

print(corr[(corr['corr'].abs() > 0.7) & (corr['var1'] != corr['var2'])])
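
To get the grouped layout the question asks for, one option is to sort the variable names and report each pair only once. The following is a minimal sketch rather than part of the answer above; the helper name high_corr_pairs and its threshold parameter are made up for illustration.

```python
import pandas as pd

x = {'one': [1, 2, 3, 4], 'two': [3, 5, 7, 5], 'three': [2, 3, 4, 9], 'four': [4, 3, 1, 0]}
y = pd.DataFrame(x)


def high_corr_pairs(df, threshold=0.7):
    """Map each variable to the later (alphabetically) variables it correlates with.

    Hypothetical helper: a pair is kept when its absolute correlation
    exceeds `threshold`, and each pair is reported only once.
    """
    corr = df.corr().abs()
    names = sorted(corr.columns)
    result = {}
    for i, a in enumerate(names):
        # Only look at names after `a`, which skips self-pairs and mirrored duplicates.
        partners = [b for b in names[i + 1:] if corr.loc[a, b] > threshold]
        if partners:
            result[a] = partners
    return result


for name, partners in high_corr_pairs(y).items():
    print(f"{name}: {', '.join(partners)}")
```

With the sample data from the question, this should print "four: one, three" followed by "one: three".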

Upvotes: 1

juanpa.arrivillaga

Reputation: 95907

If all you want is to print it out, this will work:

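corr = y.corr()  # compute the correlation matrix once
col_names = corr.columns.values

# items() replaces iteritems(), which was removed in pandas 2.0
for col, row in (corr.abs() > 0.7).items():
    print(col, col_names[row.values])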

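Note that this works, but it might be slow on a large matrix because the loop runs in Python and yields each column as a Series. It also lists each variable alongside itself, since the diagonal of the matrix is always 1.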
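
If the Python-level loop ever becomes a bottleneck, the mask can be built with NumPy instead. This is a rough sketch that has not been benchmarked, assuming the same y as in the question; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

x = {'one': [1, 2, 3, 4], 'two': [3, 5, 7, 5], 'three': [2, 3, 4, 9], 'four': [4, 3, 1, 0]}
y = pd.DataFrame(x)

corr = y.corr()
values = corr.to_numpy()

# Boolean mask of strong correlations; k=1 keeps only entries above the
# diagonal, so self-correlations and mirrored duplicates are dropped.
mask = np.triu(np.abs(values) > 0.7, k=1)
rows, cols = np.where(mask)

# Collect (variable, variable, correlation) for every surviving entry.
pairs = [(corr.index[r], corr.columns[c], values[r, c]) for r, c in zip(rows, cols)]
for var1, var2, r in pairs:
    print(f"{var1}, {var2}: {r:.3f}")
```

Each pair appears once, in whatever column order the DataFrame has, so the grouping differs from the loop above.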

Upvotes: 2
