Saginus

Reputation: 159

Given a column, find the variable most highly correlated with it

As the title indicates, I have a dataframe named df.

Given a variable (a specified column of df), I want to find the column with the highest correlation value with that variable.

Here's what I have tried so far:

def highest_correlated(df, column):
    sol = -1
    for col in df.columns:
        while col != column:
            corr = df[column].corr(df[col])
            if corr > sol:
                sol = corr
    return sol
      

The problem is that it takes too much time, and in the end I don't get any result. Can anyone help me find a solution?

Upvotes: 1

Views: 94

Answers (1)

Michael Szczesny

Reputation: 5036

A small example to show the concept:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((5,5)), columns=list('abcde'))
df
          a         b         c         d         e
0  0.813973  0.948999  0.291432  0.081816  0.590892
1  0.117661  0.371609  0.420920  0.007232  0.596047
2  0.285615  0.840326  0.261307  0.839936  0.050935
3  0.215191  0.236140  0.588104  0.718885  0.047986
4  0.363681  0.280523  0.249036  0.712143  0.463029

Now compute the correlation of every column with 'a':

df.corr()['a']
a    1.000000
b    0.686173
c   -0.464374
d   -0.297666
e    0.385181

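Excluding column 'a' itself (here it is the first row, so it can be sliced off), the column with the highest absolute correlation is: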

df.corr()['a'][1:].abs().idxmax()
'b'

If you can't arrange the columns so that the target comes first, drop it by label instead:

df.corr()['a'].drop('a').abs().idxmax()
'b'
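
As for the loop in the question: `while col != column` never terminates, because neither `col` nor `column` changes inside the loop body, so an `if` was presumably intended. Even with that fixed, the function would return the correlation value rather than the column name. Here is a minimal sketch of the same idea wrapped into a function, assuming all columns of `df` are numeric and that "highest" means highest in absolute value (drop the `.abs()` call if only positive correlation should count):

```python
import pandas as pd


def highest_correlated(df, column):
    """Return the name of the column most strongly correlated with `column`."""
    # Correlations of every column with the target column.
    corr = df.corr()[column]
    # Remove the target itself (its self-correlation is always 1.0),
    # then pick the label with the largest absolute correlation.
    return corr.drop(column).abs().idxmax()


# Small hand-made example: 'b' rises almost linearly with 'a'.
example = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
    'd': [1, 0, 1, 0, 1],
})
print(highest_correlated(example, 'a'))
```

Note that `df.corr()` computes the full correlation matrix, which is more work than strictly needed for a single target column, but it keeps the code short.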

Upvotes: 1
