Reputation: 159
As the title indicates, I have a DataFrame named df.
Given a variable (a specified column of df), I want to find the column that has the highest correlation with that variable.
Here's what I've tried so far:
def highest_correlated(df, column):
    sol = -1
    for col in df.columns:
        while col != column:
            corr = df[column].corr(df[col])
            if corr > sol:
                sol = corr
    return sol
The problem is that it takes far too long to run, and in the end I never get a result. Can anyone help me find a solution?
Upvotes: 1
Views: 94
Reputation: 5036
A small example to show the concept:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((5,5)), columns=list('abcde'))
df
          a         b         c         d         e
0  0.813973  0.948999  0.291432  0.081816  0.590892
1  0.117661  0.371609  0.420920  0.007232  0.596047
2  0.285615  0.840326  0.261307  0.839936  0.050935
3  0.215191  0.236140  0.588104  0.718885  0.047986
4  0.363681  0.280523  0.249036  0.712143  0.463029
Now compute the correlation of every column with column 'a':
df.corr()['a']
a 1.000000
b 0.686173
c -0.464374
d -0.297666
e 0.385181
Excluding column 'a' itself, and taking absolute values so that strong negative correlations also count, we get
df.corr()['a'][1:].abs().idxmax()
'b'
If the target column isn't the first one, so it can't simply be sliced off by position, drop it by label instead:
df.corr()['a'].drop('a').abs().idxmax()
'b'
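As for why the loop in the question never finishes: `while col != column` does not change `col` inside its body, so it spins forever on the first column that differs from `column`. Below is a minimal sketch of two alternatives, assuming numeric columns and that "highest" means largest absolute correlation, as in the snippets above. The name `highest_correlated_loop` and the sample data are only illustrative. The second function uses `DataFrame.corrwith`, which correlates each remaining column with the target directly instead of building the full correlation matrix.

```python
import numpy as np
import pandas as pd


def highest_correlated_loop(df, column):
    """Loop version of the question's function, with `while` replaced by `if`."""
    best_col, best_corr = None, -1.0
    for col in df.columns:
        if col != column:  # `if`, not `while`: col never changes inside the body
            corr = abs(df[column].corr(df[col]))
            if corr > best_corr:
                best_col, best_corr = col, corr
    return best_col, best_corr


def highest_correlated(df, column):
    """Vectorized version: correlate every other column with `column` at once."""
    corrs = df.drop(columns=column).corrwith(df[column]).abs()
    best_col = corrs.idxmax()
    return best_col, corrs[best_col]


# Illustrative data only; any DataFrame with numeric columns should work.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 5)), columns=list("abcde"))
print(highest_correlated_loop(df, "a"))
print(highest_correlated(df, "a"))
```

Both functions return the column name together with its absolute correlation. Drop the `abs` calls if only positive correlation should count.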
Upvotes: 1