Reputation: 861
I realize there are questions already posted in the past on SO that are similar to my question here. They don't, however, give me what I want.
I have a large dataframe, data
, with 314 columns. I would like to calculate the correlation values between all pairs of only the first 30 columns of my dataframe. After that, I would like to report the top 5 correlated pairs, regardless of sign, so in terms of magnitude of the correlation coef. I realize I should use .corr()
because some dataframe cells are empty, and we don't want to include them.
Here's what I have so far. I am still working on it. Don't know the type for cc
either, which is why I didn't get as far as to report the top 5 values:
W = 30 # taking the first 30 columns
cc = np.zeros((1,W)) # pre-allocation for coefficients
for c in range(1:W) in data:
tmp = data.corr(data(:,c0),data(:,c));
cc(c) = tmp(1,2);
Here are the first 15 rows and 5 columns of the data frame:
Group Age Gender Weight Height
0 1 50 1 224 73.533514
1 1 59 0 180 62.625479
2 1 22 0 167 62.253894
3 1 48 0 113 61.476092
4 1 53 1 166 70.076665
5 1 48 1 210 71.384046
6 1 29 0 140 61.438960
7 1 44 1 181 74.992675
8 1 28 0 98 60.145635
9 1 42 1 187 71.588029
10 1 35 0 199 66.773644
11 0 54 1 228 76.971180
12 0 43 0 145 67.586941
13 1 50 0 190 67.229118
14 1 62 0 281 63.645601
Upvotes: 1
Views: 1430
Reputation: 59519
Okay, this should work. The first part gives you the absolute correlation matrix of the first 30 columns, and essentially removes the auto-correlations. The next part looks for the five overall maximum correlations by finding the absolute max, marking it down, removing it from the correlation matrix and then moving onto the next. Each element in max_list
will be like (0.8764779791676971, 'Gender', 'Height')
with the abs. correlation and the two columns that give that correlation.
import pandas as pd
import numpy as np
corr = data.iloc[:,0:30].corr().replace(1, np.NaN).abs()
max_list = []
for i in range(0,5):
max_val = max(corr.max())
max_list.append((max_val, corr.columns[np.where(corr == max_val)[0][:]][0],
corr.columns[np.where(corr == max_val)[0][:]][1]))
corr.replace(max_val, np.NaN, inplace=True)
Upvotes: 1