florence-y
florence-y

Reputation: 861

Calculating pairwise correlations of large dataframe among a certain number of columns

I realize there are questions already posted in the past on SO that are similar to my question here. They don't, however, give me what I want.

I have a large dataframe, data, with 314 columns. I would like to calculate the correlation values between all pairs of only the first 30 columns of my dataframe. After that, I would like to report the top 5 correlated pairs, regardless of sign, so in terms of magnitude of the correlation coef. I realize I should use .corr() because some dataframe cells are empty, and we don't want to include them.

Here's what I have so far. I am still working on it. Don't know the type for cc either, which is why I didn't get as far as to report the top 5 values:

W = 30 # taking the first 30 columns
cc = np.zeros((1,W)) # pre-allocation for coefficients

for c in range(1:W) in data:
    tmp = data.corr(data(:,c0),data(:,c));
    cc(c) = tmp(1,2);

Here are the first 15 rows and 5 columns of the data frame:

    Group  Age  Gender  Weight     Height
0       1   50       1     224  73.533514
1       1   59       0     180  62.625479
2       1   22       0     167  62.253894
3       1   48       0     113  61.476092
4       1   53       1     166  70.076665
5       1   48       1     210  71.384046
6       1   29       0     140  61.438960
7       1   44       1     181  74.992675
8       1   28       0      98  60.145635
9       1   42       1     187  71.588029
10      1   35       0     199  66.773644
11      0   54       1     228  76.971180
12      0   43       0     145  67.586941
13      1   50       0     190  67.229118
14      1   62       0     281  63.645601

Upvotes: 1

Views: 1430

Answers (1)

ALollz
ALollz

Reputation: 59519

Okay, this should work. The first part gives you the absolute correlation matrix of the first 30 columns, and essentially removes the auto-correlations. The next part looks for the five overall maximum correlations by finding the absolute max, marking it down, removing it from the correlation matrix and then moving onto the next. Each element in max_list will be like (0.8764779791676971, 'Gender', 'Height') with the abs. correlation and the two columns that give that correlation.

import pandas as pd
import numpy as np
corr = data.iloc[:,0:30].corr().replace(1, np.NaN).abs()

max_list = []
for i in range(0,5):
    max_val = max(corr.max())
    max_list.append((max_val, corr.columns[np.where(corr == max_val)[0][:]][0],
                    corr.columns[np.where(corr == max_val)[0][:]][1]))
    corr.replace(max_val, np.NaN, inplace=True)

Upvotes: 1

Related Questions