Find high correlations in a large coefficient matrix

Question

I have a dataset with 56 numerical features. After loading it into pandas, I can easily generate a correlation coefficient matrix.

However, due to its size, I'd like to find coefficients higher (or lower) than a certain threshold, e.g. >0.8 or <-0.8, and list the corresponding pairs of variables. Is there a way to do it? I figure it would require selecting by value across all columns, then returning, not the row, but the column name and row index of the value, but I have no idea how to do either!

Thanks!

Quang Hoang · Accepted Answer

I think you can use where() and stack(), like this:

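import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.rand(10,3))

coeff = df.corr()

# 0.3 is used for illustration
# replace with your actual value
thresh = 0.3

# this selects |r| below thresh; use .gt(thresh) to select |r| above it
mask = coeff.abs().lt(thresh)
# or, for a signed comparison: mask = coeff < thresh

# where() turns non-matching entries into NaN;
# stack() then returns a Series indexed by (row, column)
coeff.where(mask).stack()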

Output:

0  2   -0.089326
2  0   -0.089326
dtype: float64

For comparison, all off-diagonal coefficients in this example:

0  1    0.319612
   2   -0.089326
1  0    0.319612
   2   -0.687399
2  0   -0.089326
   1   -0.687399
dtype: float64
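
For the case in the question (coefficients above 0.8 or below -0.8), the same where/stack idea can be adapted as in the sketch below. This is only one possible way to do it, not part of the answer above: the helper name high_corr_pairs and the demo columns a, b, c are made up for illustration, and restricting to the upper triangle is an added assumption so that each pair is listed once and the diagonal is skipped.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, thresh=0.8):
    """Return column pairs whose absolute correlation exceeds thresh.

    Hypothetical helper for illustration; not a pandas function.
    """
    coeff = df.corr()
    # Boolean mask for the strict upper triangle (k=1 excludes the diagonal),
    # so each pair shows up once and self-correlations are skipped.
    upper = np.triu(np.ones(coeff.shape, dtype=bool), k=1)
    pairs = coeff.where(upper).stack()
    # Keep strong positive or negative correlations. Any NaN entries
    # compare as False here, so they are filtered out as well.
    return pairs[pairs.abs() > thresh]


# Made-up demo data: "b" is built to track "a" closely, "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

result = high_corr_pairs(demo, thresh=0.8)
print(result)
```

The result is a Series with a two-level index (row label, column label), so list(result.index) gives the pairs of variable names directly.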
