Reputation: 2458
I have a dataset with 56 numerical features. Loading it to pandas
, I can easily generate a correlation coefficients matrix.
However, due to its size, I'd like to find coefficients higher (or lower) than a certain threshold, e.g. >0.8 or <-0.8, and list the corresponding pairs of variables. Is there a way to do it? I figure it would require selecting by value across all columns, then returning, not the row, but the column name and row index of the value, but I have no idea how to do either!
Thanks!
Upvotes: 2
Views: 4174
Reputation: 3898
This approach will work if you're looking to also deduplicate the correlation results.
thresh = 0.8
# get correlation matrix
df_corr = df.corr().abs().unstack()
# filter
df_corr_filt = df_corr[(df_corr>thresh) | (df_corr<-thresh)].reset_index()
# deduplicate
df_corr_filt.iloc[df_corr_filt[['level_0','level_1']].apply(lambda r: ''.join(map(str, sorted(r))), axis = 1).drop_duplicates().index]
Upvotes: 0
Reputation: 150745
I think you can do where
and stack()
: this:
np.random.seed(1)
df = pd.DataFrame(np.random.rand(10,3))
coeff = df.corr()
# 0.3 is used for illustration
# replace with your actual value
thresh = 0.3
mask = coeff.abs().lt(thresh)
# or mask = coeff < thresh
coeff.where(mask).stack()
Output:
0 2 -0.089326
2 0 -0.089326
dtype: float64
Output:
0 1 0.319612
2 -0.089326
1 0 0.319612
2 -0.687399
2 0 -0.089326
1 -0.687399
dtype: float64
Upvotes: 1