Reputation: 331
I want to automate my code so that it can run over several files. For each file, I want to compute a correlation matrix, set a threshold, and, if the correlation between two columns is higher than the threshold, choose one of them and drop it from the data frame. I want to repeat this process until no correlation is higher than the threshold.
Does anyone have an idea of how to approach this issue? Thanks!
Upvotes: 1
Views: 2775
Reputation: 11
Since the correlation matrix is constant, we need to pick the pairs whose correlation is above the threshold and drop one variable from each pair. There are two main ways to choose which variable to drop. You can either:
More details and code can be found here
Upvotes: 1
Reputation: 2948
Dropping a variable doesn't change the correlation between the other variables. So you could iteratively remove the variable that has the highest number of correlations above the threshold. You may want to look into dimensionality reduction or feature importance to remove redundant variables as well.
import numpy as np
np.random.seed(42)
# 100 variables, 100 samples, to make some features
# highly correlated by random chance
x = np.random.random((100, 100))
corr = abs(np.corrcoef(x))
# Set diagonal to zero to make comparison with threshold simpler
np.fill_diagonal(corr, 0)
threshold = 0.3
# Mask to keep track of what is removed
keep_idx = np.ones(x.shape[0], dtype=bool)
for i in range(x.shape[0]):
    # Create the mask from the kept indices
    mask = np.ix_(keep_idx, keep_idx)
    # Get the number of correlations above a threshold.
    counts = np.sum(corr[mask] > threshold, axis=0)
    print(counts.shape)
    if max(counts) == 0:
        break
    # Get the worst offender and work out what the
    # original index was
    idx = np.where(keep_idx)[0][np.argmax(counts)]
    # Update mask
    keep_idx[idx] = False
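Since the question is about a data frame, the same greedy loop could be wrapped for pandas roughly as below. This is only a sketch under some assumptions: the function name `drop_correlated` is made up for illustration, only numeric columns are considered, absolute Pearson correlation is used, and ties are broken by taking the first column with the most correlations above the threshold.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Greedily drop numeric columns until no absolute pairwise
    correlation among the remaining ones exceeds `threshold`.

    Hypothetical helper, not part of pandas or NumPy.
    """
    numeric = df.select_dtypes(include="number")
    # Absolute correlation between columns, computed once
    corr = numeric.corr().abs().to_numpy(copy=True)
    # Zero the diagonal so a column is never compared with itself
    np.fill_diagonal(corr, 0.0)

    keep = np.ones(corr.shape[0], dtype=bool)
    while keep.any():
        # Count, for each kept column, how many kept columns it
        # correlates with above the threshold
        counts = (corr[np.ix_(keep, keep)] > threshold).sum(axis=0)
        if counts.max() == 0:
            break
        # Map the position in the reduced matrix back to the original index
        worst = np.flatnonzero(keep)[np.argmax(counts)]
        keep[worst] = False

    return df.drop(columns=numeric.columns[~keep])


# Small example: "b" is almost a multiple of "a", "c" is unrelated
example = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [1, -1, 1, -1, 1, -1],
})
reduced = drop_correlated(example, threshold=0.9)
print(list(reduced.columns))
```

To run this over several files, one could call the function on each data frame after loading it, for example with `pd.read_csv`, using the same threshold each time.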
Upvotes: 1