Nadav Kiani

Reputation: 331

Automatically decide which feature to drop from a correlation matrix in Python

I want to automate my code so that it can run over several files. For each file, I want to compute a correlation matrix, set a threshold, and, if the correlation between two columns is higher than the threshold, choose one of them and drop it from the data frame. I want to repeat this process until no remaining correlation is higher than the threshold.

Does anyone have an idea of how to approach this issue? Thanks!

Upvotes: 1

Views: 2775

Answers (2)

Aditya Venkat

Reputation: 11

Since the pairwise correlations do not change when a column is dropped, the correlation matrix only needs to be computed once. For each pair whose correlation is above the threshold, drop one of the two variables. There are two main ways to choose which one to drop:

  1. Check correlation with the dependent variable and drop the variable with lower correlation
  2. Check the mean correlation of both variables with all variables and drop the one with higher mean correlation

More details and code can be found here
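
The two selection rules above can be sketched roughly as follows. This is only an illustration, not the code from the linked post: the function name `drop_correlated`, its parameters, and the demo data are made up for the example, and it assumes an all-numeric pandas DataFrame.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold, target=None):
    """Drop one column of the most correlated pair until no pair of
    columns has an absolute correlation above `threshold`.

    Hypothetical helper: if `target` (a Series) is given, the column
    less correlated with it is dropped (rule 1); otherwise the column
    with the higher mean correlation to all others is dropped (rule 2).
    """
    df = df.copy()
    while df.shape[1] > 1:
        values = df.corr().abs().to_numpy().copy()
        # Treat undefined correlations (e.g. constant columns) as 0
        values = np.nan_to_num(values)
        # Ignore each column's correlation with itself
        np.fill_diagonal(values, 0.0)
        if values.max() <= threshold:
            break
        # Locate the most correlated remaining pair
        i, j = np.unravel_index(np.argmax(values), values.shape)
        a, b = df.columns[i], df.columns[j]
        if target is not None:
            # Rule 1: keep the column more correlated with the target
            score_a = abs(df[a].corr(target))
            score_b = abs(df[b].corr(target))
            to_drop = a if score_a < score_b else b
        else:
            # Rule 2: drop the column with the higher mean correlation
            to_drop = a if values[i].mean() > values[j].mean() else b
        df = df.drop(columns=to_drop)
    return df


# Small demo on synthetic data: "b" is a noisy copy of "a"
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
print(drop_correlated(demo, threshold=0.9).columns.tolist())
print(drop_correlated(demo, threshold=0.9, target=demo["a"]).columns.tolist())
```

Recomputing the correlation matrix on every pass is not strictly necessary, since dropping a column leaves the other correlations unchanged, but it keeps the sketch short.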

Upvotes: 1

user2653663

Reputation: 2948

Dropping a variable doesn't change the correlation between the other variables. So you could iteratively remove the variable that has the highest number of correlations above the threshold. You may want to look into dimensionality reduction or feature importance to remove redundant variables as well.

import numpy as np

np.random.seed(42)
# 100 variables (rows) x 100 samples (columns). With this many
# variables, some pairs end up correlated by random chance.
x = np.random.random((100, 100))
# np.corrcoef treats each row as one variable by default
corr = np.abs(np.corrcoef(x))
# Set diagonal to zero to make comparison with threshold simpler
np.fill_diagonal(corr, 0)
threshold = 0.3
# Mask to keep track of what is removed
keep_idx = np.ones(x.shape[0], dtype=bool)
for i in range(x.shape[0]):
    # Create the mask from the kept indices
    mask = np.ix_(keep_idx, keep_idx)
    # Get the number of correlations above a threshold.
    counts = np.sum(corr[mask] > threshold, axis=0)
    if counts.max() == 0:
        break
    # Get the worst offender and work out what the
    # original index was
    idx = np.where(keep_idx)[0][np.argmax(counts)]
    # Update mask
    keep_idx[idx] = False
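
Since the question works with data frames, here is a rough sketch of the same counting approach wrapped in a function that takes a pandas DataFrame. The name `prune_by_count` and the demo data are made up for illustration, and the sketch assumes every column is numeric (columns are the variables here, unlike the row-wise NumPy example above).

```python
import numpy as np
import pandas as pd


def prune_by_count(df, threshold):
    """Return the names of the columns kept after repeatedly removing
    the column with the most correlations above `threshold`.

    Hypothetical helper; assumes an all-numeric DataFrame.
    """
    corr_df = df.corr().abs()
    names = corr_df.columns
    corr = np.nan_to_num(corr_df.to_numpy().copy())
    # Zero the diagonal so self-correlation never counts
    np.fill_diagonal(corr, 0)
    keep = np.ones(corr.shape[0], dtype=bool)
    while True:
        sub = corr[np.ix_(keep, keep)]
        counts = (sub > threshold).sum(axis=0)
        if counts.size == 0 or counts.max() == 0:
            break
        # Map the worst offender back to its original position
        keep[np.flatnonzero(keep)[counts.argmax()]] = False
    return names[keep].tolist()


# Demo: "b" and "c" are noisy copies of "a", "d" is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=200),
    "c": base + 0.1 * rng.normal(size=200),
    "d": rng.normal(size=200),
})
kept = prune_by_count(demo, threshold=0.9)
reduced = demo[kept]
print(kept)
```

To cover several files, the function could be called once per file after loading each one, for example with `pd.read_csv`.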

Upvotes: 1
