Is there a way to automatically select only features with a good correlation from a large dataset?

Question

I have a dataset with 50+ columns and would like to drop the features that correlate only weakly with a target, using a loop so I don't need to drop them manually.

I've tried:

for feature in df:
        if df[feature].corr() < threshold: df.drop(feature, axis=1, inplace=True)

...which obviously does not work. I'm quite new to Python.

Advice would be appreciated.

perl · Accepted Answer

Series.corr() needs the series to compare against as its argument, so pass it the target column. Assuming that the target is in df['y']:

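import numpy as np
import pandas as pd

# Example data: 'a' and 'c' track the target exactly, 'b' and 'd' are random noise
df = pd.DataFrame({
    'a': range(500),
    'b': np.random.randint(0, 500, 500),
    'c': range(500),
    'd': np.random.randint(0, 500, 500),
    'y': range(500)})

threshold = 0.5
# Loop over a separate list of names so columns can be deleted safely inside the loop
for feature in [c for c in df.columns if c != 'y']:
    # Use the absolute value so strongly negative correlations are kept too
    if abs(df[feature].corr(df['y'])) < threshold:
        del df[feature]

df.head()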

Output:

   a  c  y
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
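
With 50+ columns, the same filter can also be written without an explicit loop. The following is a minimal sketch, not part of the original answer: it assumes all features of interest are numeric, and the names select_correlated and demo are hypothetical, chosen only for illustration. It uses DataFrame.corrwith to correlate every feature with the target in one call and returns a new frame instead of modifying df in place.

```python
import numpy as np
import pandas as pd


def select_correlated(df, target, threshold=0.5):
    """Return a new frame with the target and the features whose absolute
    Pearson correlation with the target is at least `threshold`.

    Hypothetical helper for illustration; non-numeric columns are dropped
    (an assumption of this sketch).
    """
    # Keep numeric feature columns only, excluding the target itself
    features = df.drop(columns=[target]).select_dtypes(include="number")
    # Correlation of each feature column with the target, as a Series
    corr = features.corrwith(df[target]).abs()
    # NaN correlations (e.g. constant columns) fail the comparison and are dropped
    keep = corr[corr >= threshold].index.tolist()
    return df[keep + [target]]


# Same shape of example data as above, with a seeded generator for repeatability
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'a': range(500),
    'b': rng.integers(0, 500, 500),
    'c': range(500),
    'd': rng.integers(0, 500, 500),
    'y': range(500)})

selected = select_correlated(demo, 'y', threshold=0.5)
print(list(selected.columns))  # ['a', 'c', 'y']
```

Returning a new frame leaves the original data untouched, which makes it easy to try several thresholds before committing to one.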
