Stanislav Jirak

Reputation: 853

Is there a way to automatically select only the features with a good correlation from a large dataset?

I have a dataset with 50+ columns and would like to drop the features that correlate weakly with a target using a loop, so I don't need to drop them manually.

I've tried:

for feature in df:
        if df[feature].corr() < threshold: df.drop(feature, axis=1, inplace=True)

...which obviously does not work. I'm quite new to Python.

Advice would be appreciated.

Upvotes: 0

Views: 103

Answers (1)

perl

Reputation: 9941

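Assuming that the target is in df['y'], pass it to Series.corr for each feature (corr needs the other series as an argument, which is why calling it with no arguments fails):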

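import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': range(500),
    'b': np.random.randint(0, 500, 500),
    'c': range(500),
    'd': np.random.randint(0, 500, 500),
    'y': range(500)})

threshold = 0.5
# Loop over a list of feature names built up front, so deleting columns is safe
for feature in [c for c in df.columns if c != 'y']:
    # Drop the feature if its absolute correlation with the target is too weak
    if abs(df[feature].corr(df['y'])) < threshold:
        del df[feature]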

df.head()

Output:

   a  c  y
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
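
The same filtering can probably be done without an explicit loop. The following is only a sketch, not part of the original answer: the helper name drop_weak_features is made up for illustration, it relies on DataFrame.corrwith (Pearson correlation by default), and it assumes you want non-numeric columns left untouched and columns with an undefined correlation (for example, constant columns) dropped.

```python
import numpy as np
import pandas as pd


def drop_weak_features(df, target, threshold=0.5):
    """Return a copy of df without numeric features whose absolute
    Pearson correlation with df[target] is below threshold.

    Hypothetical helper for illustration; non-numeric columns are kept as-is.
    """
    # Only numeric feature columns can be correlated with the target
    features = df.drop(columns=[target]).select_dtypes(include="number")
    # Absolute correlation of every feature column with the target
    corr = features.corrwith(df[target]).abs()
    # NaN >= threshold is False, so undefined correlations count as weak
    weak = corr[~(corr >= threshold)].index
    return df.drop(columns=weak)


# Same shape of data as in the answer above, with a seeded generator
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'a': range(500),
    'b': rng.integers(0, 500, 500),
    'c': range(500),
    'd': rng.integers(0, 500, 500),
    'y': range(500)})

reduced = drop_weak_features(df, 'y', threshold=0.5)
print(list(reduced.columns))
```

Returning a new DataFrame rather than deleting columns in place keeps the original data around, which is handy if you want to try several thresholds.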

Upvotes: 1
