Dylan Lunde

Reputation: 61

How to drop multiple columns based on their correlation with another column in the data frame?

I have a large data frame that I am cleaning for a linear regression model. I want to drop the columns whose correlation with my dependent variable is between -.5 and .5. What is the best way to accomplish this in as little code as possible? Here's an example of a failed attempt I'm trying to work out:

df.drop(df.loc[:, df.corrwith(df['saleprice'])] <.5 & > -.5, axis=1, inplace=True)

Upvotes: 1

Views: 377

Answers (1)

jezrael

Reputation: 862581

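Use Series.between with both bounds excluded (inclusive=False in older pandas; from pandas 1.3 on, pass inclusive='neither' instead). To drop the columns, flip the logic: build a mask of the columns whose correlation falls inside the range, invert it with ~, and select the remaining columns: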

import pandas as pd

df = pd.DataFrame({
         'a':[4,5,4,5,5,4],
         'b':[7,8,9,4,2,3],
         'c':[1,3,5,7,1,0],
         'saleprice':[5,3,6,9,2,4],
})

# inclusive='neither' excludes both bounds (pandas 1.3+; older versions used inclusive=False)
df1 = df.loc[:, ~df.corrwith(df['saleprice']).between(-.5, .5, inclusive='neither')]
print(df1)
   c  saleprice
0  1          5
1  3          3
2  5          6
3  7          9
4  1          2
5  0          4

Detail:

print(df.corrwith(df['saleprice']).between(-.5, .5, inclusive='neither'))
a             True
b             True
c            False
saleprice    False
dtype: bool
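
If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name drop_weakly_correlated and its threshold argument are made up for illustration, and it assumes every column is numeric. It compares the absolute correlation against the threshold instead of calling between, which selects the same columns for the sample data above. One difference to be aware of: a column whose correlation is NaN (for example a constant column) is dropped here, whereas the inverted between mask keeps it.

```python
import pandas as pd


def drop_weakly_correlated(df, target, threshold=0.5):
    """Return df without the columns whose |correlation with target| < threshold.

    Hypothetical helper; assumes all columns of df are numeric.
    """
    corr = df.corrwith(df[target])
    # Keep columns at or beyond the threshold in either direction.
    # NaN correlations compare as False, so those columns are dropped.
    keep = corr.abs() >= threshold
    return df.loc[:, keep]


df = pd.DataFrame({
    'a': [4, 5, 4, 5, 5, 4],
    'b': [7, 8, 9, 4, 2, 3],
    'c': [1, 3, 5, 7, 1, 0],
    'saleprice': [5, 3, 6, 9, 2, 4],
})

result = drop_weakly_correlated(df, 'saleprice')
print(result.columns.tolist())
```

The target column correlates perfectly with itself, so it always survives the filter. If you would rather modify the frame in place, as in the question, you can pass the weak column labels to df.drop(columns=..., inplace=True) instead of selecting with loc.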

Upvotes: 3
