Reputation: 61
I have a large DataFrame that I am cleaning for a linear regression model. I want to drop every column whose correlation with my dependent variable is between -.5 and .5. What is the best way to do this in as little code as possible? Here is a failed attempt I'm trying to work out:
df.drop(df.loc[:, df.corrwith(df['saleprice'])] <.5 & > -.5, axis=1, inplace=True)
Upvotes: 1
Views: 377
Reputation: 862581
Use Series.between with inclusive='neither' (spelled inclusive=False before pandas 1.3). To drop columns, flip the logic: build a mask of the weakly correlated columns, invert it with ~, and select the columns that remain:
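import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 4, 5, 5, 4],
    'b': [7, 8, 9, 4, 2, 3],
    'c': [1, 3, 5, 7, 1, 0],
    'saleprice': [5, 3, 6, 9, 2, 4],
})

# between() flags columns whose correlation with saleprice lies strictly inside (-.5, .5);
# ~ inverts the mask so only the strongly correlated columns are kept.
# inclusive='neither' needs pandas >= 1.3; older versions use inclusive=False.
df1 = df.loc[:, ~df.corrwith(df['saleprice']).between(-.5, .5, inclusive='neither')]
print(df1)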
   c  saleprice
0  1          5
1  3          3
2  5          6
3  7          9
4  1          2
5  0          4
Detail:
print(df.corrwith(df['saleprice']).between(-.5, .5, inclusive='neither'))
a             True
b             True
c            False
saleprice    False
dtype: bool
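If you would rather keep the df.drop form from the question, the same filter can be sketched with an absolute-value comparison, which should be equivalent to the strict between(-.5, .5) check. This is only an illustration on the sample data above; the names threshold, weak_cols and df_strong are made up for the example and are not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 4, 5, 5, 4],
    'b': [7, 8, 9, 4, 2, 3],
    'c': [1, 3, 5, 7, 1, 0],
    'saleprice': [5, 3, 6, 9, 2, 4],
})

threshold = 0.5  # assumed cutoff, taken from the question

# Correlation of every column with the target column
corr = df.corrwith(df['saleprice'])

# Labels of columns whose absolute correlation is strictly below the cutoff
weak_cols = corr.index[corr.abs() < threshold]

# drop returns a new DataFrame; df itself is left unchanged
df_strong = df.drop(columns=weak_cols)
print(df_strong.columns.tolist())
```

On the sample frame this drops a and b and keeps c and saleprice, matching df1 above. Assigning the result to a new name avoids inplace=True, which makes it easier to compare the frame before and after.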
Upvotes: 3