Reputation: 713
I would like to filter out columns with 'many' zero values in pandas. It is not difficult to filter columns consist of 'all zero values', but what I want to do is filter columns with 'many zero values', for example, more than 75% of the column values.
Dataset (A has 3 zeros of 4 values, which is 75% of the column values.)
A B C
0 0 0
0 0 1
0 1 2
1 2 3
Result
B C
0 0
0 1
1 2
2 3
Are there any suggestion?
Upvotes: 2
Views: 509
Reputation: 862521
You can use mean
of boolean mask
, where True
values are converted to 1
, and then invert condition - >=
to <
for filter all columns with less mean
as 0.75
:
print ((df == 0).mean())
a 0.75
b 0.50
c 0.25
dtype: float64
print ((df == 0).mean() < 0.75)
a False
b True
c True
dtype: bool
df = df.loc[:, (df == 0).mean() < 0.75]
print (df)
b c
0 0 1
1 0 1
2 1 0
3 1 1
Upvotes: 3
Reputation: 30605
You can use list comprehension , sum
function and drop
i.e
df = pd.DataFrame({'a':[0,0,0,1],'b':[0,0,1,1],'c':[1,1,0,1]})
mask = [i for i,j in enumerate((df == 0).astype(int).sum(axis=0).values) if (j/df.shape[0])>=0.75]
df = df.drop(df.columns[mask], axis=1)
You can set the percentile as you like.
Output : if 75 percentile i.e (j/df.shape[0])>=0.75
b c 0 0 1 1 0 1 2 1 0 3 1 1
Output : if 50 percentile i.e (j/df.shape[0])>=0.50
c 0 1 1 1 2 0 3 1
Upvotes: 1