z991
z991

Reputation: 713

Filter columns by the percentile of values in Pandas

I would like to filter out columns with 'many' zero values in pandas. It is not difficult to filter columns consist of 'all zero values', but what I want to do is filter columns with 'many zero values', for example, more than 75% of the column values.

Dataset (A has 3 zeros of 4 values, which is 75% of the column values.)

A B C
0 0 0
0 0 1
0 1 2
1 2 3

Result

B C
0 0
0 1 
1 2
2 3

Are there any suggestion?

Upvotes: 2

Views: 509

Answers (2)

jezrael
jezrael

Reputation: 862521

You can use mean of boolean mask, where True values are converted to 1, and then invert condition - >= to < for filter all columns with less mean as 0.75:

print ((df == 0).mean())
a    0.75
b    0.50
c    0.25
dtype: float64

print ((df == 0).mean() < 0.75)
a    False
b     True
c     True
dtype: bool

df = df.loc[:, (df == 0).mean() < 0.75]
print (df)
   b  c
0  0  1
1  0  1
2  1  0
3  1  1

Upvotes: 3

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

You can use list comprehension , sum function and drop i.e

df = pd.DataFrame({'a':[0,0,0,1],'b':[0,0,1,1],'c':[1,1,0,1]})
mask = [i for i,j in enumerate((df == 0).astype(int).sum(axis=0).values)  if (j/df.shape[0])>=0.75]
df = df.drop(df.columns[mask], axis=1)

You can set the percentile as you like.

Output : if 75 percentile i.e (j/df.shape[0])>=0.75

 
   b  c
0  0  1
1  0  1
2  1  0
3  1  1

Output : if 50 percentile i.e (j/df.shape[0])>=0.50

   c
0  1
1  1
2  0
3  1

Upvotes: 1

Related Questions