How do I remove columns in Pandas that contains non-zero in less than 1% of number of rows?

Question

I have the following dataset:

    Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8    Col9    Col10   ... 

Col991  Col992  Col993  Col994  Col995  Col996  Col997  Col998  Col999  Col1000
rows                                                                                    
Row1    0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row2    0   0   0   0   0   23  0   0   0   0   ... 0   0   0   0   7   0   0   0   0   0
Row3    97  0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row4    0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row5    0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Row496  182 0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   116 0   0   0
Row497  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row498  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row499  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row500  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   125 0   0   0

I am trying to remow columns where total number of nonzeros entries is less than 1% of the number of rows.

I can calculate the percentage of nonzeros entries columnwise

(df[df > 0.0].count()/df.shape[0])*100

How should I use this to get df with those columns where number of columns have nonzeros in more than 1% of the rows only? Further, how should I change code to remove rows where nonzeros is less than 1% of columns?

hilipati · Accepted Answer

You can use loc to get the specified columns or rows for your new df as shown in this answer, essentially you can do this:

df.loc[rows, cols]  # accepts boolean lists/arrays

So the df with removed columns can be achieved with this:

col_condition = df[df > 0].count() / df.shape[0] >= .01
df_ = df[:, col_condition]

If you need to switch between columns and rows you can simply transpose the dataframe with

df.T

So the same for rows where number of nonzeros is less than 1% of length of columns:

row_condition = df.T[df.T > 0].count() / df.shape[1] >= .01
df_ = df[row_condition]

And in a bit more concise forms:

df_ = df.loc[:, df.gt(0).mean() >= .01]  # keep columns
df_ = df[df.T.gt(0).mean() >= .01]  # keep rows

How do I remove columns in Pandas that contains non-zero in less than 1% of number of rows?

Answers (2)

Related Questions