Dropping columns on a dataframe based on their count of values

Question

Hi, I am new to pandas and struggling with a manipulation. I have a dataframe df with a huge number of columns, and I only want to keep the columns that have a count of more than 5000 values.

I tried the loop below but it does not work. Is there an easy way to do this? Also, is there a function I could create to apply this to any dataframe where I want to keep only the columns with n values or more?

for column in df.columns: 
   if df[column].count() > 5000: 
      column = column
   else: 
      df[column].drop()

Thanks

Erfan · Accepted Answer

We can use DataFrame.dropna which has the argument thresh, for example:

import pandas as pd
import numpy as np

# example dataframe
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan]
})


   A    B    C   D
0  1  4.0  NaN NaN
1  2  5.0  NaN NaN
2  3  NaN  6.0 NaN

Here we set the threshold to 2; in your case it would be 5000:

df.dropna(thresh=2, axis=1)

   A    B
0  1  4.0
1  2  5.0
2  3  NaN

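Notice that columns C and D were dropped because they had fewer than 2 non-NaN values. Keep in mind that thresh is a minimum: a column is kept when it has at least that many non-NaN values, so thresh=5000 keeps columns with 5000 or more.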
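For the second part of the question (a reusable function for any dataframe and any n), here is a minimal sketch. The function names keep_columns_with_min_count and keep_columns_by_mask are made up for illustration; they are not part of pandas. The first wraps dropna with thresh, and the second does the same thing with a boolean mask built from df.count(), which may be easier to adapt if you need a strict "more than n" comparison.

```python
import numpy as np
import pandas as pd


def keep_columns_with_min_count(df, n):
    """Return a new dataframe with only the columns that have
    at least n non-null values (hypothetical helper name)."""
    # thresh=n keeps a column when it has n or more non-NaN entries
    return df.dropna(axis=1, thresh=n)


def keep_columns_by_mask(df, n):
    """Same result, using a boolean mask over the columns."""
    # df.count() returns the number of non-null values per column
    return df.loc[:, df.count() >= n]


# same example dataframe as above
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan],
})

result = keep_columns_with_min_count(df, 2)
print(list(result.columns))  # ['A', 'B']
```

Neither function modifies df in place, so assign the result back (df = keep_columns_with_min_count(df, 5000)) if you want to replace the original. If you need strictly more than 5000 values, pass n=5001 or change the mask to df.count() > 5000.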
