antonioACR1
antonioACR1

Reputation: 1383

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone could help me. I'm new to Python, and I have a dataframe with 111 columns and over 40 000 rows. All the columns contain NaN values (some columns contain more NaN's than others), so I want to drop those columns having at least 80% of NaN values. How can I do this?

To solve my problem, I tried the following code

df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)

The function x.isnull().sum()/len(x) is to divide the number of NaN in the column x by the length of x, and the part < 0.8 is to choose those columns containing less than 80% of NaN.

The problem is that when I run this code I only get the names of the columns together with the boolean "True" but I want the entire columns, not just the names. What should I do?

Upvotes: 0

Views: 310

Answers (2)

jotasi
jotasi

Reputation: 5177

You want to achieve two things. First, you have to find the indices of all columns which contain at most 80% NaNs. Second, you want to discard them from your DataFrame.

To get a pandas Series indicating whether a row should be discarded by doing, you can do:

df1 = df.isnull().sum(axis=0) < 0.8*df.shape[1]

(Btw. you have a typo in your question. You should drop the ==True as it always tests whether 0.5==True)

This will give True for all column indices to keep, as .isnull() gives True (or 1) if it is NaN and False (or 0) for a valid number for every element. Then the .sum(axis=0) sums along the columns giving the number of NaNs in each column. The comparison is then, if that number is bigger than 80% of the number of columns.

For the second task, you can use this to index your columns by using:

df = df[df.columns[df1]]

or as suggested in the comments by doing:

df.drop(df.columns[df1==False], axis=1, inplace=True)

Upvotes: 2

Israel Unterman
Israel Unterman

Reputation: 13510

You could do this:

filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]

Upvotes: 2

Related Questions