Reputation: 1383
I hope someone can help me. I'm new to Python, and I have a DataFrame with 111 columns and over 40,000 rows. All the columns contain NaN values (some columns contain more NaNs than others), and I want to drop the columns that are at least 80% NaN. How can I do this?
To solve my problem, I tried the following code
df1 = df.apply(lambda x: x.isnull().sum()/len(x) < 0.8, axis=0)
The expression x.isnull().sum()/len(x) divides the number of NaNs in column x by the length of x, and the comparison < 0.8 selects the columns containing less than 80% NaN.
The problem is that when I run this code, I only get the column names together with a boolean True or False, but I want the entire columns, not just the names. What should I do?
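For reference, here is a minimal example with toy data (hypothetical values, not my real frame) showing what I get:

import numpy as np
import pandas as pd

# Toy stand-in for the real data: column "a" is 80% NaN, column "b" is 20% NaN
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0, 5.0],
})

df1 = df.apply(lambda x: x.isnull().sum()/len(x) < 0.8, axis=0)
print(df1)
# a    False
# b     True
# dtype: bool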
Upvotes: 0
Views: 310
Reputation: 5177
You want to achieve two things. First, you have to find the indices of all columns which contain at most 80% NaNs. Second, you want to discard those columns from your DataFrame.
To get a pandas Series indicating whether each column should be kept, you can do:
df1 = df.isnull().sum(axis=0) < 0.8 * df.shape[0]
This will give True for all column indices to keep, as .isnull() gives True (or 1) for every element that is NaN and False (or 0) for every valid number. Then .sum(axis=0) sums down each column, giving the number of NaNs in that column. The comparison then checks whether that count is smaller than 80% of the number of rows.
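To sketch that logic, here is the same computation on a tiny hypothetical frame (column names and values made up for illustration):

import numpy as np
import pandas as pd

# "sparse" is 80% NaN, "dense" is only 20% NaN
df = pd.DataFrame({
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "dense":  [1.0, 2.0, np.nan, 4.0, 5.0],
})

print(df.isnull().sum(axis=0))                      # NaN count per column: sparse 4, dense 1
df1 = df.isnull().sum(axis=0) < 0.8 * df.shape[0]   # df.shape[0] is the number of rows (5 here)
print(df1)                                          # sparse False (4 is not < 4.0), dense True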
For the second task, you can use this Series to index your columns:
df = df[df.columns[df1]]
or, as suggested in the comments:
df.drop(df.columns[~df1], axis=1, inplace=True)
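Putting both steps together on the same hypothetical toy frame, the two variants keep exactly the same columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% NaN, should go
    "dense":  [1.0, 2.0, np.nan, 4.0, 5.0],           # 20% NaN, should stay
})
df1 = df.isnull().sum(axis=0) < 0.8 * df.shape[0]

kept = df[df.columns[df1]]                        # variant 1: select the columns to keep
df.drop(df.columns[~df1], axis=1, inplace=True)   # variant 2: drop the rest in place

assert list(kept.columns) == list(df.columns) == ["dense"]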
Upvotes: 2
Reputation: 13510
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
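For example, on a small hypothetical frame with one column that is 80% NaN and one that is 20% NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "dense":  [1.0, 2.0, np.nan, 4.0, 5.0],
})

filt = df.isnull().sum()/len(df) < 0.8   # fraction of NaNs per column, compared to 0.8
df1 = df.loc[:, filt]                    # keep only the columns where filt is True
print(df1.columns.tolist())              # ['dense']

Note that df.isnull().mean() < 0.8 is an equivalent, slightly shorter spelling of the same filter.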
Upvotes: 2