Arsik36

Reputation: 327

In Python, how do I view the percentage of missing values in each column?

I am a new data scientist, and I am trying to write code that calculates the percentage of missing values in each column of a data frame.

Here is a reproducible example:

import pandas as pd

my_df = pd.DataFrame([[None, 2, 3],
                      [4, None, 6],
                      [7, 8, None]])

In this data frame, one of the three values in each column is missing, so each column is 33.3% missing. The code I wrote to try to solve this is:

my_df.isnull().sum() / my_df.count()

This returns 0.5 for each column rather than 0.33. As I learned while writing it, count() ignores missing values and counts only the non-null ones.
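To make the mismatch concrete, here is a minimal sketch that prints the pieces of that expression separately. The variable names (missing, non_null, total_rows, ratio_to_non_null) are only illustrative.

```python
import pandas as pd

my_df = pd.DataFrame([[None, 2, 3],
                      [4, None, 6],
                      [7, 8, None]])

missing = my_df.isnull().sum()   # missing values per column
non_null = my_df.count()         # non-null values per column (nulls are skipped)
total_rows = len(my_df)          # number of rows, nulls included

# Dividing by count() compares missing values to non-missing values,
# not to the total number of rows.
ratio_to_non_null = missing / non_null

print(missing.tolist(), non_null.tolist(), total_rows)
print(ratio_to_non_null.tolist())
```

Each column has 1 missing and 2 non-null values out of 3 rows, so the ratio is 1/2 instead of 1/3.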

How can I fix this so that the proportion of missing values in each column comes out as 0.33 rather than 0.5?

Thank you!

Upvotes: 1

Views: 5251

Answers (3)

Hitul Adatiya

Reputation: 47

Try the snippet below, where df is your data frame (my_df in the question). It gives the percentage of missing values in each column, sorted from most to least missing and rounded to two decimal places.

percent_missing = (df.isnull().sum().sort_values(ascending = False) * 100 / len(df)).round(2)
percent_missing
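As a rough illustration of what the sort and rounding do, here is the same expression applied to a small made-up frame. The columns a, b and c are hypothetical and chosen only so that the missing counts differ.

```python
import pandas as pd

# Hypothetical data: column "a" has 3 missing values, "b" has 1, "c" has none.
df = pd.DataFrame({
    "a": [1, None, None, None],
    "b": [1, 2, None, 4],
    "c": [1, 2, 3, 4],
})

# Count nulls per column, sort descending, then convert to a percentage of rows.
percent_missing = (df.isnull().sum().sort_values(ascending=False) * 100 / len(df)).round(2)
print(percent_missing)
```

On the data frame from the question, the same expression gives 33.33 for every column.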

Upvotes: 0

Ricardo

Reputation: 374

Give this a try:

my_df.isnull().sum()/len(my_df)

Upvotes: 1

Prune

Reputation: 77900

You have it in front of you -- assuming that you want to use your existing code as a starting point. count omits the null values, but you counted them in the numerator. Simply add that value to the denominator:

my_df.isnull().sum() / ( my_df.count() + my_df.isnull().sum() )

Note that Python does not cache the repeated sub-expression, so my_df.isnull().sum() is evaluated twice here. Store the result in a variable if you want only one chain of calls.

Better yet, use len to get the denominator; the resulting code is much easier to read.
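A short sketch of both denominators side by side, assuming the my_df from the question. The names n_missing, via_count and via_len are only illustrative.

```python
import pandas as pd

my_df = pd.DataFrame([[None, 2, 3],
                      [4, None, 6],
                      [7, 8, None]])

# Compute the null counts once and reuse them.
n_missing = my_df.isnull().sum()

# Denominator rebuilt from count() plus the nulls it skipped.
via_count = n_missing / (my_df.count() + n_missing)

# Denominator taken directly from the number of rows.
via_len = n_missing / len(my_df)

print(via_count)
print(via_len)
```

Both versions give one third for each column. The len version just avoids reconstructing the row count from two separate pieces.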

Upvotes: 1
