Reputation: 327
I am a new data scientist, and I am trying to write code that calculates the percentage of missing values in each column of a data frame.
Here is a reproducible example:
import pandas as pd
my_df = pd.DataFrame([[None, 2, 3],
                      [4, None, 6],
                      [7, 8, None]])
In this data frame, 33.3% of the values in each column are missing. The code I wrote to try to solve the problem is:
my_df.isnull().sum() / my_df.count()
This outputs 0.5 for each column because, as I learned while writing it, count() ignores missing values and counts only the non-null ones.
How can I fix this so that the proportion of missing values per column comes out as 0.33 rather than 0.5?
Thank you!
Upvotes: 1
Views: 5251
Reputation: 47
Try the snippet below. It gives the percentage of missing values per column, rounded to two decimal places and sorted from most to least missing.
percent_missing = (my_df.isnull().sum().sort_values(ascending=False) * 100 / len(my_df)).round(2)
percent_missing
Upvotes: 0
Reputation: 77900
You have it in front of you, assuming that you want to use your existing code as a starting point. count() omits the null values, but you counted them in the numerator. Simply add that value to the denominator:
my_df.isnull().sum() / ( my_df.count() + my_df.isnull().sum() )
Note that, as written, this evaluates my_df.isnull().sum() twice. Python does not cache the result for you, so assign it to a variable if the repeated work matters.
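To compute the null counts only once, the sum can be stored in a variable first. A minimal sketch using the question's my_df; the names null_counts, non_null_counts and frac_missing are mine, chosen for illustration:

```python
import pandas as pd

my_df = pd.DataFrame([[None, 2, 3],
                      [4, None, 6],
                      [7, 8, None]])

# Missing values per column, computed once and reused below.
null_counts = my_df.isnull().sum()

# count() returns only the non-null values per column.
non_null_counts = my_df.count()

# Nulls plus non-nulls gives the total number of rows in each column.
frac_missing = null_counts / (null_counts + non_null_counts)

print(frac_missing)
```

Each column should come out as one third here, since one of its three rows is null.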
Better yet, use len(my_df) to get the denominator; the resulting code is much easier to read.
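A sketch of the len-based version, again on the question's my_df. The isnull().mean() line is an equivalent shortcut that I am adding for comparison; it was not part of the original suggestion:

```python
import pandas as pd

my_df = pd.DataFrame([[None, 2, 3],
                      [4, None, 6],
                      [7, 8, None]])

# Nulls per column divided by the total number of rows.
frac_missing = my_df.isnull().sum() / len(my_df)

# Shortcut: the mean of a boolean mask is the fraction of True values.
frac_missing_alt = my_df.isnull().mean()

# Multiply by 100 if a percentage is wanted instead of a fraction.
pct_missing = (frac_missing * 100).round(2)

print(pct_missing)
```

Both forms divide by the full row count, so the missing rows are no longer dropped from the denominator the way they are with count().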
Upvotes: 1