Django0602
Django0602

Reputation: 807

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.

#Looking for missing data and then handling it accordingly
def find_missing(data):
# number of missing values
count_missing = data_final.isnull().sum().values
# total records
total = data_final.shape[0]
# percentage of missing
ratio_missing = count_missing/total
# return a dataframe to show: feature name, # of missing and % of missing
return pd.DataFrame(data={'missing_count':count_missing, 'missing_ratio':ratio_missing}, 
index=data.columns.values)
find_missing(data_final).head(5)

What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.

The data set looks like this

A     B      C      D
123   ABC    X      Y
123   ABC    X      Y
NaN   ABC    NaN   NaN
123   ABC    NaN   NaN
245   ABC    NaN   NaN
345   ABC    NaN   NaN

In the output I would just want to see :

     missing_count   missing_ratio
C    4               0.66
D    4               0.66

and not the columns A and B as there are no missing values there

Upvotes: 1

Views: 2507

Answers (3)

Henry Yik
Henry Yik

Reputation: 22503

You can also use concat:

s = df.isnull().sum()
result = pd.concat([s,s/len(df)],1)
result.columns = ["missing_count","missing_ratio"]
print (result)

   missing_count  missing_ratio
A              1       0.166667
B              0       0.000000
C              4       0.666667
D              4       0.666667

Upvotes: 1

ansev
ansev

Reputation: 30930

Use DataFrame.isna with DataFrame.sum to count by columns. We can also use DataFrame.isnull instead DataFrame.isna.

new_df = (df.isna()
            .sum()
            .to_frame('missing_count')
            .assign(missing_ratio = lambda x: x['missing_count']/len(df))
            .loc[df.isna().any()] )
print(new_df)

We can also use pd.concat instead DataFrame.assign

count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
                     count.div(len(df))
                          .rename('missing_ratio')],axis = 1)
             .loc[count.ne(0)])

Output

   missing_count  missing_ratio
A              1       0.166667
C              4       0.666667
D              4       0.666667

Upvotes: 2

Umar.H
Umar.H

Reputation: 23099

IIUC, we can assign the missing and total count to two variables do some basic math and assign back to a df.

a = df.isnull().sum(axis=0)

b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0),2)


missing_df = pd.DataFrame({'missing_vals' : a,
             'missing_ratio' : b})

print(missing_df)
   missing_vals  ratio
A             1   0.17
B             0   0.00
C             4   0.67
D             4   0.67

you can filter out columns that don't have any missing vals

missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals  ratio
A             1   0.17
C             4   0.67
D             4   0.67

Upvotes: 1

Related Questions