LivingstoneM
LivingstoneM

Reputation: 1088

Count Nan per variable and represent in percentage

I am new to python but I havent found a solution to this challenge. I have data with different variable per hospital. Now I do want to count total amount of missing data per variable and represent it per hospital in percentages.Here is how the dataframe looks like:

import pandas as pd
df = pd.DataFrame([('Jorh Hospital','2018-03-15', 389.0,34, 32, 34),
                   ('Jorh Hospital','2018-04-20', np.nan,22, 5, 43),
                   ('Bugh Hospital','2019-02-16', 80.5,np.nan, 56, np.nan),
                   ('Bugh Hospital','2019-06-23', np.nan,89, 67, np.nan)],
                  columns=('Hosp_name','date', 'max_rec', 'reg_pp', 'disch_no', 'temp_rec'))
df

Now we have variables collected from different hospitals and I want to count NaNs per variable for every hospital. I have to turn hospitals to be columns and variables to be rows in final table.Here is what I have tried:

df.isna().sum()

I am unable to proceed from here and this is the final result that I want.

dff = pd.DataFrame([('max_rec','50% (1)', '50%(1)'),
                   ('reg_pp','100%(0)', '50%(1)'),
                   ('disch_no','100%(0)', '100%(0)'),
                   ('temp_rec','100%(0)', '0')],
                  columns=('variables','Jorh Hospital (N=2)', 'Bugh Hospital (N=2)'))
dff

Kindly note in the final table, I need to represent missing values in percentage but still give the number of how many are they and the column has N to represent total number of rows per variable in a specific hospital

Upvotes: 1

Views: 120

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150745

With some modification on output, this should work:

df.iloc[:,2:].notna().groupby(df['Hosp_name']).mean().T

Output:

Hosp_name  Bugh Hospital  Jorh Hospital
max_rec              0.5            0.5
reg_pp               0.5            1.0
disch_no             1.0            1.0
temp_rec             0.0            1.0

Upvotes: 3

Related Questions