Dtm

Reputation: 46

What's the most pythonic way to count NaN's in a df agg() function?

I want to add a column after the agg() function with the number of non-NaNs. "len" does not distinguish between NaNs and non-NaNs. In the example below, I am exploiting the fact that np.sum and np.mean only take non-NaNs into account. But surely there is a better way of reaching the same result?

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 1, 1, 0, 0, 0], 'b': [4, np.NaN, 2, 1, 6, np.NaN]})

df = df.groupby("a")['b'].agg([np.sum, np.mean, np.median, np.std, len])
df['count_non_nan'] = df['sum'] / df['mean']
df

Upvotes: 0

Views: 96

Answers (2)

Valdi_Bo

Reputation: 31011

Define a function counting non-null values in the source Series:

def countNotNull(grp):
    return grp.notnull().sum()

Don't be misled by the name of the notnull function: it treats NaN values as null as well (not only None), so they are excluded from the count.
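To see this behaviour in isolation, a minimal sketch (the Series values here are illustrative, not from the question):

```python
import pandas as pd
import numpy as np

s = pd.Series([4, np.nan, 2, None])
# notnull() returns False for both NaN and None
print(s.notnull().tolist())   # [True, False, True, False]
print(s.notnull().sum())      # 2
```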

Then include this function in your aggregation:

df.groupby("a")['b'].agg([np.sum, np.mean, np.median, np.std, len, countNotNull])

The result is:

   sum  mean  median       std  len  countNotNull
a                                                
0  7.0   3.5     3.5  3.535534  3.0           2.0
1  6.0   3.0     3.0  1.414214  3.0           2.0
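As a side note (an alternative, not the answer's own method): pandas also accepts the string aggregations 'size' and 'count', where 'size' counts every row in the group and 'count' counts only the non-NaN values, so no helper function is needed. A minimal sketch on the question's data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 1, 0, 0, 0], 'b': [4, np.nan, 2, 1, 6, np.nan]})
# 'size' includes NaNs, 'count' excludes them
out = df.groupby("a")['b'].agg(['sum', 'mean', 'median', 'std', 'size', 'count'])
print(out)
```

The 'count' column matches countNotNull above (2 per group), while 'size' matches len (3 per group).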

Upvotes: 0

Hashem

Reputation: 37

Replace the np.NaN values with 0:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 1, 0, 0, 0], 'b': [4, np.NaN, 2, 1, 6, np.NaN]})
df = df.replace(np.nan, 0)
df = df.groupby("a")['b'].agg([np.sum, np.mean, np.median, np.std, len])
df['count_non_nan'] = df['sum'] / df['mean']
print(df)

Upvotes: 1

Related Questions