Reputation: 1548
Following on from this topic: https://stackoverflow.com/questions/19384532/how-to-count-number-of-rows-per-group-and-other-statistics-in-pandas-group-by
I'd like to add one more statistic, a count of null values (a.k.a. NaN) per group, for this DataFrame:
import numpy as np
import pandas as pd

# pd.np is deprecated and unavailable in recent pandas, so use numpy directly
tdf = pd.DataFrame(columns=['indicator', 'v1', 'v2', 'v3', 'v4'],
                   data=[['A', '3', np.nan, '4', np.nan],
                         ['A', '3', '4', '4', np.nan],
                         ['B', np.nan, np.nan, np.nan, np.nan],
                         ['B', '1', None, np.nan, None],
                         ['C', '9', '7', '4', '0']])
I'd like to use something like this:
tdf.groupby('indicator').agg({'indicator': ['count']})
but with an additional null counter in a separate column, like:
tdf.groupby('indicator').agg({'indicator': ['count', 'isnull']})
This raises an error: AttributeError: Cannot access callable attribute 'isnull' of 'SeriesGroupBy' objects, try using the 'apply' method
How can I use pd.isnull() here, or something with the same functionality?
Expected output would be:
          indicator nulls
              count count
indicator
A                 2     3
B                 2     7
C                 1     0
Note that NaN and None are treated the same way here: both count as nulls.
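A quick way to check this is the minimal sketch below. It uses a throwaway Series (the name s is made up for illustration) and np.nan from NumPy rather than pd.np:

```python
import numpy as np
import pandas as pd

# One real value, one None and one NaN
s = pd.Series(['1', None, np.nan])

# isnull flags None and NaN alike
mask = s.isnull()
print(mask.tolist())  # [False, True, True]
```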
Upvotes: 1
Views: 3445
Reputation: 1548
I've found an almost satisfying answer myself (the downside: it's a bit too complicated). In R, for example, I'd use rowSums on the is.na(df) matrix. This works much the same way, but unfortunately takes more code.
def count_nulls_rowwise_by_group(tdf, group):
    # Pair the grouping column with the number of nulls in each row
    cdf = pd.concat([tdf[group], pd.isnull(tdf).sum(axis=1).rename('nulls')], axis=1)
    # Count rows and sum nulls per group
    return cdf.groupby(group).agg({group: 'count', 'nulls': 'sum'}).rename(index=str, columns={group: 'count'})

count_nulls_rowwise_by_group(tdf, 'indicator')
gives:
Out[387]:
           count  nulls
indicator
A              2      3
B              2      7
C              1      0
Upvotes: 0
Reputation: 862701
First call set_index, then flag the missing values with isnull and sum them per row. Finally, group by the index level and aggregate with count and sum:
df = tdf.set_index('indicator').isnull().sum(axis=1).groupby(level=0).agg(['count','sum'])
print (df)
           count  sum
indicator
A              2    3
B              2    7
C              1    0
Detail:
print (tdf.set_index('indicator').isnull().sum(axis=1))
indicator
A    2
A    1
B    4
B    3
C    0
dtype: int64
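The chain above can be checked end to end with a self-contained sketch. It rebuilds the sample frame with np.nan from NumPy (an assumption here, since pd.np is unavailable in recent pandas). The intermediate name nulls_per_row and the final rename of sum to nulls are only for illustration:

```python
import numpy as np
import pandas as pd

# Sample data from the question, with np.nan in place of pd.np.nan
tdf = pd.DataFrame(
    columns=['indicator', 'v1', 'v2', 'v3', 'v4'],
    data=[['A', '3', np.nan, '4', np.nan],
          ['A', '3', '4', '4', np.nan],
          ['B', np.nan, np.nan, np.nan, np.nan],
          ['B', '1', None, np.nan, None],
          ['C', '9', '7', '4', '0']])

# Step 1: number of nulls in each row, indexed by indicator
nulls_per_row = tdf.set_index('indicator').isnull().sum(axis=1)

# Step 2: per group, count the rows and sum the per-row null counts
result = nulls_per_row.groupby(level=0).agg(['count', 'sum'])

# Optional: give the sum column a more descriptive name
result = result.rename(columns={'sum': 'nulls'})
print(result)
```

Moving indicator into the index first keeps it out of the null count, so only v1 to v4 are inspected.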
Another solution is to use a custom function with GroupBy.apply:
def func(x):
    a = len(x)
    b = x.isnull().values.sum()
    return pd.Series([a,b],index=['indicator count','nulls count'])

df = tdf.set_index('indicator').groupby('indicator').apply(func)
print (df)
           indicator count  nulls count
indicator
A                        2            3
B                        2            7
C                        1            0
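If apply feels heavy, a possible alternative is named aggregation. This is only a sketch, assuming pandas 0.25 or later; the helper column nulls and the names value_cols and out are made up for illustration, and the frame is rebuilt with np.nan so that the snippet runs on its own:

```python
import numpy as np
import pandas as pd

# Sample data from the question, with np.nan in place of pd.np.nan
tdf = pd.DataFrame(
    columns=['indicator', 'v1', 'v2', 'v3', 'v4'],
    data=[['A', '3', np.nan, '4', np.nan],
          ['A', '3', '4', '4', np.nan],
          ['B', np.nan, np.nan, np.nan, np.nan],
          ['B', '1', None, np.nan, None],
          ['C', '9', '7', '4', '0']])

# Every column except the grouping key
value_cols = tdf.columns.drop('indicator')

# Add a per-row null count, then count rows and sum nulls per group
out = (
    tdf.assign(nulls=tdf[value_cols].isnull().sum(axis=1))
       .groupby('indicator')
       .agg(count=('nulls', 'count'), nulls=('nulls', 'sum'))
)
print(out)
```

The helper column never contains nulls itself, so counting it gives the number of rows in each group.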
Upvotes: 1