Reputation: 898
I have a dataframe like this:
CLASS FEATURE1 FEATURE2 FEATURE3
X     A        NaN      NaN
X     NaN      A        NaN
B     A        A        A
I want to group by the label CLASS and count the NaN values in every feature, so that the result looks like this. The purpose is to get a general idea of how missing values are distributed over the different classes.
CLASS FEATURE1 FEATURE2 FEATURE3
X     1        1        2
B     0        0        0
I know how to get the number of non-null values: df.groupby('CLASS').count()
Is there something similar for the NaN values?
I tried to subtract the count() from the size(), but it returned an unformatted output filled with NaN values.
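For reference, this is roughly what I tried; I suspect the problem is that size() returns a Series while count() returns a DataFrame, so the plain subtraction aligns the Series index against the column labels and everything becomes NaN:

# size() is a Series indexed by CLASS, count() is a DataFrame,
# so `-` aligns them on the column labels and yields all-NaN output
df.groupby('CLASS').size() - df.groupby('CLASS').count()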
Upvotes: 40
Views: 32199
Reputation: 153510
Update due to FutureWarning:

FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().

So instead of

df.set_index('CLASS').isna().sum(level=0)

use

df.set_index('CLASS').isna().groupby(level=0).sum()

You can use set_index and sum:

# Will be deprecated soon; use the groupby statement above instead.
df.set_index('CLASS').isna().sum(level=0)
Output:
       FEATURE1  FEATURE2  FEATURE3
CLASS
X           1.0       1.0       2.0
B           0.0       0.0       0.0
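If you prefer integer counts over the floats shown above, you can cast the result at the end (same computation, just a dtype change):

df.set_index('CLASS').isna().groupby(level=0).sum().astype(int)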
Upvotes: 28
Reputation: 1999
A solution without groupby could be to use pivot_table with a custom aggregation function. This may communicate the intent more clearly (at least to me) than the other solutions.
Generate the dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({"CLASS": ["X", "X", "B"],
                   "FEATURE1": ["A", np.nan, "A"],
                   "FEATURE2": [np.nan, "A", "A"],
                   "FEATURE3": [np.nan, np.nan, "A"]})
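which gives:

  CLASS FEATURE1 FEATURE2 FEATURE3
0     X        A      NaN      NaN
1     X      NaN        A      NaN
2     B        A        A        A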
Generate the pivot table:

df.pivot_table(index="CLASS",
               values=["FEATURE1", "FEATURE2", "FEATURE3"],
               aggfunc=lambda x: x.isna().sum())
Out [2]:
       FEATURE1  FEATURE2  FEATURE3
CLASS
B             0         0         0
X             1         1         2
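As a small variation, I believe values can be omitted here, in which case pivot_table applies the aggregation function to every non-index column:

# Sketch: same result, without listing the feature columns explicitly
df.pivot_table(index="CLASS", aggfunc=lambda x: x.isna().sum())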
Moreover, when looking at performance, this solution seems to be the fastest.
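If you want to verify that claim on your own data, a quick check along the lines of the benchmark further down (uses IPython's %timeit; numbers will vary by machine and pandas version):

big = pd.concat([df] * 10000)
%timeit big.pivot_table(index="CLASS",
                        values=["FEATURE1", "FEATURE2", "FEATURE3"],
                        aggfunc=lambda x: x.isna().sum())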
Upvotes: 2
Reputation: 1999
Another solution (mostly for fun):

# Replace each feature column with its boolean NaN mask via assign,
# then sum the masks per CLASS group.
df.assign(
    **{col: df[col].isna() for col in df.columns if col != "CLASS"}
).groupby("CLASS").sum()
Upvotes: 1
Reputation: 323356
Using the diff between count and size:

g = df.groupby('CLASS')
-g.count().sub(g.size(), axis=0)
       FEATURE1  FEATURE2  FEATURE3
CLASS
B             0         0         0
X             1         1         2
And we can generalize this to the broader question of how to count the NaN values in a dataframe with a for loop:

pd.DataFrame({name: group.isna().sum() for name, group in g}).T.drop('CLASS', axis=1)
Out[468]:
   FEATURE1  FEATURE2  FEATURE3
B         0         0         0
X         1         1         2
Upvotes: 7
Reputation: 402902
Compute a mask with isna, then group and find the sum:

df.drop('CLASS', axis=1).isna().groupby(df.CLASS, sort=False).sum().reset_index()
  CLASS  FEATURE1  FEATURE2  FEATURE3
0     X       1.0       1.0       2.0
1     B       0.0       0.0       0.0
Another option is to subtract the count from the size using rsub along the 0th axis for index-aligned subtraction:

df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)
Or,
g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)
       FEATURE1  FEATURE2  FEATURE3
CLASS
B             0         0         0
X             1         1         2
There are quite a few good answers, so here are some timeits for your perusal:
df_ = df
df = pd.concat([df_] * 10000)
%timeit df.drop('CLASS', axis=1).isna().groupby(df.CLASS, sort=False).sum()
%timeit df.set_index('CLASS').isna().sum(level=0)
%%timeit
g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)
11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Actual performance depends on your data and setup, so your mileage may vary.
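If you're not in IPython, a rough equivalent with the standard timeit module (a sketch; the statement mirrors the %%timeit cell above, and the absolute numbers will differ):

import timeit

# Time the count/size subtraction; setup builds the groupby object
# once before the timed loop
elapsed = timeit.timeit("g.count().rsub(g.size(), axis=0)",
                        setup="g = df.groupby('CLASS')",
                        globals=globals(),
                        number=100)
print(f"{elapsed / 100 * 1000:.2f} ms per loop")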
Upvotes: 38