codlix

Reputation: 898

Groupby class and count missing values in features

I have a dataframe like this:

CLASS FEATURE1 FEATURE2 FEATURE3
  X      A       NaN      NaN
  X     NaN       A       NaN
  B      A        A        A

I want to group by the label CLASS and count the NaN values in every feature, so that the result looks like this. The purpose is to get a general idea of how missing values are distributed over the different classes.

CLASS FEATURE1 FEATURE2 FEATURE3
  X      1        1        2
  B      0        0        0

I know how to get the number of non-null values: df.groupby('CLASS').count()

Is there something similar for the NaN values?

I tried to subtract the count() from the size(), but it returned a misaligned output filled with NaN.
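
For reference, a minimal reproduction of what I tried (assuming the sample frame above):

import pandas as pd
import numpy as np

df = pd.DataFrame({"CLASS": ["X", "X", "B"],
                   "FEATURE1": ["A", np.nan, "A"],
                   "FEATURE2": [np.nan, "A", "A"],
                   "FEATURE3": [np.nan, np.nan, "A"]})

g = df.groupby('CLASS')
# size() is a Series keyed by class label while count() is a DataFrame keyed
# by the feature columns, so the bare subtraction aligns nothing and every
# cell comes out NaN.
print(g.size() - g.count())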

Upvotes: 40

Views: 32199

Answers (5)

Scott Boston

Reputation: 153510

Update due to FutureWarning:

FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().

Use this instead:

df.set_index('CLASS').isna().groupby(level=0).sum()

Original answer, using set_index and sum:

# Deprecated -- do not use. Use the groupby statement above instead.
df.set_index('CLASS').isna().sum(level=0)

Output:

       FEATURE1  FEATURE2  FEATURE3
CLASS                              
X           1.0       1.0       2.0
B           0.0       0.0       0.0
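
If the float output bothers you, a cast back to integers should work (a minor addition on top of the statement above, not from the original answer):

# Same computation, with the counts cast to integers.
df.set_index('CLASS').isna().groupby(level=0).sum().astype(int)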

Upvotes: 28

Adrien Pacifico

Reputation: 1999

A solution without groupby could be to use pivot_table with a custom aggregation function. This may communicate the intent more clearly (at least to me) than the other solutions.

Generate the dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({"CLASS": ["X", "X", "B"],
                   "FEATURE1": ["A", np.nan, "A"],
                   "FEATURE2": [np.nan, "A", "A"],
                   "FEATURE3": [np.nan, np.nan, "A"]})

Generate the pivot table:

df.pivot_table(index="CLASS",
               values=["FEATURE1", "FEATURE2", "FEATURE3"],
               aggfunc=lambda x: x.isna().sum())
Out [2]:
       FEATURE1  FEATURE2  FEATURE3
CLASS                              
B             0         0         0
X             1         1         2

Performance:

Moreover, when looking at performance, this solution seems to be the fastest.

[Benchmark plot comparing the runtimes of the different answers]

Upvotes: 2

Adrien Pacifico

Reputation: 1999

Another solution (mostly for fun):

df.assign(
    # Replace each feature column with a boolean mask of its missing values.
    **{col: df[col].isna() for col in df.columns if col != "CLASS"},
).groupby("CLASS").sum()
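
Here the dict unpacking in assign overwrites every feature column with its boolean missing-value mask, and groupby("CLASS").sum() then adds those booleans up as integer counts per class.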

Upvotes: 1

BENY

Reputation: 323356

Using the difference between count and size:

g = df.groupby('CLASS')

# count() minus size() gives the negative NaN counts, so negate the result.
-g.count().sub(g.size(), axis=0)

          FEATURE1  FEATURE2  FEATURE3
CLASS                              
B             0         0         0
X             1         1         2

We can also turn this into the more generic question of how to count the NaN values in a dataframe, using a for loop over the groups:

# Iterating over a GroupBy yields (label, sub-frame) pairs.
pd.DataFrame({x: y.isna().sum() for x, y in g}).T.drop('CLASS', axis=1)
Out[468]: 
   FEATURE1  FEATURE2  FEATURE3
B         0         0         0
X         1         1         2

Upvotes: 7

cs95

Reputation: 402902

Compute a mask with isna, then group and find the sum:

df.drop(columns='CLASS').isna().groupby(df.CLASS, sort=False).sum().reset_index()

  CLASS  FEATURE1  FEATURE2  FEATURE3
0     X       1.0       1.0       2.0
1     B       0.0       0.0       0.0

Another option is to subtract the count from the size using rsub along the 0th axis for index-aligned subtraction:

df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)

Or,

g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)

       FEATURE1  FEATURE2  FEATURE3
CLASS                              
B             0         0         0
X             1         1         2

There are quite a few good answers, so here are some timeits for your perusal:

df_ = df
df = pd.concat([df_] * 10000)

%timeit df.drop(columns='CLASS').isna().groupby(df.CLASS, sort=False).sum()
%timeit df.set_index('CLASS').isna().groupby(level=0).sum()
%%timeit
g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)

11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Actual performance depends on your data and setup, so your mileage may vary.
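
For a version that runs outside IPython, here is a self-contained sketch of the same benchmark with the timeit module (a sketch, not from the original answer; it uses the non-deprecated groupby(level=0).sum() spelling, and timings will vary by machine):

import timeit

import numpy as np
import pandas as pd

df_ = pd.DataFrame({"CLASS": ["X", "X", "B"],
                    "FEATURE1": ["A", np.nan, "A"],
                    "FEATURE2": [np.nan, "A", "A"],
                    "FEATURE3": [np.nan, np.nan, "A"]})
df = pd.concat([df_] * 10000, ignore_index=True)

def mask_groupby():
    return df.drop(columns='CLASS').isna().groupby(df.CLASS, sort=False).sum()

def set_index_groupby():
    return df.set_index('CLASS').isna().groupby(level=0).sum()

def count_rsub_size():
    g = df.groupby('CLASS')
    return g.count().rsub(g.size(), axis=0)

for fn in (mask_groupby, set_index_groupby, count_rsub_size):
    # Average milliseconds per call over 100 runs.
    ms = timeit.timeit(fn, number=100) / 100 * 1000
    print(f"{fn.__name__}: {ms:.2f} ms per loop")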

Upvotes: 38
