OldMcFartigan
OldMcFartigan

Reputation: 129

groupby shows unobserved values of non-categorical columns

I created this simple example to illustrate my issue:

x = pd.DataFrame({"int_var1": range(3), "int_var2": range(3, 6), "cat_var": pd.Categorical(["a", "b", "a"]), "value": [0.1, 0.2, 0.3]})

it yields this DataFrame:

int_var1    int_var2    cat_var value
 0           3           a       0.1
 1           4           b       0.2
 2           5           a       0.3

where the first two columns are integers, the third column is categorical with two levels, and the fourth column is floats. The issue is that when I try to use groupby followed by agg it seems I only have two options, either I can show no unobserved values like so:

x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = True).agg({"value": "sum"}).fillna(0)

int_var1    int_var2    cat_var   value
 0           3           a         0.1
 1           4           b         0.2
 2           5           a         0.3

or I can show unobserved values for all grouping variables like so:

x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = False).agg({"value": "sum"}).fillna(0)

int_var1    int_var2    cat_var    value
 0           3           a          0.1
                         b          0.0
             4           a          0.0
                         b          0.0
             5           a          0.0
                         b          0.0
 1           3           a          0.0
                         b          0.0
             4           a          0.0
                         b          0.2
             5           a          0.0
                         b          0.0
 2           3           a          0.0
                         b          0.0
             4           a          0.0
                         b          0.0
             5           a          0.3
                         b          0.0

Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?

Upvotes: 0

Views: 171

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150735

You can unstack the level of interest, cat_var in this case:

(x.groupby(['int_var1', 'int_var2', 'cat_var'],observed=True)
  .agg({'value':'sum'})
  .unstack('cat_var',fill_value=0)
)

Output:

                  value     
cat_var               a    b
int_var1 int_var2           
0        3          0.1  0.0
1        4          0.0  0.2
2        5          0.3  0.0

Upvotes: 1

Related Questions