Pandas groupby aggregation yields extraneous groups. Bug?

Question

Having some issues with data manipulation in pandas, seems like maybe a pandas bug? Would love some ideas.

I've got a (index-sorted) dataframe my_df that looks something like this:

                                                                          value1  value2
col0                                 col1             col2 col3     col4
0035ca76-b209-4c4e-9bba-b18459c4dceb 203          positive 173       148   0.0    0.086892
                                                  negative 1156      148   0.0    0.090347
                                                           1157      148   0.0    0.090347
                                                           1158      148   0.0    0.090347
                                                           1159      148   0.0    0.084884
                                                           1160      148   0.0    0.079942
                                                           1161      148   0.0    0.079824
                                                           1162      148   0.0    0.071289
                                                  positive 173       66    0.0    0.079831
                                                  negative 1156      66    0.0    0.082660
                                                           1157      66    0.0    0.082660
                                                           1158      66    0.0    0.082660
                                                           1159      66    0.0    0.084353
                                                           1160      66    0.0    0.076934
                                                           1161      66    0.0    0.076494
                                                           1162      66    0.0    0.070424
00e35aaf-050a-4f09-bf94-df994e4bf681 24           positive 14        38    0.0    0.073936
                                                  negative 134       38    0.0    0.075913
                                                           135       38    0.0    0.075913
                                                           136       38    0.0    0.074403
                                                           137       38    0.0    0.081120
                                                           138       38    0.0    0.078560
                                                           139       38    0.0    0.080680
                                                           140       38    0.0    0.073892
                                                  positive 14        1     0.0    0.051979
                                                  negative 134       1     0.0    0.043818
                                                           135       1     0.0    0.043818
                                                           136       1     0.0    0.049795
                                                           137       1     0.0    0.052171
                                                           138       1     0.0    0.048573
                                                           139       1     0.0    0.045205
                                                           140       1     0.0    0.054696
... more rows for this and other col0 + col1 combos

I'm trying to just compute the sum of "value2" for each unique combination of [col0, col1, col2, col3]. As far as I can tell, the most logical way to do this would be

my_df.groupby(level=list(range(4))).sum()

However, I'm getting really odd results that seem like a pandas bug.

        grouped = my_df.groupby(list(range(4)))
        for name, group in grouped:
            print(group)
            break
        sums = grouped.sum()

Indeed the first group is as I'd expect

                                                                            value1  value2
col0                                 col1             col2 col3     col4
0035ca76-b209-4c4e-9bba-b18459c4d681 199          positive 174       151    0.0     0.089186
                                                                     158    0.0     0.104250

and the number of groups in grouped is correct (you'll have to take my word for that, I've verified other ways) but sums is haywire and has a bajillion extraneous rows

(Pdb) len(grouped)
334
(Pdb) len(sums)
53760
(Pdb) sums[:30]
                                                                             value1     valu2
col0                                 col1         col2     col3      col4
1f11aede-6aed-44ef-9296-004b6269662c 17           positive 7         1       0.0        0.0
                                                                     4       0.0        0.0
                                                                     5       0.0        0.0
                                                                     6       0.0        0.0
                                                                     7       0.0        0.0
                                                                     8       0.0        0.0
                                                                     11      0.0        0.0
                                                                     12      0.0        0.0
                                                                     24      0.0        0.0
                                                                     32      0.0        0.0
                                                                     33      0.0        0.0
                                                                     38      0.0        0.0
                                                                     39      0.0        0.0
                                                                     53      0.0        0.0
                                                                     56      0.0        0.0
                                                                     66      0.0        0.0
                                                                     69      0.0        0.0
                                                                     70      0.0        0.0
                                                                     72      0.0        0.0
                                                                     73      0.0        0.0
                                                                     75      0.0        0.0
                                                                     85      0.0        0.0
                                                                     91      0.0        0.0
                                                                     94      0.0        0.0
                                                                     116     0.0        0.0
                                                                     119     0.0        0.0

The values given in col4 are widely varied throughout the dataframe. It looks like the groupby + aggregation op creates a sum row for every value of col4 in the entire dataframe, as opposed to just values of col4 that actually pertain to each group. In other words, most of these rows don't even have entries in the original dataframe:

(Pdb) my_df.loc[("1f11aede-6aed-44ef-9296-004b6269662c", 17, "positive", 7, 1)]
*** KeyError: ('1f11aede-6aed-44ef-9296-004b6269662c', 17, 'positive', 7, 1)

Any idea what's going on here? These seems totally off-script from what the groupby API and tutorial describe. For example, as far as I know groupby => agg should create one row per group here.

cydonian · Accepted Answer

TLDR: As of pandas 1.2.4, groupby has nonintuitive behavior when one of the indexes is Categorical. Fixed with groupby(..., observed=True).

Ok, this one took me a while. Turns out if one of your indexes is Categorical then groupby has, in my opinion, totally nonintuitive behavior.

# Does some sort of cartesian products of index values if any of the indexes are Categorical.
my_df.groupby(level=list(range(4)))
# Doesn't do the cartesian product, in line with behavior for every other index type.
my_df.groupby(level=list(range(4)), observed=True)

In my case, col2 was Categorical. That is, at some point in my code, I had:

col2_type = pd.api.types.CategoricalDtype(categories=["positive", "negative"], ordered=True)
col2 = ["positive", "negative", "negative", "negative", #...]
# Make col2 categorical
my_df['col2'] = my_df.assign(col2=col2)['col2'].astype(col2_type)

A bit of mea culpa on this one, this behavior is documented (see observed: ). That said, I'm not the only one who has been thoroughly confused by this.

With any luck, this "issue" will be fixed in an upcoming release.

Pandas groupby aggregation yields extraneous groups. Bug?

Answers (2)

Related Questions